linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Add a page cache-backed balloon device driver.
@ 2012-06-26 20:32 Frank Swiderski
  2012-06-26 20:40 ` Rik van Riel
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Frank Swiderski @ 2012-06-26 20:32 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin, riel, Andrea Arcangeli
  Cc: virtualization, linux-kernel, kvm, mikew, Frank Swiderski

This implementation of a virtio balloon driver uses the page cache to
"store" pages that have been released to the host.  The communication
(outside of target counts) is one way--the guest notifies the host when
it adds a page to the page cache, allowing the host to madvise(2) with
MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
(via the regular page reclaim).  This means that inflating the balloon
is similar to the existing balloon mechanism, but the deflate is
different--it re-uses existing Linux kernel functionality to
automatically reclaim.

Signed-off-by: Frank Swiderski <fes@google.com>
---
 drivers/virtio/Kconfig              |   13 +
 drivers/virtio/Makefile             |    1 +
 drivers/virtio/virtio_fileballoon.c |  636 +++++++++++++++++++++++++++++++++++
 include/linux/virtio_balloon.h      |    9 +
 include/linux/virtio_ids.h          |    1 +
 5 files changed, 660 insertions(+), 0 deletions(-)
 create mode 100644 drivers/virtio/virtio_fileballoon.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index f38b17a..cffa2a7 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -35,6 +35,19 @@ config VIRTIO_BALLOON
 
 	 If unsure, say M.
 
+config VIRTIO_FILEBALLOON
+	tristate "Virtio page cache-backed balloon driver"
+	select VIRTIO
+	select VIRTIO_RING
+	---help---
+	 This driver supports decreasing and automatically reclaiming the
+	 memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
+	 tries to maintain a specific target balloon size using the page cache.
+	 This allows the guest to implicitly deflate the balloon by flushing
+	 pages from the cache and touching the page.
+
+	 If unsure, say N.
+
  config VIRTIO_MMIO
  	tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
  	depends on HAS_IOMEM && EXPERIMENTAL
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 5a4c63c..7ca0a3f 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
 obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
+obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c
new file mode 100644
index 0000000..ff252ec
--- /dev/null
+++ b/drivers/virtio/virtio_fileballoon.c
@@ -0,0 +1,636 @@
+/* Virtio file (page cache-backed) balloon implementation, inspired by
+ * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's
+ * implementation.
+ *
+ * This implementation of the virtio balloon driver re-uses the page cache to
+ * allow memory consumed by inflating the balloon to be reclaimed by linux.  It
+ * creates and mounts a bare-bones filesystem containing a single inode.  When
+ * the host requests the balloon to inflate, it does so by "reading" pages at
+ * offsets into the inode mapping's page_tree.  The host is notified when the
+ * pages are added to the page_tree, allowing it (the host) to madvise(2) the
+ * corresponding host memory, reducing the RSS of the virtual machine.  In this
+ * implementation, the host is only notified when a page is added to the
+ * balloon.  Reclaim happens under the existing TTFP logic, which flushes unused
+ * pages in the page cache.  If the host used MADV_DONTNEED, then when the guest
+ * uses the page, the zero page will be mapped in, allowing automatic (and fast,
+ * compared to requiring a host notification via a virtio queue to get memory
+ * back) reclaim.
+ *
+ *  Copyright 2008 Rusty Russell IBM Corporation
+ *  Copyright 2011 Frank Swiderski Google Inc
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include <linux/backing-dev.h>
+#include <linux/delay.h>
+#include <linux/file.h>
+#include <linux/freezer.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/virtio.h>
+#include <linux/virtio_balloon.h>
+#include <linux/writeback.h>
+
+#define VIRTBALLOON_PFN_ARRAY_SIZE 256
+
+struct virtio_balloon {
+	struct virtio_device *vdev;
+	struct virtqueue *inflate_vq;
+
+	/* Where the ballooning thread waits for config to change. */
+	wait_queue_head_t config_change;
+
+	/* The thread servicing the balloon. */
+	struct task_struct *thread;
+
+	/* Waiting for host to ack the pages we released. */
+	struct completion acked;
+
+	/* The array of pfns we tell the Host about. */
+	unsigned int num_pfns;
+	u32 pfns[VIRTBALLOON_PFN_ARRAY_SIZE];
+
+	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+	/* The last page offset read into the mapping's page_tree */
+	unsigned long last_scan_page_array;
+
+	/* The last time a page was reclaimed */
+	unsigned long last_reclaim;
+};
+
+/* Magic number used for the skeleton filesystem in the call to mount_pseudo */
+#define BALLOONFS_MAGIC 0x42414c4c
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_FILE_BALLOON, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+/*
+ * The skeleton filesystem contains a single inode, held by the structure below.
+ * Using the containing structure below allows easy access to the struct
+ * virtio_balloon.
+ */
+static struct balloon_inode {
+	struct inode inode;
+	struct virtio_balloon *vb;
+} the_inode;
+
+/*
+ * balloon_alloc_inode is called when the single inode for the skeleton
+ * filesystem is created in init() with the call to new_inode.
+ */
+static struct inode *balloon_alloc_inode(struct super_block *sb)
+{
+	static bool already_inited;
+	/* We should only ever be called once! */
+	BUG_ON(already_inited);
+	already_inited = true;
+	inode_init_once(&the_inode.inode);
+	return &the_inode.inode;
+}
+
+/* Noop implementation of destroy_inode.  */
+static void balloon_destroy_inode(struct inode *inode)
+{
+}
+
+static int balloon_sync_fs(struct super_block *sb, int wait)
+{
+	return filemap_write_and_wait(the_inode.inode.i_mapping);
+}
+
+static const struct super_operations balloonfs_ops = {
+	.alloc_inode	= balloon_alloc_inode,
+	.destroy_inode	= balloon_destroy_inode,
+	.sync_fs	= balloon_sync_fs,
+};
+
+static const struct dentry_operations balloonfs_dentry_operations = {
+};
+
+/*
+ * balloonfs_writepage is called when linux needs to reclaim memory held using
+ * the balloonfs' page cache.
+ */
+static int balloonfs_writepage(struct page *page, struct writeback_control *wbc)
+{
+	the_inode.vb->last_reclaim = jiffies;
+	SetPageUptodate(page);
+	ClearPageDirty(page);
+	/*
+	 * If the page isn't being flushed from the page allocator, go ahead and
+	 * drop it from the page cache anyway.
+	 */
+	if (!wbc->for_reclaim)
+		delete_from_page_cache(page);
+	unlock_page(page);
+	return 0;
+}
+
+/* Nearly no-op implementation of readpage */
+static int balloonfs_readpage(struct file *file, struct page *page)
+{
+	SetPageUptodate(page);
+	unlock_page(page);
+	return 0;
+}
+
+static const struct address_space_operations balloonfs_aops = {
+	.writepage	= balloonfs_writepage,
+	.readpage	= balloonfs_readpage
+};
+
+static struct backing_dev_info balloonfs_backing_dev_info = {
+	.name		= "balloonfs",
+	.ra_pages	= 0,
+	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK
+};
+
+static struct dentry *balloonfs_mount(struct file_system_type *fs_type,
+			 int flags, const char *dev_name, void *data)
+{
+	struct dentry *root;
+	struct inode *inode;
+	root = mount_pseudo(fs_type, "balloon:", &balloonfs_ops,
+			    &balloonfs_dentry_operations, BALLOONFS_MAGIC);
+	inode = root->d_inode;
+	inode->i_mapping->a_ops = &balloonfs_aops;
+	mapping_set_gfp_mask(inode->i_mapping,
+			     (GFP_HIGHUSER | __GFP_NOMEMALLOC));
+	inode->i_mapping->backing_dev_info = &balloonfs_backing_dev_info;
+	return root;
+}
+
+/* The single mounted skeleton filesystem */
+static struct vfsmount *balloon_mnt __read_mostly;
+
+static struct file_system_type balloon_fs_type = {
+	.name =		"balloonfs",
+	.mount =	balloonfs_mount,
+	.kill_sb =	kill_anon_super,
+};
+
+/* Acknowledges a message from the specified virtqueue. */
+static void balloon_ack(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb;
+	unsigned int len;
+
+	vb = virtqueue_get_buf(vq, &len);
+	if (vb)
+		complete(&vb->acked);
+}
+
+/*
+ * Scans the page_tree for the inode's mapping, looking for an offset that is
+ * currently empty, returning that index (or 0 if it could not fill the
+ * request).
+ */
+static unsigned long find_available_inode_page(struct virtio_balloon *vb)
+{
+	unsigned long radix_index, index, max_scan;
+	struct address_space *mapping = the_inode.inode.i_mapping;
+
+	/*
+	 * This function is a serialized call (only happens on the free-to-host
+	 * thread), so no locking is necessary here.
+	 */
+	index = vb->last_scan_page_array;
+	max_scan = totalram_pages - vb->last_scan_page_array;
+
+	/*
+	 * Scan starting at the last scanned offset, then wrap around if
+	 * necessary.
+	 */
+	if (index == 0)
+		index = 1;
+	rcu_read_lock();
+	radix_index = radix_tree_next_hole(&mapping->page_tree,
+					   index, max_scan);
+	rcu_read_unlock();
+	/*
+	 * If we hit the end of the tree, wrap and search up to the original
+	 * index.
+	 */
+	if (radix_index - index >= max_scan) {
+		if (index != 1) {
+			rcu_read_lock();
+			radix_index = radix_tree_next_hole(&mapping->page_tree,
+							   1, index);
+			rcu_read_unlock();
+			if (radix_index - 1 >= index)
+				radix_index = 0;
+		} else {
+			radix_index = 0;
+		}
+	}
+	vb->last_scan_page_array = radix_index;
+
+	return radix_index;
+}
+
+/* Notifies the host of pages in the specified virtqueue. */
+static int tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	int err;
+	struct scatterlist sg;
+
+	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+
+	init_completion(&vb->acked);
+
+	/* We should always be able to add one buffer to an empty queue. */
+	err = virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL);
+	if (err  < 0)
+		return err;
+	virtqueue_kick(vq);
+
+	/* When host has read buffer, this completes via balloon_ack */
+	wait_for_completion(&vb->acked);
+	return err;
+}
+
+static void fill_balloon(struct virtio_balloon *vb, size_t num)
+{
+	int err;
+
+	/* We can only do one array worth at a time. */
+	num = min(num, ARRAY_SIZE(vb->pfns));
+
+	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
+		struct page *page;
+		unsigned long inode_pfn = find_available_inode_page(vb);
+		/* Should always be able to find a page. */
+		BUG_ON(!inode_pfn);
+		page = read_mapping_page(the_inode.inode.i_mapping, inode_pfn,
+					 NULL);
+		if (IS_ERR(page)) {
+			if (printk_ratelimit())
+				dev_printk(KERN_INFO, &vb->vdev->dev,
+					   "Out of puff! Can't get %zu pages\n",
+					   num);
+			break;
+		}
+
+		/* Set the page to be dirty */
+		set_page_dirty(page);
+
+		vb->pfns[vb->num_pfns] = page_to_pfn(page);
+	}
+
+	/* Didn't get any?  Oh well. */
+	if (vb->num_pfns == 0)
+		return;
+
+	/* Notify the host of the pages we just added to the page_tree. */
+	err = tell_host(vb, vb->inflate_vq);
+
+	for (; vb->num_pfns != 0; vb->num_pfns--) {
+		struct page *page = pfn_to_page(vb->pfns[vb->num_pfns - 1]);
+		/*
+		 * Release our refcount on the page so that it can be reclaimed
+		 * when necessary.
+		 */
+		page_cache_release(page);
+	}
+	__mark_inode_dirty(&the_inode.inode, I_DIRTY_PAGES);
+}
+
+static inline void update_stat(struct virtio_balloon *vb, int idx,
+			       u64 val)
+{
+	BUG_ON(idx >= VIRTIO_BALLOON_S_NR);
+	vb->stats[idx].tag = idx;
+	vb->stats[idx].val = val;
+}
+
+#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
+
+static inline u32 config_pages(struct virtio_balloon *vb);
+static void update_balloon_stats(struct virtio_balloon *vb)
+{
+	unsigned long events[NR_VM_EVENT_ITEMS];
+	struct sysinfo i;
+
+	all_vm_events(events);
+	si_meminfo(&i);
+
+	update_stat(vb, VIRTIO_BALLOON_S_SWAP_IN,
+		    pages_to_bytes(events[PSWPIN]));
+	update_stat(vb, VIRTIO_BALLOON_S_SWAP_OUT,
+		    pages_to_bytes(events[PSWPOUT]));
+	update_stat(vb, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
+	update_stat(vb, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
+
+	/* Total and Free Mem */
+	update_stat(vb, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram));
+	update_stat(vb, VIRTIO_BALLOON_S_MEMTOT, pages_to_bytes(i.totalram));
+}
+
+static void virtballoon_changed(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+
+	wake_up(&vb->config_change);
+}
+
+static inline bool config_need_stats(struct virtio_balloon *vb)
+{
+	u32 v = 0;
+
+	vb->vdev->config->get(vb->vdev,
+			      offsetof(struct virtio_balloon_config,
+				       need_stats),
+			      &v, sizeof(v));
+	return (v != 0);
+}
+
+static inline u32 config_pages(struct virtio_balloon *vb)
+{
+	u32 v = 0;
+
+	vb->vdev->config->get(vb->vdev,
+			      offsetof(struct virtio_balloon_config, num_pages),
+			      &v, sizeof(v));
+	return v;
+}
+
+static inline s64 towards_target(struct virtio_balloon *vb)
+{
+	struct address_space *mapping = the_inode.inode.i_mapping;
+	u32 v = config_pages(vb);
+
+	return (s64)v - (mapping ? mapping->nrpages : 0);
+}
+
+static void update_balloon_size(struct virtio_balloon *vb)
+{
+	struct address_space *mapping = the_inode.inode.i_mapping;
+	__le32 actual = cpu_to_le32((mapping ? mapping->nrpages : 0));
+
+	vb->vdev->config->set(vb->vdev,
+			      offsetof(struct virtio_balloon_config, actual),
+			      &actual, sizeof(actual));
+}
+
+static void update_free_and_total(struct virtio_balloon *vb)
+{
+	struct sysinfo i;
+	u32 value;
+
+	si_meminfo(&i);
+
+	update_balloon_stats(vb);
+	value = i.totalram;
+	vb->vdev->config->set(vb->vdev,
+			      offsetof(struct virtio_balloon_config,
+				       pages_total),
+			      &value, sizeof(value));
+	value = i.freeram;
+	vb->vdev->config->set(vb->vdev,
+			      offsetof(struct virtio_balloon_config,
+				       pages_free),
+			      &value, sizeof(value));
+	value = 0;
+	vb->vdev->config->set(vb->vdev,
+			      offsetof(struct virtio_balloon_config,
+				       need_stats),
+			      &value, sizeof(value));
+}
+
+static int balloon(void *_vballoon)
+{
+	struct virtio_balloon *vb = _vballoon;
+
+	set_freezable();
+	while (!kthread_should_stop()) {
+		s64 diff;
+		try_to_freeze();
+		wait_event_interruptible(vb->config_change,
+					 (diff = towards_target(vb)) > 0
+					 || config_need_stats(vb)
+					 || kthread_should_stop()
+					 || freezing(current));
+		if (config_need_stats(vb))
+			update_free_and_total(vb);
+		if (diff > 0) {
+			unsigned long reclaim_time = vb->last_reclaim + 2 * HZ;
+			/*
+			 * Don't fill the balloon if a page reclaim happened in
+			 * the past 2 seconds.
+			 */
+			if (time_after_eq(reclaim_time, jiffies)) {
+				/* Inflating too fast--sleep and skip. */
+				msleep(500);
+			} else {
+				fill_balloon(vb, diff);
+			}
+		} else if (diff < 0 && config_pages(vb) == 0) {
+			/*
+			 * Here we are specifically looking to detect the case
+			 * where there are pages in the page cache, but the
+			 * device wants us to go to 0.  This is used in save/
+			 * restore since the host device doesn't keep track of
+			 * PFNs, and must flush the page cache on restore
+			 * (which loses the context of the original device
+			 * instance).  However, we still suggest syncing the
+			 * diff so that we can get within the target range.
+			 */
+			s64 nr_to_write =
+				(!config_pages(vb) ? LONG_MAX : -diff);
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_ALL,
+				.nr_to_write = nr_to_write,
+				.range_start = 0,
+				.range_end = LLONG_MAX,
+			};
+			sync_inode(&the_inode.inode, &wbc);
+		}
+		update_balloon_size(vb);
+	}
+	return 0;
+}
+
+static ssize_t virtballoon_attr_show(struct device *dev,
+				     struct device_attribute *attr,
+				     char *buf);
+
+static DEVICE_ATTR(total_memory, 0644,
+	virtballoon_attr_show, NULL);
+
+static DEVICE_ATTR(free_memory, 0644,
+	virtballoon_attr_show, NULL);
+
+static DEVICE_ATTR(target_pages, 0644,
+	virtballoon_attr_show, NULL);
+
+static DEVICE_ATTR(actual_pages, 0644,
+	virtballoon_attr_show, NULL);
+
+static struct attribute *virtballoon_attrs[] = {
+	&dev_attr_total_memory.attr,
+	&dev_attr_free_memory.attr,
+	&dev_attr_target_pages.attr,
+	&dev_attr_actual_pages.attr,
+	NULL
+};
+static struct attribute_group virtballoon_attr_group = {
+	.name	= "virtballoon",
+	.attrs	= virtballoon_attrs,
+};
+
+static ssize_t virtballoon_attr_show(struct device *dev,
+				     struct device_attribute *attr,
+				     char *buf)
+{
+	struct address_space *mapping = the_inode.inode.i_mapping;
+	struct virtio_device *vdev = container_of(dev, struct virtio_device,
+						  dev);
+	struct virtio_balloon *vb = vdev->priv;
+	unsigned long long value = 0;
+	if (attr == &dev_attr_total_memory)
+		value = vb->stats[VIRTIO_BALLOON_S_MEMTOT].val;
+	else if (attr == &dev_attr_free_memory)
+		value = vb->stats[VIRTIO_BALLOON_S_MEMFREE].val;
+	else if (attr == &dev_attr_target_pages)
+		value = config_pages(vb);
+	else if (attr == &dev_attr_actual_pages)
+		value = cpu_to_le32((mapping ? mapping->nrpages : 0));
+	return sprintf(buf, "%llu\n", value);
+}
+
+static int virtballoon_probe(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb;
+	struct virtqueue *vq[1];
+	vq_callback_t *callback = balloon_ack;
+	const char *name = "inflate";
+	int err;
+
+	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
+	if (!vb) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	init_waitqueue_head(&vb->config_change);
+	vb->vdev = vdev;
+
+	/* We use one virtqueue: inflate */
+	err = vdev->config->find_vqs(vdev, 1, vq, &callback, &name);
+	if (err)
+		goto out_free_vb;
+
+	vb->inflate_vq = vq[0];
+
+	err = sysfs_create_group(&vdev->dev.kobj, &virtballoon_attr_group);
+	if (err) {
+		pr_err("Failed to create virtballoon sysfs node\n");
+		goto out_free_vb;
+	}
+
+	vb->last_scan_page_array = 0;
+	vb->last_reclaim = 0;
+	the_inode.vb = vb;
+
+	vb->thread = kthread_run(balloon, vb, "vballoon");
+	if (IS_ERR(vb->thread)) {
+		err = PTR_ERR(vb->thread);
+		goto out_del_vqs;
+	}
+
+	return 0;
+
+out_del_vqs:
+	vdev->config->del_vqs(vdev);
+out_free_vb:
+	kfree(vb);
+out:
+	return err;
+}
+
+static void __devexit virtballoon_remove(struct virtio_device *vdev)
+{
+	struct virtio_balloon *vb = vdev->priv;
+
+	kthread_stop(vb->thread);
+
+	sysfs_remove_group(&vdev->dev.kobj, &virtballoon_attr_group);
+
+	/* Now we reset the device so we can clean up the queues. */
+	vdev->config->reset(vdev);
+
+	vdev->config->del_vqs(vdev);
+	kfree(vb);
+}
+
+static struct virtio_driver virtio_balloon_driver = {
+	.feature_table		= NULL,
+	.feature_table_size	= 0,
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtballoon_probe,
+	.remove			= __devexit_p(virtballoon_remove),
+	.config_changed		= virtballoon_changed,
+};
+
+static int __init init(void)
+{
+	int err = register_filesystem(&balloon_fs_type);
+	if (err)
+		goto out;
+
+	balloon_mnt = kern_mount(&balloon_fs_type);
+	if (IS_ERR(balloon_mnt)) {
+		err = PTR_ERR(balloon_mnt);
+		goto out_filesystem;
+	}
+
+	err = register_virtio_driver(&virtio_balloon_driver);
+	if (err)
+		goto out_filesystem;
+
+	goto out;
+
+out_filesystem:
+	unregister_filesystem(&balloon_fs_type);
+
+out:
+	return err;
+}
+
+static void __exit fini(void)
+{
+	if (balloon_mnt) {
+		unregister_filesystem(&balloon_fs_type);
+		balloon_mnt = NULL;
+	}
+	unregister_virtio_driver(&virtio_balloon_driver);
+}
+module_init(init);
+module_exit(fini);
+
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio file (page cache-backed) balloon driver");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/virtio_balloon.h b/include/linux/virtio_balloon.h
index 652dc8b..2be9a02 100644
--- a/include/linux/virtio_balloon.h
+++ b/include/linux/virtio_balloon.h
@@ -41,6 +41,15 @@ struct virtio_balloon_config
 	__le32 num_pages;
 	/* Number of pages we've actually got in balloon. */
 	__le32 actual;
+#if defined(CONFIG_VIRTIO_FILEBALLOON) ||\
+	defined(CONFIG_VIRTIO_FILEBALLOON_MODULE)
+	/* Total pages on this system. */
+	__le32 pages_total;
+	/* Free pages on this system. */
+	__le32 pages_free;
+	/* If the device needs pages_total/pages_free updated. */
+	__le32 need_stats;
+#endif
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
index 7529b85..2f081d7 100644
--- a/include/linux/virtio_ids.h
+++ b/include/linux/virtio_ids.h
@@ -37,5 +37,6 @@
 #define VIRTIO_ID_RPMSG		7 /* virtio remote processor messaging */
 #define VIRTIO_ID_SCSI		8 /* virtio scsi */
 #define VIRTIO_ID_9P		9 /* 9p virtio console */
+#define VIRTIO_ID_FILE_BALLOON	10 /* virtio file-backed balloon */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 20:32 [PATCH] Add a page cache-backed balloon device driver Frank Swiderski
@ 2012-06-26 20:40 ` Rik van Riel
  2012-06-26 21:31   ` Frank Swiderski
  2012-06-26 21:41 ` Michael S. Tsirkin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2012-06-26 20:40 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, Michael S. Tsirkin, Andrea Arcangeli,
	virtualization, linux-kernel, kvm, mikew, Ying Han,
	Rafael Aquini

On 06/26/2012 04:32 PM, Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host.  The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim).  This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.
>
> Signed-off-by: Frank Swiderski<fes@google.com>

It is a great idea, but how can this memory balancing
possibly work if someone uses memory cgroups inside a
guest?

Having said that, we currently do not have proper
memory reclaim balancing between cgroups at all, so
requiring that of this balloon driver would be
unreasonable.

The code looks good to me, my only worry is the
code duplication. We now have 5 balloon drivers,
for 4 hypervisors, all implementing everything
from scratch...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 20:40 ` Rik van Riel
@ 2012-06-26 21:31   ` Frank Swiderski
  2012-06-26 21:45     ` Rik van Riel
  2012-06-26 21:47     ` Michael S. Tsirkin
  0 siblings, 2 replies; 30+ messages in thread
From: Frank Swiderski @ 2012-06-26 21:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rusty Russell, Michael S. Tsirkin, Andrea Arcangeli,
	virtualization, linux-kernel, kvm, mikew, Ying Han,
	Rafael Aquini

On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel <riel@redhat.com> wrote:
> On 06/26/2012 04:32 PM, Frank Swiderski wrote:
>>
>> This implementation of a virtio balloon driver uses the page cache to
>> "store" pages that have been released to the host.  The communication
>> (outside of target counts) is one way--the guest notifies the host when
>> it adds a page to the page cache, allowing the host to madvise(2) with
>> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> (via the regular page reclaim).  This means that inflating the balloon
>> is similar to the existing balloon mechanism, but the deflate is
>> different--it re-uses existing Linux kernel functionality to
>> automatically reclaim.
>>
>> Signed-off-by: Frank Swiderski<fes@google.com>
>
>
> It is a great idea, but how can this memory balancing
> possibly work if someone uses memory cgroups inside a
> guest?

Thanks and good point--this isn't something that I considered in the
implementation.

> Having said that, we currently do not have proper
> memory reclaim balancing between cgroups at all, so
> requiring that of this balloon driver would be
> unreasonable.
>
> The code looks good to me, my only worry is the
> code duplication. We now have 5 balloon drivers,
> for 4 hypervisors, all implementing everything
> from scratch...

Do you have any recommendations on this?  I could (I think reasonably
so) modify the existing virtio_balloon.c and have it change behavior
based on a feature bit or other configuration.  I'm not sure that
really addresses the root of what you're pointing out--it's still
adding a different implementation, but doing so as an extension of an
existing one.

fes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 20:32 [PATCH] Add a page cache-backed balloon device driver Frank Swiderski
  2012-06-26 20:40 ` Rik van Riel
@ 2012-06-26 21:41 ` Michael S. Tsirkin
  2012-06-27  2:56   ` Rusty Russell
  2012-06-27  9:40 ` Amit Shah
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-06-26 21:41 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, riel, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew

On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host.  The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim).  This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.
> 
> Signed-off-by: Frank Swiderski <fes@google.com>

I'm pondering this:

Should it really be a separate driver/device ID?
If it behaves the same from host POV, maybe it
should be up to the guest how to inflate/deflate
the balloon internally?

> ---
>  drivers/virtio/Kconfig              |   13 +
>  drivers/virtio/Makefile             |    1 +
>  drivers/virtio/virtio_fileballoon.c |  636 +++++++++++++++++++++++++++++++++++
>  include/linux/virtio_balloon.h      |    9 +
>  include/linux/virtio_ids.h          |    1 +
>  5 files changed, 660 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/virtio/virtio_fileballoon.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index f38b17a..cffa2a7 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>  
>  	 If unsure, say M.
>  
> +config VIRTIO_FILEBALLOON
> +	tristate "Virtio page cache-backed balloon driver"
> +	select VIRTIO
> +	select VIRTIO_RING
> +	---help---
> +	 This driver supports decreasing and automatically reclaiming the
> +	 memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
> +	 tries to maintain a specific target balloon size using the page cache.
> +	 This allows the guest to implicitly deflate the balloon by flushing
> +	 pages from the cache and touching the page.
> +
> +	 If unsure, say N.
> +
>   config VIRTIO_MMIO
>   	tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
>   	depends on HAS_IOMEM && EXPERIMENTAL
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 5a4c63c..7ca0a3f 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
> diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c
> new file mode 100644
> index 0000000..ff252ec
> --- /dev/null
> +++ b/drivers/virtio/virtio_fileballoon.c
> @@ -0,0 +1,636 @@
> +/* Virtio file (page cache-backed) balloon implementation, inspired by
> + * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's
> + * implementation.
> + *
> + * This implementation of the virtio balloon driver re-uses the page cache to
> + * allow memory consumed by inflating the balloon to be reclaimed by linux.  It
> + * creates and mounts a bare-bones filesystem containing a single inode.  When
> + * the host requests the balloon to inflate, it does so by "reading" pages at
> + * offsets into the inode mapping's page_tree.  The host is notified when the
> + * pages are added to the page_tree, allowing it (the host) to madvise(2) the
> + * corresponding host memory, reducing the RSS of the virtual machine.  In this
> + * implementation, the host is only notified when a page is added to the
> + * balloon.  Reclaim happens under the existing TTFP logic, which flushes unused
> + * pages in the page cache.  If the host used MADV_DONTNEED, then when the guest
> + * uses the page, the zero page will be mapped in, allowing automatic (and fast,
> + * compared to requiring a host notification via a virtio queue to get memory
> + * back) reclaim.
> + *
> + *  Copyright 2008 Rusty Russell IBM Corporation
> + *  Copyright 2011 Frank Swiderski Google Inc
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> + */
> +#include <linux/backing-dev.h>
> +#include <linux/delay.h>
> +#include <linux/file.h>
> +#include <linux/freezer.h>
> +#include <linux/fs.h>
> +#include <linux/jiffies.h>
> +#include <linux/kthread.h>
> +#include <linux/module.h>
> +#include <linux/mount.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +#include <linux/swap.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_balloon.h>
> +#include <linux/writeback.h>
> +
> +#define VIRTBALLOON_PFN_ARRAY_SIZE 256
> +
> +struct virtio_balloon {
> +	struct virtio_device *vdev;
> +	struct virtqueue *inflate_vq;
> +
> +	/* Where the ballooning thread waits for config to change. */
> +	wait_queue_head_t config_change;
> +
> +	/* The thread servicing the balloon. */
> +	struct task_struct *thread;
> +
> +	/* Waiting for host to ack the pages we released. */
> +	struct completion acked;
> +
> +	/* The array of pfns we tell the Host about. */
> +	unsigned int num_pfns;
> +	u32 pfns[VIRTBALLOON_PFN_ARRAY_SIZE];
> +
> +	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> +
> +	/* The last page offset read into the mapping's page_tree */
> +	unsigned long last_scan_page_array;
> +
> +	/* The last time a page was reclaimed */
> +	unsigned long last_reclaim;
> +};
> +
> +/* Magic number used for the skeleton filesystem in the call to mount_pseudo */
> +#define BALLOONFS_MAGIC 0x42414c4c
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_FILE_BALLOON, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +/*
> + * The skeleton filesystem contains a single inode, held by the structure below.
> + * Using the containing structure below allows easy access to the struct
> + * virtio_balloon.
> + */
> +static struct balloon_inode {
> +	struct inode inode;
> +	struct virtio_balloon *vb;
> +} the_inode;
> +
> +/*
> + * balloon_alloc_inode is called when the single inode for the skeleton
> + * filesystem is created in init() with the call to new_inode.
> + */
> +static struct inode *balloon_alloc_inode(struct super_block *sb)
> +{
> +	static bool already_inited;
> +	/* We should only ever be called once! */
> +	BUG_ON(already_inited);
> +	already_inited = true;
> +	inode_init_once(&the_inode.inode);
> +	return &the_inode.inode;
> +}
> +
> +/* Noop implementation of destroy_inode.  */
> +static void balloon_destroy_inode(struct inode *inode)
> +{
> +}
> +
> +static int balloon_sync_fs(struct super_block *sb, int wait)
> +{
> +	return filemap_write_and_wait(the_inode.inode.i_mapping);
> +}
> +
> +static const struct super_operations balloonfs_ops = {
> +	.alloc_inode	= balloon_alloc_inode,
> +	.destroy_inode	= balloon_destroy_inode,
> +	.sync_fs	= balloon_sync_fs,
> +};
> +
> +static const struct dentry_operations balloonfs_dentry_operations = {
> +};
> +
> +/*
> + * balloonfs_writepage is called when linux needs to reclaim memory held using
> + * the balloonfs' page cache.
> + */
> +static int balloonfs_writepage(struct page *page, struct writeback_control *wbc)
> +{
> +	the_inode.vb->last_reclaim = jiffies;
> +	SetPageUptodate(page);
> +	ClearPageDirty(page);
> +	/*
> +	 * If the page isn't being flushed from the page allocator, go ahead and
> +	 * drop it from the page cache anyway.
> +	 */
> +	if (!wbc->for_reclaim)
> +		delete_from_page_cache(page);
> +	unlock_page(page);
> +	return 0;
> +}
> +
> +/* Nearly no-op implementation of readpage */
> +static int balloonfs_readpage(struct file *file, struct page *page)
> +{
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +	return 0;
> +}
> +
> +static const struct address_space_operations balloonfs_aops = {
> +	.writepage	= balloonfs_writepage,
> +	.readpage	= balloonfs_readpage
> +};
> +
> +static struct backing_dev_info balloonfs_backing_dev_info = {
> +	.name		= "balloonfs",
> +	.ra_pages	= 0,
> +	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK
> +};
> +
> +static struct dentry *balloonfs_mount(struct file_system_type *fs_type,
> +			 int flags, const char *dev_name, void *data)
> +{
> +	struct dentry *root;
> +	struct inode *inode;
> +	root = mount_pseudo(fs_type, "balloon:", &balloonfs_ops,
> +			    &balloonfs_dentry_operations, BALLOONFS_MAGIC);
> +	inode = root->d_inode;
> +	inode->i_mapping->a_ops = &balloonfs_aops;
> +	mapping_set_gfp_mask(inode->i_mapping,
> +			     (GFP_HIGHUSER | __GFP_NOMEMALLOC));
> +	inode->i_mapping->backing_dev_info = &balloonfs_backing_dev_info;
> +	return root;
> +}
> +
> +/* The single mounted skeleton filesystem */
> +static struct vfsmount *balloon_mnt __read_mostly;
> +
> +static struct file_system_type balloon_fs_type = {
> +	.name =		"balloonfs",
> +	.mount =	balloonfs_mount,
> +	.kill_sb =	kill_anon_super,
> +};
> +
> +/* Acknowledges a message from the specified virtqueue. */
> +static void balloon_ack(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb;
> +	unsigned int len;
> +
> +	vb = virtqueue_get_buf(vq, &len);
> +	if (vb)
> +		complete(&vb->acked);
> +}
> +
> +/*
> + * Scans the page_tree for the inode's mapping, looking for an offset that is
> + * currently empty, returning that index (or 0 if it could not fill the
> + * request).
> + */
> +static unsigned long find_available_inode_page(struct virtio_balloon *vb)
> +{
> +	unsigned long radix_index, index, max_scan;
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +
> +	/*
> +	 * This function is a serialized call (only happens on the free-to-host
> +	 * thread), so no locking is necessary here.
> +	 */
> +	index = vb->last_scan_page_array;
> +	max_scan = totalram_pages - vb->last_scan_page_array;
> +
> +	/*
> +	 * Scan starting at the last scanned offset, then wrap around if
> +	 * necessary.
> +	 */
> +	if (index == 0)
> +		index = 1;
> +	rcu_read_lock();
> +	radix_index = radix_tree_next_hole(&mapping->page_tree,
> +					   index, max_scan);
> +	rcu_read_unlock();
> +	/*
> +	 * If we hit the end of the tree, wrap and search up to the original
> +	 * index.
> +	 */
> +	if (radix_index - index >= max_scan) {
> +		if (index != 1) {
> +			rcu_read_lock();
> +			radix_index = radix_tree_next_hole(&mapping->page_tree,
> +							   1, index);
> +			rcu_read_unlock();
> +			if (radix_index - 1 >= index)
> +				radix_index = 0;
> +		} else {
> +			radix_index = 0;
> +		}
> +	}
> +	vb->last_scan_page_array = radix_index;
> +
> +	return radix_index;
> +}
> +
> +/* Notifies the host of pages in the specified virtqueue. */
> +static int tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	int err;
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +
> +	init_completion(&vb->acked);
> +
> +	/* We should always be able to add one buffer to an empty queue. */
> +	err = virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL);
> +	if (err  < 0)
> +		return err;
> +	virtqueue_kick(vq);
> +
> +	/* When host has read buffer, this completes via balloon_ack */
> +	wait_for_completion(&vb->acked);
> +	return err;
> +}
> +
> +static void fill_balloon(struct virtio_balloon *vb, size_t num)
> +{
> +	int err;
> +
> +	/* We can only do one array worth at a time. */
> +	num = min(num, ARRAY_SIZE(vb->pfns));
> +
> +	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
> +		struct page *page;
> +		unsigned long inode_pfn = find_available_inode_page(vb);
> +		/* Should always be able to find a page. */
> +		BUG_ON(!inode_pfn);
> +		page = read_mapping_page(the_inode.inode.i_mapping, inode_pfn,
> +					 NULL);
> +		if (IS_ERR(page)) {
> +			if (printk_ratelimit())
> +				dev_printk(KERN_INFO, &vb->vdev->dev,
> +					   "Out of puff! Can't get %zu pages\n",
> +					   num);
> +			break;
> +		}
> +
> +		/* Set the page to be dirty */
> +		set_page_dirty(page);
> +
> +		vb->pfns[vb->num_pfns] = page_to_pfn(page);
> +	}
> +
> +	/* Didn't get any?  Oh well. */
> +	if (vb->num_pfns == 0)
> +		return;
> +
> +	/* Notify the host of the pages we just added to the page_tree. */
> +	err = tell_host(vb, vb->inflate_vq);
> +
> +	for (; vb->num_pfns != 0; vb->num_pfns--) {
> +		struct page *page = pfn_to_page(vb->pfns[vb->num_pfns - 1]);
> +		/*
> +		 * Release our refcount on the page so that it can be reclaimed
> +		 * when necessary.
> +		 */
> +		page_cache_release(page);
> +	}
> +	__mark_inode_dirty(&the_inode.inode, I_DIRTY_PAGES);
> +}
> +
> +static inline void update_stat(struct virtio_balloon *vb, int idx,
> +			       u64 val)
> +{
> +	BUG_ON(idx >= VIRTIO_BALLOON_S_NR);
> +	vb->stats[idx].tag = idx;
> +	vb->stats[idx].val = val;
> +}
> +
> +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
> +
> +static inline u32 config_pages(struct virtio_balloon *vb);
> +static void update_balloon_stats(struct virtio_balloon *vb)
> +{
> +	unsigned long events[NR_VM_EVENT_ITEMS];
> +	struct sysinfo i;
> +
> +	all_vm_events(events);
> +	si_meminfo(&i);
> +
> +	update_stat(vb, VIRTIO_BALLOON_S_SWAP_IN,
> +		    pages_to_bytes(events[PSWPIN]));
> +	update_stat(vb, VIRTIO_BALLOON_S_SWAP_OUT,
> +		    pages_to_bytes(events[PSWPOUT]));
> +	update_stat(vb, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
> +	update_stat(vb, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> +
> +	/* Total and Free Mem */
> +	update_stat(vb, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram));
> +	update_stat(vb, VIRTIO_BALLOON_S_MEMTOT, pages_to_bytes(i.totalram));
> +}
> +
> +static void virtballoon_changed(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = vdev->priv;
> +
> +	wake_up(&vb->config_change);
> +}
> +
> +static inline bool config_need_stats(struct virtio_balloon *vb)
> +{
> +	u32 v = 0;
> +
> +	vb->vdev->config->get(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       need_stats),
> +			      &v, sizeof(v));
> +	return (v != 0);
> +}
> +
> +static inline u32 config_pages(struct virtio_balloon *vb)
> +{
> +	u32 v = 0;
> +
> +	vb->vdev->config->get(vb->vdev,
> +			      offsetof(struct virtio_balloon_config, num_pages),
> +			      &v, sizeof(v));
> +	return v;
> +}
> +
> +static inline s64 towards_target(struct virtio_balloon *vb)
> +{
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +	u32 v = config_pages(vb);
> +
> +	return (s64)v - (mapping ? mapping->nrpages : 0);
> +}
> +
> +static void update_balloon_size(struct virtio_balloon *vb)
> +{
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +	__le32 actual = cpu_to_le32((mapping ? mapping->nrpages : 0));
> +
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config, actual),
> +			      &actual, sizeof(actual));
> +}
> +
> +static void update_free_and_total(struct virtio_balloon *vb)
> +{
> +	struct sysinfo i;
> +	u32 value;
> +
> +	si_meminfo(&i);
> +
> +	update_balloon_stats(vb);
> +	value = i.totalram;
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       pages_total),
> +			      &value, sizeof(value));
> +	value = i.freeram;
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       pages_free),
> +			      &value, sizeof(value));
> +	value = 0;
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       need_stats),
> +			      &value, sizeof(value));
> +}
> +
> +static int balloon(void *_vballoon)
> +{
> +	struct virtio_balloon *vb = _vballoon;
> +
> +	set_freezable();
> +	while (!kthread_should_stop()) {
> +		s64 diff;
> +		try_to_freeze();
> +		wait_event_interruptible(vb->config_change,
> +					 (diff = towards_target(vb)) > 0
> +					 || config_need_stats(vb)
> +					 || kthread_should_stop()
> +					 || freezing(current));
> +		if (config_need_stats(vb))
> +			update_free_and_total(vb);
> +		if (diff > 0) {
> +			unsigned long reclaim_time = vb->last_reclaim + 2 * HZ;
> +			/*
> +			 * Don't fill the balloon if a page reclaim happened in
> +			 * the past 2 seconds.
> +			 */
> +			if (time_after_eq(reclaim_time, jiffies)) {
> +				/* Inflating too fast--sleep and skip. */
> +				msleep(500);
> +			} else {
> +				fill_balloon(vb, diff);
> +			}
> +		} else if (diff < 0 && config_pages(vb) == 0) {
> +			/*
> +			 * Here we are specifically looking to detect the case
> +			 * where there are pages in the page cache, but the
> +			 * device wants us to go to 0.  This is used in save/
> +			 * restore since the host device doesn't keep track of
> +			 * PFNs, and must flush the page cache on restore
> +			 * (which loses the context of the original device
> +			 * instance).  However, we still suggest syncing the
> +			 * diff so that we can get within the target range.
> +			 */
> +			s64 nr_to_write =
> +				(!config_pages(vb) ? LONG_MAX : -diff);
> +			struct writeback_control wbc = {
> +				.sync_mode = WB_SYNC_ALL,
> +				.nr_to_write = nr_to_write,
> +				.range_start = 0,
> +				.range_end = LLONG_MAX,
> +			};
> +			sync_inode(&the_inode.inode, &wbc);
> +		}
> +		update_balloon_size(vb);
> +	}
> +	return 0;
> +}
> +
> +static ssize_t virtballoon_attr_show(struct device *dev,
> +				     struct device_attribute *attr,
> +				     char *buf);
> +
> +static DEVICE_ATTR(total_memory, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static DEVICE_ATTR(free_memory, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static DEVICE_ATTR(target_pages, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static DEVICE_ATTR(actual_pages, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static struct attribute *virtballoon_attrs[] = {
> +	&dev_attr_total_memory.attr,
> +	&dev_attr_free_memory.attr,
> +	&dev_attr_target_pages.attr,
> +	&dev_attr_actual_pages.attr,
> +	NULL
> +};
> +static struct attribute_group virtballoon_attr_group = {
> +	.name	= "virtballoon",
> +	.attrs	= virtballoon_attrs,
> +};
> +
> +static ssize_t virtballoon_attr_show(struct device *dev,
> +				     struct device_attribute *attr,
> +				     char *buf)
> +{
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +	struct virtio_device *vdev = container_of(dev, struct virtio_device,
> +						  dev);
> +	struct virtio_balloon *vb = vdev->priv;
> +	unsigned long long value = 0;
> +	if (attr == &dev_attr_total_memory)
> +		value = vb->stats[VIRTIO_BALLOON_S_MEMTOT].val;
> +	else if (attr == &dev_attr_free_memory)
> +		value = vb->stats[VIRTIO_BALLOON_S_MEMFREE].val;
> +	else if (attr == &dev_attr_target_pages)
> +		value = config_pages(vb);
> +	else if (attr == &dev_attr_actual_pages)
> +		value = cpu_to_le32((mapping ? mapping->nrpages : 0));
> +	return sprintf(buf, "%llu\n", value);
> +}
> +
> +static int virtballoon_probe(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb;
> +	struct virtqueue *vq[1];
> +	vq_callback_t *callback = balloon_ack;
> +	const char *name = "inflate";
> +	int err;
> +
> +	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> +	if (!vb) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	init_waitqueue_head(&vb->config_change);
> +	vb->vdev = vdev;
> +
> +	/* We use one virtqueue: inflate */
> +	err = vdev->config->find_vqs(vdev, 1, vq, &callback, &name);
> +	if (err)
> +		goto out_free_vb;
> +
> +	vb->inflate_vq = vq[0];
> +
> +	err = sysfs_create_group(&vdev->dev.kobj, &virtballoon_attr_group);
> +	if (err) {
> +		pr_err("Failed to create virtballoon sysfs node\n");
> +		goto out_free_vb;
> +	}
> +
> +	vb->last_scan_page_array = 0;
> +	vb->last_reclaim = 0;
> +	the_inode.vb = vb;
> +
> +	vb->thread = kthread_run(balloon, vb, "vballoon");
> +	if (IS_ERR(vb->thread)) {
> +		err = PTR_ERR(vb->thread);
> +		goto out_del_vqs;
> +	}
> +
> +	return 0;
> +
> +out_del_vqs:
> +	vdev->config->del_vqs(vdev);
> +out_free_vb:
> +	kfree(vb);
> +out:
> +	return err;
> +}
> +
> +static void __devexit virtballoon_remove(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = vdev->priv;
> +
> +	kthread_stop(vb->thread);
> +
> +	sysfs_remove_group(&vdev->dev.kobj, &virtballoon_attr_group);
> +
> +	/* Now we reset the device so we can clean up the queues. */
> +	vdev->config->reset(vdev);
> +
> +	vdev->config->del_vqs(vdev);
> +	kfree(vb);
> +}
> +
> +static struct virtio_driver virtio_balloon_driver = {
> +	.feature_table		= NULL,
> +	.feature_table_size	= 0,
> +	.driver.name		= KBUILD_MODNAME,
> +	.driver.owner		= THIS_MODULE,
> +	.id_table		= id_table,
> +	.probe			= virtballoon_probe,
> +	.remove			= __devexit_p(virtballoon_remove),
> +	.config_changed		= virtballoon_changed,
> +};
> +
> +static int __init init(void)
> +{
> +	int err = register_filesystem(&balloon_fs_type);
> +	if (err)
> +		goto out;
> +
> +	balloon_mnt = kern_mount(&balloon_fs_type);
> +	if (IS_ERR(balloon_mnt)) {
> +		err = PTR_ERR(balloon_mnt);
> +		goto out_filesystem;
> +	}
> +
> +	err = register_virtio_driver(&virtio_balloon_driver);
> +	if (err)
> +		goto out_filesystem;
> +
> +	goto out;
> +
> +out_filesystem:
> +	unregister_filesystem(&balloon_fs_type);
> +
> +out:
> +	return err;
> +}
> +
> +static void __exit fini(void)
> +{
> +	if (balloon_mnt) {
> +		unregister_filesystem(&balloon_fs_type);
> +		balloon_mnt = NULL;
> +	}
> +	unregister_virtio_driver(&virtio_balloon_driver);
> +}
> +module_init(init);
> +module_exit(fini);
> +
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio file (page cache-backed) balloon driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/virtio_balloon.h b/include/linux/virtio_balloon.h
> index 652dc8b..2be9a02 100644
> --- a/include/linux/virtio_balloon.h
> +++ b/include/linux/virtio_balloon.h
> @@ -41,6 +41,15 @@ struct virtio_balloon_config
>  	__le32 num_pages;
>  	/* Number of pages we've actually got in balloon. */
>  	__le32 actual;
> +#if defined(CONFIG_VIRTIO_FILEBALLOON) ||\
> +	defined(CONFIG_VIRTIO_FILEBALLOON_MODULE)
> +	/* Total pages on this system. */
> +	__le32 pages_total;
> +	/* Free pages on this system. */
> +	__le32 pages_free;
> +	/* If the device needs pages_total/pages_free updated. */
> +	__le32 need_stats;
> +#endif
>  };
>  
>  #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
> diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
> index 7529b85..2f081d7 100644
> --- a/include/linux/virtio_ids.h
> +++ b/include/linux/virtio_ids.h
> @@ -37,5 +37,6 @@
>  #define VIRTIO_ID_RPMSG		7 /* virtio remote processor messaging */
>  #define VIRTIO_ID_SCSI		8 /* virtio scsi */
>  #define VIRTIO_ID_9P		9 /* 9p virtio console */
> +#define VIRTIO_ID_FILE_BALLOON	10 /* virtio file-backed balloon */
>  
>  #endif /* _LINUX_VIRTIO_IDS_H */
> -- 
> 1.7.7.3

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 21:31   ` Frank Swiderski
@ 2012-06-26 21:45     ` Rik van Riel
  2012-06-26 23:45       ` Frank Swiderski
  2012-06-26 21:47     ` Michael S. Tsirkin
  1 sibling, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2012-06-26 21:45 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, Michael S. Tsirkin, Andrea Arcangeli,
	virtualization, linux-kernel, kvm, mikew, Ying Han,
	Rafael Aquini

On 06/26/2012 05:31 PM, Frank Swiderski wrote:
> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel<riel@redhat.com>  wrote:

>> The code looks good to me, my only worry is the
>> code duplication. We now have 5 balloon drivers,
>> for 4 hypervisors, all implementing everything
>> from scratch...
>
> Do you have any recommendations on this?  I could (I think reasonably
> so) modify the existing virtio_balloon.c and have it change behavior
> based on a feature bit or other configuration.  I'm not sure that
> really addresses the root of what you're pointing out--it's still
> adding a different implementation, but doing so as an extension of an
> existing one.

Ideally, I believe we would have two balloon
top parts in a guest (one classical balloon,
one on the LRU), and four bottom parts (kvm,
xen, vmware & s390).

That way the virt specific bits of a balloon
driver would be essentially a ->balloon_page
and ->release_page callback for pages, as well
as methods to communicate with the host.

All the management of pages, including stuff
like putting them on the LRU, or isolating
them for migration, would be done with the
same common code, regardless of what virt
software we are running on.

Of course, that is a substantial amount of
work and I feel it would be unreasonable to
block anyone's code on that kind of thing
(especially considering that your code is good),
but I do believe the explosion of balloon
code is a little worrying.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 21:31   ` Frank Swiderski
  2012-06-26 21:45     ` Rik van Riel
@ 2012-06-26 21:47     ` Michael S. Tsirkin
  2012-06-26 23:21       ` Frank Swiderski
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-06-26 21:47 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rik van Riel, Rusty Russell, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew, Ying Han, Rafael Aquini

On Tue, Jun 26, 2012 at 02:31:26PM -0700, Frank Swiderski wrote:
> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel <riel@redhat.com> wrote:
> > On 06/26/2012 04:32 PM, Frank Swiderski wrote:
> >>
> >> This implementation of a virtio balloon driver uses the page cache to
> >> "store" pages that have been released to the host.  The communication
> >> (outside of target counts) is one way--the guest notifies the host when
> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> (via the regular page reclaim).  This means that inflating the balloon
> >> is similar to the existing balloon mechanism, but the deflate is
> >> different--it re-uses existing Linux kernel functionality to
> >> automatically reclaim.
> >>
> >> Signed-off-by: Frank Swiderski<fes@google.com>
> >
> >
> > It is a great idea, but how can this memory balancing
> > possibly work if someone uses memory cgroups inside a
> > guest?
> 
> Thanks and good point--this isn't something that I considered in the
> implementation.
> 
> > Having said that, we currently do not have proper
> > memory reclaim balancing between cgroups at all, so
> > requiring that of this balloon driver would be
> > unreasonable.
> >
> > The code looks good to me, my only worry is the
> > code duplication. We now have 5 balloon drivers,
> > for 4 hypervisors, all implementing everything
> > from scratch...
> 
> Do you have any recommendations on this?  I could (I think reasonably
> so) modify the existing virtio_balloon.c and have it change behavior
> based on a feature bit or other configuration.  I'm not sure that
> really addresses the root of what you're pointing out--it's still
> adding a different implementation, but doing so as an extension of an
> existing one.
> 
> fes

Let's assume it's a feature bit: how would you
formulate what the feature does *from host point of view*?

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 21:47     ` Michael S. Tsirkin
@ 2012-06-26 23:21       ` Frank Swiderski
  2012-06-27  9:02         ` Michael S. Tsirkin
  2012-07-02  0:29         ` Rusty Russell
  0 siblings, 2 replies; 30+ messages in thread
From: Frank Swiderski @ 2012-06-26 23:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rik van Riel, Rusty Russell, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew, Ying Han, Rafael Aquini

On Tue, Jun 26, 2012 at 2:47 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 26, 2012 at 02:31:26PM -0700, Frank Swiderski wrote:
>> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel <riel@redhat.com> wrote:
>> > On 06/26/2012 04:32 PM, Frank Swiderski wrote:
>> >>
>> >> This implementation of a virtio balloon driver uses the page cache to
>> >> "store" pages that have been released to the host.  The communication
>> >> (outside of target counts) is one way--the guest notifies the host when
>> >> it adds a page to the page cache, allowing the host to madvise(2) with
>> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> >> (via the regular page reclaim).  This means that inflating the balloon
>> >> is similar to the existing balloon mechanism, but the deflate is
>> >> different--it re-uses existing Linux kernel functionality to
>> >> automatically reclaim.
>> >>
>> >> Signed-off-by: Frank Swiderski<fes@google.com>
>> >
>> >
>> > It is a great idea, but how can this memory balancing
>> > possibly work if someone uses memory cgroups inside a
>> > guest?
>>
>> Thanks and good point--this isn't something that I considered in the
>> implementation.
>>
>> > Having said that, we currently do not have proper
>> > memory reclaim balancing between cgroups at all, so
>> > requiring that of this balloon driver would be
>> > unreasonable.
>> >
>> > The code looks good to me, my only worry is the
>> > code duplication. We now have 5 balloon drivers,
>> > for 4 hypervisors, all implementing everything
>> > from scratch...
>>
>> Do you have any recommendations on this?  I could (I think reasonably
>> so) modify the existing virtio_balloon.c and have it change behavior
>> based on a feature bit or other configuration.  I'm not sure that
>> really addresses the root of what you're pointing out--it's still
>> adding a different implementation, but doing so as an extension of an
>> existing one.
>>
>> fes
>
> Let's assume it's a feature bit: how would you
> formulate what the feature does *from host point of view*?
>
> --
> MST

In this implementation, the host doesn't keep track of pages in the
balloon, as there is no explicit deflate path.  The host device for
this implementation should merely, for example, MADV_DONTNEED on the
pages sent in an inflate.  Thus, the inflate becomes a notification
that the guest doesn't need those pages mapped in, but that they
should be available if the guest touches them.  In that sense, it's
not a rigid shrink of guest memory.  I'm not sure what I'd call the
feature bit though.

Was that the question you were asking, or did I misread?

fes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 21:45     ` Rik van Riel
@ 2012-06-26 23:45       ` Frank Swiderski
  2012-06-27  9:04         ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Frank Swiderski @ 2012-06-26 23:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Rusty Russell, Michael S. Tsirkin, Andrea Arcangeli,
	virtualization, linux-kernel, kvm, mikew, Ying Han,
	Rafael Aquini

On Tue, Jun 26, 2012 at 2:45 PM, Rik van Riel <riel@redhat.com> wrote:
> On 06/26/2012 05:31 PM, Frank Swiderski wrote:
>>
>> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel<riel@redhat.com>  wrote:
>
>
>>> The code looks good to me, my only worry is the
>>> code duplication. We now have 5 balloon drivers,
>>> for 4 hypervisors, all implementing everything
>>> from scratch...
>>
>>
>> Do you have any recommendations on this?  I could (I think reasonably
>> so) modify the existing virtio_balloon.c and have it change behavior
>> based on a feature bit or other configuration.  I'm not sure that
>> really addresses the root of what you're pointing out--it's still
>> adding a different implementation, but doing so as an extension of an
>> existing one.
>
>
> Ideally, I believe we would have two balloon
> top parts in a guest (one classical balloon,
> one on the LRU), and four bottom parts (kvm,
> xen, vmware & s390).
>
> That way the virt specific bits of a balloon
> driver would be essentially a ->balloon_page
> and ->release_page callback for pages, as well
> as methods to communicate with the host.
>
> All the management of pages, including stuff
> like putting them on the LRU, or isolating
> them for migration, would be done with the
> same common code, regardless of what virt
> software we are running on.
>
> Of course, that is a substantial amount of
> work and I feel it would be unreasonable to
> block anyone's code on that kind of thing
> (especially considering that your code is good),
> but I do believe the explosion of balloon
> code is a little worrying.
>

Hm, that makes a lot of sense.  That would be a few patches definitely
worth doing, IMHO.  I'm not entirely sure how I feel about inflating
the balloon drivers in the meantime.  Sigh, and I didn't even mean
that as a pun.

fes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 21:41 ` Michael S. Tsirkin
@ 2012-06-27  2:56   ` Rusty Russell
  2012-06-27 15:48     ` Frank Swiderski
  0 siblings, 1 reply; 30+ messages in thread
From: Rusty Russell @ 2012-06-27  2:56 UTC (permalink / raw)
  To: Michael S. Tsirkin, Frank Swiderski
  Cc: riel, Andrea Arcangeli, virtualization, linux-kernel, kvm, mikew

On Wed, 27 Jun 2012 00:41:06 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> > This implementation of a virtio balloon driver uses the page cache to
> > "store" pages that have been released to the host.  The communication
> > (outside of target counts) is one way--the guest notifies the host when
> > it adds a page to the page cache, allowing the host to madvise(2) with
> > MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> > (via the regular page reclaim).  This means that inflating the balloon
> > is similar to the existing balloon mechanism, but the deflate is
> > different--it re-uses existing Linux kernel functionality to
> > automatically reclaim.
> > 
> > Signed-off-by: Frank Swiderski <fes@google.com>
> 
> I'm pondering this:
> 
> Should it really be a separate driver/device ID?
> If it behaves the same from host POV, maybe it
> should be up to the guest how to inflate/deflate
> the balloon internally?

Well, it shouldn't steal ID 10, either way :)  Either use a completely
bogus number, or ask for an id.

But AFAICT this should be a an alternate driver of for the same device:
it's not really a separate device, is it?

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 23:21       ` Frank Swiderski
@ 2012-06-27  9:02         ` Michael S. Tsirkin
  2012-07-02  0:29         ` Rusty Russell
  1 sibling, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-06-27  9:02 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rik van Riel, Rusty Russell, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew, Ying Han, Rafael Aquini

On Tue, Jun 26, 2012 at 04:21:58PM -0700, Frank Swiderski wrote:
> On Tue, Jun 26, 2012 at 2:47 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Jun 26, 2012 at 02:31:26PM -0700, Frank Swiderski wrote:
> >> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel <riel@redhat.com> wrote:
> >> > On 06/26/2012 04:32 PM, Frank Swiderski wrote:
> >> >>
> >> >> This implementation of a virtio balloon driver uses the page cache to
> >> >> "store" pages that have been released to the host.  The communication
> >> >> (outside of target counts) is one way--the guest notifies the host when
> >> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> >> (via the regular page reclaim).  This means that inflating the balloon
> >> >> is similar to the existing balloon mechanism, but the deflate is
> >> >> different--it re-uses existing Linux kernel functionality to
> >> >> automatically reclaim.
> >> >>
> >> >> Signed-off-by: Frank Swiderski<fes@google.com>
> >> >
> >> >
> >> > It is a great idea, but how can this memory balancing
> >> > possibly work if someone uses memory cgroups inside a
> >> > guest?
> >>
> >> Thanks and good point--this isn't something that I considered in the
> >> implementation.
> >>
> >> > Having said that, we currently do not have proper
> >> > memory reclaim balancing between cgroups at all, so
> >> > requiring that of this balloon driver would be
> >> > unreasonable.
> >> >
> >> > The code looks good to me, my only worry is the
> >> > code duplication. We now have 5 balloon drivers,
> >> > for 4 hypervisors, all implementing everything
> >> > from scratch...
> >>
> >> Do you have any recommendations on this?  I could (I think reasonably
> >> so) modify the existing virtio_balloon.c and have it change behavior
> >> based on a feature bit or other configuration.  I'm not sure that
> >> really addresses the root of what you're pointing out--it's still
> >> adding a different implementation, but doing so as an extension of an
> >> existing one.
> >>
> >> fes
> >
> > Let's assume it's a feature bit: how would you
> > formulate what the feature does *from host point of view*?
> >
> > --
> > MST
> 
> In this implementation, the host doesn't keep track of pages in the
> balloon, as there is no explicit deflate path.  The host device for
> this implementation should merely, for example, MADV_DONTNEED on the
> pages sent in an inflate.  Thus, the inflate becomes a notification
> that the guest doesn't need those pages mapped in, but that they
> should be available if the guest touches them.

So guest access removes the page from the balloon,
since it cancels MADV_DONTNEED, right?
Okay. But what is the meaning of num_pages then?
For example, let's assume I set num_pages to 1,
then guest gives me a page and later accesses this
page. Is guest also required to give me another
page now? Later I send a config interrupt without
changing num_pages. Is guest required to give me another
page now?

> In that sense, it's
> not a rigid shrink of guest memory.  I'm not sure what I'd call the
> feature bit though.
> 
> Was that the question you were asking, or did I misread?
> 
> fes

Yes. It would be a good idea for you to try and write a spec IMO.
Send a patch to virtio.lyx

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 23:45       ` Frank Swiderski
@ 2012-06-27  9:04         ` Michael S. Tsirkin
  0 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-06-27  9:04 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rik van Riel, Rusty Russell, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew, Ying Han, Rafael Aquini

On Tue, Jun 26, 2012 at 04:45:36PM -0700, Frank Swiderski wrote:
> On Tue, Jun 26, 2012 at 2:45 PM, Rik van Riel <riel@redhat.com> wrote:
> > On 06/26/2012 05:31 PM, Frank Swiderski wrote:
> >>
> >> On Tue, Jun 26, 2012 at 1:40 PM, Rik van Riel<riel@redhat.com>  wrote:
> >
> >
> >>> The code looks good to me, my only worry is the
> >>> code duplication. We now have 5 balloon drivers,
> >>> for 4 hypervisors, all implementing everything
> >>> from scratch...
> >>
> >>
> >> Do you have any recommendations on this?  I could (I think reasonably
> >> so) modify the existing virtio_balloon.c and have it change behavior
> >> based on a feature bit or other configuration.  I'm not sure that
> >> really addresses the root of what you're pointing out--it's still
> >> adding a different implementation, but doing so as an extension of an
> >> existing one.
> >
> >
> > Ideally, I believe we would have two balloon
> > top parts in a guest (one classical balloon,
> > one on the LRU), and four bottom parts (kvm,
> > xen, vmware & s390).
> >
> > That way the virt specific bits of a balloon
> > driver would be essentially a ->balloon_page
> > and ->release_page callback for pages, as well
> > as methods to communicate with the host.
> >
> > All the management of pages, including stuff
> > like putting them on the LRU, or isolating
> > them for migration, would be done with the
> > same common code, regardless of what virt
> > software we are running on.
> >
> > Of course, that is a substantial amount of
> > work and I feel it would be unreasonable to
> > block anyone's code on that kind of thing
> > (especially considering that your code is good),
> > but I do believe the explosion of balloon
> > code is a little worrying.
> >
> 
> Hm, that makes a lot of sense.  That would be a few patches definitely
> worth doing, IMHO.  I'm not entirely sure how I feel about inflating
> the balloon drivers in the meantime.  Sigh, and I didn't even mean
> that as a pun.
> 
> fes

Actually I'm not 100% sure the num_pages interface
of the classical balloon is a good fit for the LRU
balloon. Let's figure that out first: if we fork the interface
there might not be all that much common code ...

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 20:32 [PATCH] Add a page cache-backed balloon device driver Frank Swiderski
  2012-06-26 20:40 ` Rik van Riel
  2012-06-26 21:41 ` Michael S. Tsirkin
@ 2012-06-27  9:40 ` Amit Shah
  2012-08-30  8:57 ` Michael S. Tsirkin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Amit Shah @ 2012-06-27  9:40 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, Michael S. Tsirkin, riel, Andrea Arcangeli, mikew,
	linux-kernel, kvm, virtualization

On (Tue) 26 Jun 2012 [13:32:58], Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host.  The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim).  This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.

This is a good idea for a guest co-operative balloon driver.  I don't
think it'll replace the original driver.  The traditional balloon
model is essentially driven by the host to increase guest density on
the host.  This driver can't work in that case.  However, using both
the types of drivers will be helpful, as unused pages on the guest
will be able to be used by the host.

Balbir Singh had done some work earlier on a guest co-operative
balloon driver, but AFAIR it was with modification of the existing
virtio-balloon driver.

I don't think a separate driver is necessary for the functionality,
though.  Perhaps just a new config space item which mentions how many
pages are present in the page cache, so that host do some accounting
as well.

		Amit

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-27  2:56   ` Rusty Russell
@ 2012-06-27 15:48     ` Frank Swiderski
  2012-06-27 16:06       ` Michael S. Tsirkin
  0 siblings, 1 reply; 30+ messages in thread
From: Frank Swiderski @ 2012-06-27 15:48 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Michael S. Tsirkin, riel, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew

On Tue, Jun 26, 2012 at 7:56 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> On Wed, 27 Jun 2012 00:41:06 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> > This implementation of a virtio balloon driver uses the page cache to
>> > "store" pages that have been released to the host.  The communication
>> > (outside of target counts) is one way--the guest notifies the host when
>> > it adds a page to the page cache, allowing the host to madvise(2) with
>> > MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> > (via the regular page reclaim).  This means that inflating the balloon
>> > is similar to the existing balloon mechanism, but the deflate is
>> > different--it re-uses existing Linux kernel functionality to
>> > automatically reclaim.
>> >
>> > Signed-off-by: Frank Swiderski <fes@google.com>
>>
>> I'm pondering this:
>>
>> Should it really be a separate driver/device ID?
>> If it behaves the same from host POV, maybe it
>> should be up to the guest how to inflate/deflate
>> the balloon internally?
>
> Well, it shouldn't steal ID 10, either way :)  Either use a completely
> bogus number, or ask for an id.
>
> But AFAICT this should be a an alternate driver of for the same device:
> it's not really a separate device, is it?
>
> Cheers,
> Rusty.

Apologies, Rusty.  Asking for an ID is in the virtio spec, and I
completely neglected that step.  Though as you and others have pointed
out, this probably fits better as a different driver for the same
device.  Since it changes whether or not the deflate operation is
necessary, it also seems that how this should look is different
behavior based on a feature bit in the device.

If that sounds reasonable, then what I'll do with this patch is merge
it with the existing virtio balloon driver with a feature bit for
determining which behavior to use.

I also think the idea of a generic balloon that the different balloon
drivers use for the inflate/deflate operations is interesting and
useful, though I think the suggestion of pending that until later is
correct.

Sounds reasonable?

Regards,
fes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-27 15:48     ` Frank Swiderski
@ 2012-06-27 16:06       ` Michael S. Tsirkin
  2012-06-27 16:08         ` Frank Swiderski
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-06-27 16:06 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, riel, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew

On Wed, Jun 27, 2012 at 08:48:55AM -0700, Frank Swiderski wrote:
> On Tue, Jun 26, 2012 at 7:56 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> > On Wed, 27 Jun 2012 00:41:06 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> >> > This implementation of a virtio balloon driver uses the page cache to
> >> > "store" pages that have been released to the host.  The communication
> >> > (outside of target counts) is one way--the guest notifies the host when
> >> > it adds a page to the page cache, allowing the host to madvise(2) with
> >> > MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> > (via the regular page reclaim).  This means that inflating the balloon
> >> > is similar to the existing balloon mechanism, but the deflate is
> >> > different--it re-uses existing Linux kernel functionality to
> >> > automatically reclaim.
> >> >
> >> > Signed-off-by: Frank Swiderski <fes@google.com>
> >>
> >> I'm pondering this:
> >>
> >> Should it really be a separate driver/device ID?
> >> If it behaves the same from host POV, maybe it
> >> should be up to the guest how to inflate/deflate
> >> the balloon internally?
> >
> > Well, it shouldn't steal ID 10, either way :)  Either use a completely
> > bogus number, or ask for an id.
> >
> > But AFAICT this should be a an alternate driver of for the same device:
> > it's not really a separate device, is it?
> >
> > Cheers,
> > Rusty.
> 
> Apologies, Rusty.  Asking for an ID is in the virtio spec, and I
> completely neglected that step.  Though as you and others have pointed
> out, this probably fits better as a different driver for the same
> device.  Since it changes whether or not the deflate operation is
> necessary, it also seems that how this should look is different
> behavior based on a feature bit in the device.
> 
> If that sounds reasonable, then what I'll do with this patch is merge
> it with the existing virtio balloon driver with a feature bit for
> determining which behavior to use.
> 
> I also think the idea of a generic balloon that the different balloon
> drivers use for the inflate/deflate operations is interesting and
> useful, though I think the suggestion of pending that until later is
> correct.
> 
> Sounds reasonable?
> 
> Regards,
> fes

I think a spec patch would be a good spec at this point.
You can get the spec from Rusty, or a mirror
from my git:

git://git.kernel.org/pub/scm/virt/kvm/mst/virtio-spec.git




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-27 16:06       ` Michael S. Tsirkin
@ 2012-06-27 16:08         ` Frank Swiderski
  0 siblings, 0 replies; 30+ messages in thread
From: Frank Swiderski @ 2012-06-27 16:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, riel, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew

On Wed, Jun 27, 2012 at 9:06 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jun 27, 2012 at 08:48:55AM -0700, Frank Swiderski wrote:
>> On Tue, Jun 26, 2012 at 7:56 PM, Rusty Russell <rusty@rustcorp.com.au> wrote:
>> > On Wed, 27 Jun 2012 00:41:06 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> >> > This implementation of a virtio balloon driver uses the page cache to
>> >> > "store" pages that have been released to the host.  The communication
>> >> > (outside of target counts) is one way--the guest notifies the host when
>> >> > it adds a page to the page cache, allowing the host to madvise(2) with
>> >> > MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> >> > (via the regular page reclaim).  This means that inflating the balloon
>> >> > is similar to the existing balloon mechanism, but the deflate is
>> >> > different--it re-uses existing Linux kernel functionality to
>> >> > automatically reclaim.
>> >> >
>> >> > Signed-off-by: Frank Swiderski <fes@google.com>
>> >>
>> >> I'm pondering this:
>> >>
>> >> Should it really be a separate driver/device ID?
>> >> If it behaves the same from host POV, maybe it
>> >> should be up to the guest how to inflate/deflate
>> >> the balloon internally?
>> >
>> > Well, it shouldn't steal ID 10, either way :)  Either use a completely
>> > bogus number, or ask for an id.
>> >
>> > But AFAICT this should be a an alternate driver of for the same device:
>> > it's not really a separate device, is it?
>> >
>> > Cheers,
>> > Rusty.
>>
>> Apologies, Rusty.  Asking for an ID is in the virtio spec, and I
>> completely neglected that step.  Though as you and others have pointed
>> out, this probably fits better as a different driver for the same
>> device.  Since it changes whether or not the deflate operation is
>> necessary, it also seems that how this should look is different
>> behavior based on a feature bit in the device.
>>
>> If that sounds reasonable, then what I'll do with this patch is merge
>> it with the existing virtio balloon driver with a feature bit for
>> determining which behavior to use.
>>
>> I also think the idea of a generic balloon that the different balloon
>> drivers use for the inflate/deflate operations is interesting and
>> useful, though I think the suggestion of pending that until later is
>> correct.
>>
>> Sounds reasonable?
>>
>> Regards,
>> fes
>
> I think a spec patch would be a good spec at this point.
> You can get the spec from Rusty, or a mirror
> from my git:
>
> git://git.kernel.org/pub/scm/virt/kvm/mst/virtio-spec.git
>
>
>


Got it, thanks, will do.

Regards,
fes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 23:21       ` Frank Swiderski
  2012-06-27  9:02         ` Michael S. Tsirkin
@ 2012-07-02  0:29         ` Rusty Russell
  2012-09-03  6:35           ` Paolo Bonzini
  1 sibling, 1 reply; 30+ messages in thread
From: Rusty Russell @ 2012-07-02  0:29 UTC (permalink / raw)
  To: Frank Swiderski, Michael S. Tsirkin
  Cc: Rik van Riel, Andrea Arcangeli, virtualization, linux-kernel,
	kvm, mikew, Ying Han, Rafael Aquini

On Tue, 26 Jun 2012 16:21:58 -0700, Frank Swiderski <fes@google.com> wrote:
> On Tue, Jun 26, 2012 at 2:47 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > Let's assume it's a feature bit: how would you
> > formulate what the feature does *from host point of view*?
> 
> In this implementation, the host doesn't keep track of pages in the
> balloon, as there is no explicit deflate path.  The host device for
> this implementation should merely, for example, MADV_DONTNEED on the
> pages sent in an inflate.  Thus, the inflate becomes a notification
> that the guest doesn't need those pages mapped in, but that they
> should be available if the guest touches them.  In that sense, it's
> not a rigid shrink of guest memory.  I'm not sure what I'd call the
> feature bit though.
> 
> Was that the question you were asking, or did I misread?

Hmm, the spec is unfortunately vague: !VIRTIO_BALLOON_F_MUST_TELL_HOST
implies you should tell the host (eventually).  I don't know if any
implementations actually care though.

We could add a VIRTIO_BALLOON_F_NEVER_TELL_DEFLATE which would mean the
deflate vq need not be used at all.

Is it altogether impossible to know when a page is reused in your
implementation?  If we could do that, we could replace our balloon with
this one.

(My deep ignorance of vm issues is hurting us here, sorry.)

Cheers,
Rusty.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 20:32 [PATCH] Add a page cache-backed balloon device driver Frank Swiderski
                   ` (2 preceding siblings ...)
  2012-06-27  9:40 ` Amit Shah
@ 2012-08-30  8:57 ` Michael S. Tsirkin
  2012-09-03 15:09 ` Avi Kivity
  2012-09-10  9:05 ` Michael S. Tsirkin
  5 siblings, 0 replies; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-08-30  8:57 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, riel, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew

On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> +static void fill_balloon(struct virtio_balloon *vb, size_t num)
> +{
> +	int err;
> +
> +	/* We can only do one array worth at a time. */
> +	num = min(num, ARRAY_SIZE(vb->pfns));
> +
> +	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
> +		struct page *page;
> +		unsigned long inode_pfn = find_available_inode_page(vb);
> +		/* Should always be able to find a page. */
> +		BUG_ON(!inode_pfn);
> +		page = read_mapping_page(the_inode.inode.i_mapping, inode_pfn,
> +					 NULL);
> +		if (IS_ERR(page)) {
> +			if (printk_ratelimit())
> +				dev_printk(KERN_INFO, &vb->vdev->dev,
> +					   "Out of puff! Can't get %zu pages\n",
> +					   num);
> +			break;
> +		}
> +
> +		/* Set the page to be dirty */
> +		set_page_dirty(page);
> +
> +		vb->pfns[vb->num_pfns] = page_to_pfn(page);
> +	}
> +
> +	/* Didn't get any?  Oh well. */
> +	if (vb->num_pfns == 0)
> +		return;

Went to look at this driver, and noticed caller will re-invoke
this immediately if this triggers, so we busy-wait
re-trying this. When does read_mapping_page fail?
Is there any condition we could wait on?

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-07-02  0:29         ` Rusty Russell
@ 2012-09-03  6:35           ` Paolo Bonzini
  2012-09-06  1:35             ` Rusty Russell
  0 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2012-09-03  6:35 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Frank Swiderski, Michael S. Tsirkin, Andrea Arcangeli,
	Rik van Riel, Rafael Aquini, kvm, linux-kernel, mikew, Ying Han,
	virtualization

Il 02/07/2012 02:29, Rusty Russell ha scritto:
> VIRTIO_BALLOON_F_MUST_TELL_HOST
> implies you should tell the host (eventually).  I don't know if any
> implementations actually care though.

This is indeed broken, because it is a "negative" feature: it tells you
that "implicit deflate" is _not_ supported.

Right now, QEMU refuses migration if the target does not support all the
features that were negotiated.  But then:

- a migration from non-MUST_TELL_HOST to MUST_TELL_HOST will succeed,
which is wrong;

- a migration from MUST_TELL_HOST to non-MUST_TELL_HOST will fail, which
is useless.

> We could add a VIRTIO_BALLOON_F_NEVER_TELL_DEFLATE which would mean the
> deflate vq need not be used at all.

That would work.  At the same time we could deprecate MUST_TELL_HOST.
Certainly the guest implementations don't care, or we would have
experienced problems such as the one above.  The QEMU implementation
also does not care but, for example, a Xen implementation would care.

Paolo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 20:32 [PATCH] Add a page cache-backed balloon device driver Frank Swiderski
                   ` (3 preceding siblings ...)
  2012-08-30  8:57 ` Michael S. Tsirkin
@ 2012-09-03 15:09 ` Avi Kivity
  2012-09-10  9:05 ` Michael S. Tsirkin
  5 siblings, 0 replies; 30+ messages in thread
From: Avi Kivity @ 2012-09-03 15:09 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, Michael S. Tsirkin, riel, Andrea Arcangeli,
	virtualization, linux-kernel, kvm, mikew

On 06/26/2012 11:32 PM, Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host.  The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim).  This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.

Interesting idea.

How is the host able to manage overcommit this way?  If deflate is not
host controlled, the host may start swapping guests out to disk if they
all self-deflate.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-03  6:35           ` Paolo Bonzini
@ 2012-09-06  1:35             ` Rusty Russell
  0 siblings, 0 replies; 30+ messages in thread
From: Rusty Russell @ 2012-09-06  1:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Frank Swiderski, Michael S. Tsirkin, Andrea Arcangeli,
	Rik van Riel, Rafael Aquini, kvm, linux-kernel, mikew, Ying Han,
	virtualization

Paolo Bonzini <pbonzini@redhat.com> writes:
> Il 02/07/2012 02:29, Rusty Russell ha scritto:
>> VIRTIO_BALLOON_F_MUST_TELL_HOST
>> implies you should tell the host (eventually).  I don't know if any
>> implementations actually care though.
>
> This is indeed broken, because it is a "negative" feature: it tells you
> that "implicit deflate" is _not_ supported.
>
> Right now, QEMU refuses migration if the target does not support all the
> features that were negotiated.  But then:
>
> - a migration from non-MUST_TELL_HOST to MUST_TELL_HOST will succeed,
> which is wrong;
>
> - a migration from MUST_TELL_HOST to non-MUST_TELL_HOST will fail, which
> is useless.
>
>> We could add a VIRTIO_BALLOON_F_NEVER_TELL_DEFLATE which would mean the
>> deflate vq need not be used at all.
>
> That would work.  At the same time we could deprecate MUST_TELL_HOST.
> Certainly the guest implementations don't care, or we would have
> experienced problems such as the one above.  The QEMU implementation
> also does not care but, for example, a Xen implementation would care.

OK; I'm not sure we need to deprecate MUST_TELL_HOST, though since it's
never actually been used there's a good argument.

VIRTIO_BALLOON_F_SILENT_DEFLATE (or whatever it's called) would
obviously mean you couldn't ack VIRTIO_BALLOON_F_MUST_TELL_HOST.

Patches welcome!

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-06-26 20:32 [PATCH] Add a page cache-backed balloon device driver Frank Swiderski
                   ` (4 preceding siblings ...)
  2012-09-03 15:09 ` Avi Kivity
@ 2012-09-10  9:05 ` Michael S. Tsirkin
  2012-09-10 17:37   ` Mike Waychison
  5 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-09-10  9:05 UTC (permalink / raw)
  To: Frank Swiderski
  Cc: Rusty Russell, riel, Andrea Arcangeli, virtualization,
	linux-kernel, kvm, mikew

On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host.  The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim).  This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.
> 
> Signed-off-by: Frank Swiderski <fes@google.com>

I've been trying to understand this, and I have
a question: what exactly is the benefit
of this new device?

Note that users could not care less about how a driver
is implemented internally.

Is there some workload where you see VM working better with
this than regular balloon? Any numbers?

Also, can't we just replace existing balloon implementation
with this one?  Why it is so important to deflate silently?
I guess filesystem does not currently get a callback
before page is reclaimed but this isan implementation detail -
maybe this can be fixed?

Also can you pls answer Avi's question?
How is overcommit managed?


> ---
>  drivers/virtio/Kconfig              |   13 +
>  drivers/virtio/Makefile             |    1 +
>  drivers/virtio/virtio_fileballoon.c |  636 +++++++++++++++++++++++++++++++++++
>  include/linux/virtio_balloon.h      |    9 +
>  include/linux/virtio_ids.h          |    1 +
>  5 files changed, 660 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/virtio/virtio_fileballoon.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index f38b17a..cffa2a7 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>  
>  	 If unsure, say M.
>  
> +config VIRTIO_FILEBALLOON
> +	tristate "Virtio page cache-backed balloon driver"
> +	select VIRTIO
> +	select VIRTIO_RING
> +	---help---
> +	 This driver supports decreasing and automatically reclaiming the
> +	 memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
> +	 tries to maintain a specific target balloon size using the page cache.
> +	 This allows the guest to implicitly deflate the balloon by flushing
> +	 pages from the cache and touching the page.
> +
> +	 If unsure, say N.
> +
>   config VIRTIO_MMIO
>   	tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
>   	depends on HAS_IOMEM && EXPERIMENTAL
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 5a4c63c..7ca0a3f 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
> diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c
> new file mode 100644
> index 0000000..ff252ec
> --- /dev/null
> +++ b/drivers/virtio/virtio_fileballoon.c
> @@ -0,0 +1,636 @@
> +/* Virtio file (page cache-backed) balloon implementation, inspired by
> + * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's
> + * implementation.
> + *
> + * This implementation of the virtio balloon driver re-uses the page cache to
> + * allow memory consumed by inflating the balloon to be reclaimed by linux.  It
> + * creates and mounts a bare-bones filesystem containing a single inode.  When
> + * the host requests the balloon to inflate, it does so by "reading" pages at
> + * offsets into the inode mapping's page_tree.  The host is notified when the
> + * pages are added to the page_tree, allowing it (the host) to madvise(2) the
> + * corresponding host memory, reducing the RSS of the virtual machine.  In this
> + * implementation, the host is only notified when a page is added to the
> + * balloon.  Reclaim happens under the existing TTFP logic, which flushes unused
> + * pages in the page cache.  If the host used MADV_DONTNEED, then when the guest
> + * uses the page, the zero page will be mapped in, allowing automatic (and fast,
> + * compared to requiring a host notification via a virtio queue to get memory
> + * back) reclaim.
> + *
> + *  Copyright 2008 Rusty Russell IBM Corporation
> + *  Copyright 2011 Frank Swiderski Google Inc
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> + */
> +#include <linux/backing-dev.h>
> +#include <linux/delay.h>
> +#include <linux/file.h>
> +#include <linux/freezer.h>
> +#include <linux/fs.h>
> +#include <linux/jiffies.h>
> +#include <linux/kthread.h>
> +#include <linux/module.h>
> +#include <linux/mount.h>
> +#include <linux/pagemap.h>
> +#include <linux/slab.h>
> +#include <linux/swap.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_balloon.h>
> +#include <linux/writeback.h>
> +
> +#define VIRTBALLOON_PFN_ARRAY_SIZE 256
> +
> +struct virtio_balloon {
> +	struct virtio_device *vdev;
> +	struct virtqueue *inflate_vq;
> +
> +	/* Where the ballooning thread waits for config to change. */
> +	wait_queue_head_t config_change;
> +
> +	/* The thread servicing the balloon. */
> +	struct task_struct *thread;
> +
> +	/* Waiting for host to ack the pages we released. */
> +	struct completion acked;
> +
> +	/* The array of pfns we tell the Host about. */
> +	unsigned int num_pfns;
> +	u32 pfns[VIRTBALLOON_PFN_ARRAY_SIZE];
> +
> +	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> +
> +	/* The last page offset read into the mapping's page_tree */
> +	unsigned long last_scan_page_array;
> +
> +	/* The last time a page was reclaimed */
> +	unsigned long last_reclaim;
> +};
> +
> +/* Magic number used for the skeleton filesystem in the call to mount_pseudo */
> +#define BALLOONFS_MAGIC 0x42414c4c
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_FILE_BALLOON, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +/*
> + * The skeleton filesystem contains a single inode, held by the structure below.
> + * Using the containing structure below allows easy access to the struct
> + * virtio_balloon.
> + */
> +static struct balloon_inode {
> +	struct inode inode;
> +	struct virtio_balloon *vb;
> +} the_inode;
> +
> +/*
> + * balloon_alloc_inode is called when the single inode for the skeleton
> + * filesystem is created in init() with the call to new_inode.
> + */
> +static struct inode *balloon_alloc_inode(struct super_block *sb)
> +{
> +	static bool already_inited;
> +	/* We should only ever be called once! */
> +	BUG_ON(already_inited);
> +	already_inited = true;
> +	inode_init_once(&the_inode.inode);
> +	return &the_inode.inode;
> +}
> +
> +/* Noop implementation of destroy_inode.  */
> +static void balloon_destroy_inode(struct inode *inode)
> +{
> +}
> +
> +static int balloon_sync_fs(struct super_block *sb, int wait)
> +{
> +	return filemap_write_and_wait(the_inode.inode.i_mapping);
> +}
> +
> +static const struct super_operations balloonfs_ops = {
> +	.alloc_inode	= balloon_alloc_inode,
> +	.destroy_inode	= balloon_destroy_inode,
> +	.sync_fs	= balloon_sync_fs,
> +};
> +
> +static const struct dentry_operations balloonfs_dentry_operations = {
> +};
> +
> +/*
> + * balloonfs_writepage is called when linux needs to reclaim memory held using
> + * the balloonfs' page cache.
> + */
> +static int balloonfs_writepage(struct page *page, struct writeback_control *wbc)
> +{
> +	the_inode.vb->last_reclaim = jiffies;
> +	SetPageUptodate(page);
> +	ClearPageDirty(page);
> +	/*
> +	 * If the page isn't being flushed from the page allocator, go ahead and
> +	 * drop it from the page cache anyway.
> +	 */
> +	if (!wbc->for_reclaim)
> +		delete_from_page_cache(page);
> +	unlock_page(page);
> +	return 0;
> +}
> +
> +/* Nearly no-op implementation of readpage */
> +static int balloonfs_readpage(struct file *file, struct page *page)
> +{
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +	return 0;
> +}
> +
> +static const struct address_space_operations balloonfs_aops = {
> +	.writepage	= balloonfs_writepage,
> +	.readpage	= balloonfs_readpage
> +};
> +
> +static struct backing_dev_info balloonfs_backing_dev_info = {
> +	.name		= "balloonfs",
> +	.ra_pages	= 0,
> +	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK
> +};
> +
> +static struct dentry *balloonfs_mount(struct file_system_type *fs_type,
> +			 int flags, const char *dev_name, void *data)
> +{
> +	struct dentry *root;
> +	struct inode *inode;
> +	root = mount_pseudo(fs_type, "balloon:", &balloonfs_ops,
> +			    &balloonfs_dentry_operations, BALLOONFS_MAGIC);
> +	inode = root->d_inode;
> +	inode->i_mapping->a_ops = &balloonfs_aops;
> +	mapping_set_gfp_mask(inode->i_mapping,
> +			     (GFP_HIGHUSER | __GFP_NOMEMALLOC));
> +	inode->i_mapping->backing_dev_info = &balloonfs_backing_dev_info;
> +	return root;
> +}
> +
> +/* The single mounted skeleton filesystem */
> +static struct vfsmount *balloon_mnt __read_mostly;
> +
> +static struct file_system_type balloon_fs_type = {
> +	.name =		"balloonfs",
> +	.mount =	balloonfs_mount,
> +	.kill_sb =	kill_anon_super,
> +};
> +
> +/* Acknowledges a message from the specified virtqueue. */
> +static void balloon_ack(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb;
> +	unsigned int len;
> +
> +	vb = virtqueue_get_buf(vq, &len);
> +	if (vb)
> +		complete(&vb->acked);
> +}
> +
> +/*
> + * Scans the page_tree for the inode's mapping, looking for an offset that is
> + * currently empty, returning that index (or 0 if it could not fill the
> + * request).
> + */
> +static unsigned long find_available_inode_page(struct virtio_balloon *vb)
> +{
> +	unsigned long radix_index, index, max_scan;
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +
> +	/*
> +	 * This function is a serialized call (only happens on the free-to-host
> +	 * thread), so no locking is necessary here.
> +	 */
> +	index = vb->last_scan_page_array;
> +	max_scan = totalram_pages - vb->last_scan_page_array;
> +
> +	/*
> +	 * Scan starting at the last scanned offset, then wrap around if
> +	 * necessary.
> +	 */
> +	if (index == 0)
> +		index = 1;
> +	rcu_read_lock();
> +	radix_index = radix_tree_next_hole(&mapping->page_tree,
> +					   index, max_scan);
> +	rcu_read_unlock();
> +	/*
> +	 * If we hit the end of the tree, wrap and search up to the original
> +	 * index.
> +	 */
> +	if (radix_index - index >= max_scan) {
> +		if (index != 1) {
> +			rcu_read_lock();
> +			radix_index = radix_tree_next_hole(&mapping->page_tree,
> +							   1, index);
> +			rcu_read_unlock();
> +			if (radix_index - 1 >= index)
> +				radix_index = 0;
> +		} else {
> +			radix_index = 0;
> +		}
> +	}
> +	vb->last_scan_page_array = radix_index;
> +
> +	return radix_index;
> +}
> +
> +/* Notifies the host of pages in the specified virtqueue. */
> +static int tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	int err;
> +	struct scatterlist sg;
> +
> +	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +
> +	init_completion(&vb->acked);
> +
> +	/* We should always be able to add one buffer to an empty queue. */
> +	err = virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL);
> +	if (err  < 0)
> +		return err;
> +	virtqueue_kick(vq);
> +
> +	/* When host has read buffer, this completes via balloon_ack */
> +	wait_for_completion(&vb->acked);
> +	return err;
> +}
> +
> +static void fill_balloon(struct virtio_balloon *vb, size_t num)
> +{
> +	int err;
> +
> +	/* We can only do one array worth at a time. */
> +	num = min(num, ARRAY_SIZE(vb->pfns));
> +
> +	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
> +		struct page *page;
> +		unsigned long inode_pfn = find_available_inode_page(vb);
> +		/* Should always be able to find a page. */
> +		BUG_ON(!inode_pfn);
> +		page = read_mapping_page(the_inode.inode.i_mapping, inode_pfn,
> +					 NULL);
> +		if (IS_ERR(page)) {
> +			if (printk_ratelimit())
> +				dev_printk(KERN_INFO, &vb->vdev->dev,
> +					   "Out of puff! Can't get %zu pages\n",
> +					   num);
> +			break;
> +		}
> +
> +		/* Set the page to be dirty */
> +		set_page_dirty(page);
> +
> +		vb->pfns[vb->num_pfns] = page_to_pfn(page);
> +	}
> +
> +	/* Didn't get any?  Oh well. */
> +	if (vb->num_pfns == 0)
> +		return;
> +
> +	/* Notify the host of the pages we just added to the page_tree. */
> +	err = tell_host(vb, vb->inflate_vq);
> +
> +	for (; vb->num_pfns != 0; vb->num_pfns--) {
> +		struct page *page = pfn_to_page(vb->pfns[vb->num_pfns - 1]);
> +		/*
> +		 * Release our refcount on the page so that it can be reclaimed
> +		 * when necessary.
> +		 */
> +		page_cache_release(page);
> +	}
> +	__mark_inode_dirty(&the_inode.inode, I_DIRTY_PAGES);
> +}
> +
> +static inline void update_stat(struct virtio_balloon *vb, int idx,
> +			       u64 val)
> +{
> +	BUG_ON(idx >= VIRTIO_BALLOON_S_NR);
> +	vb->stats[idx].tag = idx;
> +	vb->stats[idx].val = val;
> +}
> +
> +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
> +
> +static inline u32 config_pages(struct virtio_balloon *vb);
> +static void update_balloon_stats(struct virtio_balloon *vb)
> +{
> +	unsigned long events[NR_VM_EVENT_ITEMS];
> +	struct sysinfo i;
> +
> +	all_vm_events(events);
> +	si_meminfo(&i);
> +
> +	update_stat(vb, VIRTIO_BALLOON_S_SWAP_IN,
> +		    pages_to_bytes(events[PSWPIN]));
> +	update_stat(vb, VIRTIO_BALLOON_S_SWAP_OUT,
> +		    pages_to_bytes(events[PSWPOUT]));
> +	update_stat(vb, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
> +	update_stat(vb, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> +
> +	/* Total and Free Mem */
> +	update_stat(vb, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram));
> +	update_stat(vb, VIRTIO_BALLOON_S_MEMTOT, pages_to_bytes(i.totalram));
> +}
> +
> +static void virtballoon_changed(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = vdev->priv;
> +
> +	wake_up(&vb->config_change);
> +}
> +
> +static inline bool config_need_stats(struct virtio_balloon *vb)
> +{
> +	u32 v = 0;
> +
> +	vb->vdev->config->get(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       need_stats),
> +			      &v, sizeof(v));
> +	return (v != 0);
> +}
> +
> +static inline u32 config_pages(struct virtio_balloon *vb)
> +{
> +	u32 v = 0;
> +
> +	vb->vdev->config->get(vb->vdev,
> +			      offsetof(struct virtio_balloon_config, num_pages),
> +			      &v, sizeof(v));
> +	return v;
> +}
> +
> +static inline s64 towards_target(struct virtio_balloon *vb)
> +{
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +	u32 v = config_pages(vb);
> +
> +	return (s64)v - (mapping ? mapping->nrpages : 0);
> +}
> +
> +static void update_balloon_size(struct virtio_balloon *vb)
> +{
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +	__le32 actual = cpu_to_le32((mapping ? mapping->nrpages : 0));
> +
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config, actual),
> +			      &actual, sizeof(actual));
> +}
> +
> +static void update_free_and_total(struct virtio_balloon *vb)
> +{
> +	struct sysinfo i;
> +	u32 value;
> +
> +	si_meminfo(&i);
> +
> +	update_balloon_stats(vb);
> +	value = i.totalram;
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       pages_total),
> +			      &value, sizeof(value));
> +	value = i.freeram;
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       pages_free),
> +			      &value, sizeof(value));
> +	value = 0;
> +	vb->vdev->config->set(vb->vdev,
> +			      offsetof(struct virtio_balloon_config,
> +				       need_stats),
> +			      &value, sizeof(value));
> +}
> +
> +static int balloon(void *_vballoon)
> +{
> +	struct virtio_balloon *vb = _vballoon;
> +
> +	set_freezable();
> +	while (!kthread_should_stop()) {
> +		s64 diff;
> +		try_to_freeze();
> +		wait_event_interruptible(vb->config_change,
> +					 (diff = towards_target(vb)) > 0
> +					 || config_need_stats(vb)
> +					 || kthread_should_stop()
> +					 || freezing(current));
> +		if (config_need_stats(vb))
> +			update_free_and_total(vb);
> +		if (diff > 0) {
> +			unsigned long reclaim_time = vb->last_reclaim + 2 * HZ;
> +			/*
> +			 * Don't fill the balloon if a page reclaim happened in
> +			 * the past 2 seconds.
> +			 */
> +			if (time_after_eq(reclaim_time, jiffies)) {
> +				/* Inflating too fast--sleep and skip. */
> +				msleep(500);
> +			} else {
> +				fill_balloon(vb, diff);
> +			}
> +		} else if (diff < 0 && config_pages(vb) == 0) {
> +			/*
> +			 * Here we are specifically looking to detect the case
> +			 * where there are pages in the page cache, but the
> +			 * device wants us to go to 0.  This is used in save/
> +			 * restore since the host device doesn't keep track of
> +			 * PFNs, and must flush the page cache on restore
> +			 * (which loses the context of the original device
> +			 * instance).  However, we still suggest syncing the
> +			 * diff so that we can get within the target range.
> +			 */
> +			s64 nr_to_write =
> +				(!config_pages(vb) ? LONG_MAX : -diff);
> +			struct writeback_control wbc = {
> +				.sync_mode = WB_SYNC_ALL,
> +				.nr_to_write = nr_to_write,
> +				.range_start = 0,
> +				.range_end = LLONG_MAX,
> +			};
> +			sync_inode(&the_inode.inode, &wbc);
> +		}
> +		update_balloon_size(vb);
> +	}
> +	return 0;
> +}
> +
> +static ssize_t virtballoon_attr_show(struct device *dev,
> +				     struct device_attribute *attr,
> +				     char *buf);
> +
> +static DEVICE_ATTR(total_memory, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static DEVICE_ATTR(free_memory, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static DEVICE_ATTR(target_pages, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static DEVICE_ATTR(actual_pages, 0644,
> +	virtballoon_attr_show, NULL);
> +
> +static struct attribute *virtballoon_attrs[] = {
> +	&dev_attr_total_memory.attr,
> +	&dev_attr_free_memory.attr,
> +	&dev_attr_target_pages.attr,
> +	&dev_attr_actual_pages.attr,
> +	NULL
> +};
> +static struct attribute_group virtballoon_attr_group = {
> +	.name	= "virtballoon",
> +	.attrs	= virtballoon_attrs,
> +};
> +
> +static ssize_t virtballoon_attr_show(struct device *dev,
> +				     struct device_attribute *attr,
> +				     char *buf)
> +{
> +	struct address_space *mapping = the_inode.inode.i_mapping;
> +	struct virtio_device *vdev = container_of(dev, struct virtio_device,
> +						  dev);
> +	struct virtio_balloon *vb = vdev->priv;
> +	unsigned long long value = 0;
> +	if (attr == &dev_attr_total_memory)
> +		value = vb->stats[VIRTIO_BALLOON_S_MEMTOT].val;
> +	else if (attr == &dev_attr_free_memory)
> +		value = vb->stats[VIRTIO_BALLOON_S_MEMFREE].val;
> +	else if (attr == &dev_attr_target_pages)
> +		value = config_pages(vb);
> +	else if (attr == &dev_attr_actual_pages)
> +		value = cpu_to_le32((mapping ? mapping->nrpages : 0));
> +	return sprintf(buf, "%llu\n", value);
> +}
> +
> +static int virtballoon_probe(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb;
> +	struct virtqueue *vq[1];
> +	vq_callback_t *callback = balloon_ack;
> +	const char *name = "inflate";
> +	int err;
> +
> +	vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> +	if (!vb) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	init_waitqueue_head(&vb->config_change);
> +	vb->vdev = vdev;
> +
> +	/* We use one virtqueue: inflate */
> +	err = vdev->config->find_vqs(vdev, 1, vq, &callback, &name);
> +	if (err)
> +		goto out_free_vb;
> +
> +	vb->inflate_vq = vq[0];
> +
> +	err = sysfs_create_group(&vdev->dev.kobj, &virtballoon_attr_group);
> +	if (err) {
> +		pr_err("Failed to create virtballoon sysfs node\n");
> +		goto out_free_vb;
> +	}
> +
> +	vb->last_scan_page_array = 0;
> +	vb->last_reclaim = 0;
> +	the_inode.vb = vb;
> +
> +	vb->thread = kthread_run(balloon, vb, "vballoon");
> +	if (IS_ERR(vb->thread)) {
> +		err = PTR_ERR(vb->thread);
> +		goto out_del_vqs;
> +	}
> +
> +	return 0;
> +
> +out_del_vqs:
> +	vdev->config->del_vqs(vdev);
> +out_free_vb:
> +	kfree(vb);
> +out:
> +	return err;
> +}
> +
> +static void __devexit virtballoon_remove(struct virtio_device *vdev)
> +{
> +	struct virtio_balloon *vb = vdev->priv;
> +
> +	kthread_stop(vb->thread);
> +
> +	sysfs_remove_group(&vdev->dev.kobj, &virtballoon_attr_group);
> +
> +	/* Now we reset the device so we can clean up the queues. */
> +	vdev->config->reset(vdev);
> +
> +	vdev->config->del_vqs(vdev);
> +	kfree(vb);
> +}
> +
> +static struct virtio_driver virtio_balloon_driver = {
> +	.feature_table		= NULL,
> +	.feature_table_size	= 0,
> +	.driver.name		= KBUILD_MODNAME,
> +	.driver.owner		= THIS_MODULE,
> +	.id_table		= id_table,
> +	.probe			= virtballoon_probe,
> +	.remove			= __devexit_p(virtballoon_remove),
> +	.config_changed		= virtballoon_changed,
> +};
> +
> +static int __init init(void)
> +{
> +	int err = register_filesystem(&balloon_fs_type);
> +	if (err)
> +		goto out;
> +
> +	balloon_mnt = kern_mount(&balloon_fs_type);
> +	if (IS_ERR(balloon_mnt)) {
> +		err = PTR_ERR(balloon_mnt);
> +		goto out_filesystem;
> +	}
> +
> +	err = register_virtio_driver(&virtio_balloon_driver);
> +	if (err)
> +		goto out_filesystem;
> +
> +	goto out;
> +
> +out_filesystem:
> +	unregister_filesystem(&balloon_fs_type);
> +
> +out:
> +	return err;
> +}
> +
> +static void __exit fini(void)
> +{
> +	if (balloon_mnt) {
> +		unregister_filesystem(&balloon_fs_type);
> +		balloon_mnt = NULL;
> +	}
> +	unregister_virtio_driver(&virtio_balloon_driver);
> +}
> +module_init(init);
> +module_exit(fini);
> +
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio file (page cache-backed) balloon driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/virtio_balloon.h b/include/linux/virtio_balloon.h
> index 652dc8b..2be9a02 100644
> --- a/include/linux/virtio_balloon.h
> +++ b/include/linux/virtio_balloon.h
> @@ -41,6 +41,15 @@ struct virtio_balloon_config
>  	__le32 num_pages;
>  	/* Number of pages we've actually got in balloon. */
>  	__le32 actual;
> +#if defined(CONFIG_VIRTIO_FILEBALLOON) ||\
> +	defined(CONFIG_VIRTIO_FILEBALLOON_MODULE)
> +	/* Total pages on this system. */
> +	__le32 pages_total;
> +	/* Free pages on this system. */
> +	__le32 pages_free;
> +	/* If the device needs pages_total/pages_free updated. */
> +	__le32 need_stats;
> +#endif
>  };
>  
>  #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
> diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
> index 7529b85..2f081d7 100644
> --- a/include/linux/virtio_ids.h
> +++ b/include/linux/virtio_ids.h
> @@ -37,5 +37,6 @@
>  #define VIRTIO_ID_RPMSG		7 /* virtio remote processor messaging */
>  #define VIRTIO_ID_SCSI		8 /* virtio scsi */
>  #define VIRTIO_ID_9P		9 /* 9p virtio console */
> +#define VIRTIO_ID_FILE_BALLOON	10 /* virtio file-backed balloon */
>  
>  #endif /* _LINUX_VIRTIO_IDS_H */
> -- 
> 1.7.7.3

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10  9:05 ` Michael S. Tsirkin
@ 2012-09-10 17:37   ` Mike Waychison
  2012-09-10 18:04     ` Rik van Riel
  2012-09-10 19:59     ` Michael S. Tsirkin
  0 siblings, 2 replies; 30+ messages in thread
From: Mike Waychison @ 2012-09-10 17:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Frank Swiderski, Rusty Russell, Rik van Riel, Andrea Arcangeli,
	virtualization, linux-kernel, kvm

On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> This implementation of a virtio balloon driver uses the page cache to
>> "store" pages that have been released to the host.  The communication
>> (outside of target counts) is one way--the guest notifies the host when
>> it adds a page to the page cache, allowing the host to madvise(2) with
>> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> (via the regular page reclaim).  This means that inflating the balloon
>> is similar to the existing balloon mechanism, but the deflate is
>> different--it re-uses existing Linux kernel functionality to
>> automatically reclaim.
>>
>> Signed-off-by: Frank Swiderski <fes@google.com>

Hi Michael,

I'm very sorry that Frank and I have been silent on these threads.
I've been out of the office and Frank has been been swamped :)

I'll take a stab at answering some of your questions below, and
hopefully we can end up on the same page.

> I've been trying to understand this, and I have
> a question: what exactly is the benefit
> of this new device?

The key difference between this device/driver and the pre-existing
virtio_balloon device/driver is in how the memory pressure loop is
controlled.

With the pre-existing balloon device/driver, the control loop for how
much memory a given VM is allowed to use is controlled completely by
the host.  This is probably fine if the goal is to pack as much work
on a given host as possible, but it says nothing about the expected
performance that any given VM is expecting to have.  Specifically, it
allows the host to set a target goal for the size of a VM, and the
driver in the guest does whatever is needed to get to that goal.  This
is great for systems where one wants to "grow or shrink" a VM from the
outside.


This behaviour however doesn't match what applications actually expect
from a memory control loop however.  In a native setup, an application
can usually expect to allocate memory from the kernel on an as-needed
basis, and can in turn return memory back to the system (using a heap
implementation that actually releases memory that is).  The dynamic
size of an application is completely controlled by the application,
and there is very little that cluster management software can do to
ensure that the application fits some prescribed size.

We recognized this in the development of our cluster management
software long ago, so our systems are designed for managing tasks that
have a dynamic memory footprint.  Overcommit is possible (as most
applications do not use the full reservation of memory they asked for
originally), letting us do things like schedule lower priority/lower
service-classification work using resources that are otherwise
available in stand-by for high-priority/low-latency workloads.

>
> Note that users could not care less about how a driver
> is implemented internally.
>
> Is there some workload where you see VM working better with
> this than regular balloon? Any numbers?

This device is less about performance as it is about getting the
memory size of a job (or in this case, a job in a VM) to grow and
shrink as the application workload sees fit, much like how processes
today can grow and shrink without external direction.

>
> Also, can't we just replace existing balloon implementation
> with this one?

Perhaps, but as described above, both devices have very different
characteristics.

> Why it is so important to deflate silently?

It may not be so important to deflate silently.  I'm not sure why it
is important that we deflate "loudly" though either :)  Doing so seems
like unnecessary guest/host communication IMO, especially if the guest
is expecting to be able to grow to totalram (and the host isn't able
to nack any pages reclaimed anyway...).

> I guess filesystem does not currently get a callback
> before page is reclaimed but this isan implementation detail -
> maybe this can be fixed?

I do not follow this question.

>
> Also can you pls answer Avi's question?
> How is overcommit managed?

Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.

>
>
>> ---
>>  drivers/virtio/Kconfig              |   13 +
>>  drivers/virtio/Makefile             |    1 +
>>  drivers/virtio/virtio_fileballoon.c |  636 +++++++++++++++++++++++++++++++++++
>>  include/linux/virtio_balloon.h      |    9 +
>>  include/linux/virtio_ids.h          |    1 +
>>  5 files changed, 660 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/virtio/virtio_fileballoon.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index f38b17a..cffa2a7 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>>
>>        If unsure, say M.
>>
>> +config VIRTIO_FILEBALLOON
>> +     tristate "Virtio page cache-backed balloon driver"
>> +     select VIRTIO
>> +     select VIRTIO_RING
>> +     ---help---
>> +      This driver supports decreasing and automatically reclaiming the
>> +      memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
>> +      tries to maintain a specific target balloon size using the page cache.
>> +      This allows the guest to implicitly deflate the balloon by flushing
>> +      pages from the cache and touching the page.
>> +
>> +      If unsure, say N.
>> +
>>   config VIRTIO_MMIO
>>       tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
>>       depends on HAS_IOMEM && EXPERIMENTAL
>> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
>> index 5a4c63c..7ca0a3f 100644
>> --- a/drivers/virtio/Makefile
>> +++ b/drivers/virtio/Makefile
>> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>> +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
>> diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c
>> new file mode 100644
>> index 0000000..ff252ec
>> --- /dev/null
>> +++ b/drivers/virtio/virtio_fileballoon.c
>> @@ -0,0 +1,636 @@
>> +/* Virtio file (page cache-backed) balloon implementation, inspired by
>> + * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's
>> + * implementation.
>> + *
>> + * This implementation of the virtio balloon driver re-uses the page cache to
>> + * allow memory consumed by inflating the balloon to be reclaimed by linux.  It
>> + * creates and mounts a bare-bones filesystem containing a single inode.  When
>> + * the host requests the balloon to inflate, it does so by "reading" pages at
>> + * offsets into the inode mapping's page_tree.  The host is notified when the
>> + * pages are added to the page_tree, allowing it (the host) to madvise(2) the
>> + * corresponding host memory, reducing the RSS of the virtual machine.  In this
>> + * implementation, the host is only notified when a page is added to the
>> + * balloon.  Reclaim happens under the existing TTFP logic, which flushes unused
>> + * pages in the page cache.  If the host used MADV_DONTNEED, then when the guest
>> + * uses the page, the zero page will be mapped in, allowing automatic (and fast,
>> + * compared to requiring a host notification via a virtio queue to get memory
>> + * back) reclaim.
>> + *
>> + *  Copyright 2008 Rusty Russell IBM Corporation
>> + *  Copyright 2011 Frank Swiderski Google Inc
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License, or
>> + *  (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, write to the Free Software
>> + *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> + */
>> +#include <linux/backing-dev.h>
>> +#include <linux/delay.h>
>> +#include <linux/file.h>
>> +#include <linux/freezer.h>
>> +#include <linux/fs.h>
>> +#include <linux/jiffies.h>
>> +#include <linux/kthread.h>
>> +#include <linux/module.h>
>> +#include <linux/mount.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/slab.h>
>> +#include <linux/swap.h>
>> +#include <linux/virtio.h>
>> +#include <linux/virtio_balloon.h>
>> +#include <linux/writeback.h>
>> +
>> +#define VIRTBALLOON_PFN_ARRAY_SIZE 256
>> +
>> +struct virtio_balloon {
>> +     struct virtio_device *vdev;
>> +     struct virtqueue *inflate_vq;
>> +
>> +     /* Where the ballooning thread waits for config to change. */
>> +     wait_queue_head_t config_change;
>> +
>> +     /* The thread servicing the balloon. */
>> +     struct task_struct *thread;
>> +
>> +     /* Waiting for host to ack the pages we released. */
>> +     struct completion acked;
>> +
>> +     /* The array of pfns we tell the Host about. */
>> +     unsigned int num_pfns;
>> +     u32 pfns[VIRTBALLOON_PFN_ARRAY_SIZE];
>> +
>> +     struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
>> +
>> +     /* The last page offset read into the mapping's page_tree */
>> +     unsigned long last_scan_page_array;
>> +
>> +     /* The last time a page was reclaimed */
>> +     unsigned long last_reclaim;
>> +};
>> +
>> +/* Magic number used for the skeleton filesystem in the call to mount_pseudo */
>> +#define BALLOONFS_MAGIC 0x42414c4c
>> +
>> +static struct virtio_device_id id_table[] = {
>> +     { VIRTIO_ID_FILE_BALLOON, VIRTIO_DEV_ANY_ID },
>> +     { 0 },
>> +};
>> +
>> +/*
>> + * The skeleton filesystem contains a single inode, held by the structure below.
>> + * Using the containing structure below allows easy access to the struct
>> + * virtio_balloon.
>> + */
>> +static struct balloon_inode {
>> +     struct inode inode;
>> +     struct virtio_balloon *vb;
>> +} the_inode;
>> +
>> +/*
>> + * balloon_alloc_inode is called when the single inode for the skeleton
>> + * filesystem is created in init() with the call to new_inode.
>> + */
>> +static struct inode *balloon_alloc_inode(struct super_block *sb)
>> +{
>> +     static bool already_inited;
>> +     /* We should only ever be called once! */
>> +     BUG_ON(already_inited);
>> +     already_inited = true;
>> +     inode_init_once(&the_inode.inode);
>> +     return &the_inode.inode;
>> +}
>> +
>> +/* Noop implementation of destroy_inode.  */
>> +static void balloon_destroy_inode(struct inode *inode)
>> +{
>> +}
>> +
>> +static int balloon_sync_fs(struct super_block *sb, int wait)
>> +{
>> +     return filemap_write_and_wait(the_inode.inode.i_mapping);
>> +}
>> +
>> +static const struct super_operations balloonfs_ops = {
>> +     .alloc_inode    = balloon_alloc_inode,
>> +     .destroy_inode  = balloon_destroy_inode,
>> +     .sync_fs        = balloon_sync_fs,
>> +};
>> +
>> +static const struct dentry_operations balloonfs_dentry_operations = {
>> +};
>> +
>> +/*
>> + * balloonfs_writepage is called when linux needs to reclaim memory held using
>> + * the balloonfs' page cache.exactlyexactlyexactly
>> + */
>> +static int balloonfs_writepage(struct page *page, struct writeback_control *wbc)
>> +{
>> +     the_inode.vb->last_reclaim = jiffies;
>> +     SetPageUptodate(page);
>> +     ClearPageDirty(page);
>> +     /*
>> +      * If the page isn't being flushed from the page allocator, go ahead and
>> +      * drop it from the page cache anyway.
>> +      */
>> +     if (!wbc->for_reclaim)
>> +             delete_from_page_cache(page);
>> +     unlock_page(page);
>> +     return 0;
>> +}
>> +
>> +/* Nearly no-op implementation of readpage */
>> +static int balloonfs_readpage(struct file *file, struct page *page)
>> +{
>> +     SetPageUptodate(page);
>> +     unlock_page(page);
>> +     return 0;
>> +}
>> +
>> +static const struct address_space_operations balloonfs_aops = {
>> +     .writepage      = balloonfs_writepage,
>> +     .readpage       = balloonfs_readpage
>> +};
>> +
>> +static struct backing_dev_info balloonfs_backing_dev_info = {
>> +     .name           = "balloonfs",
>> +     .ra_pages       = 0,
>> +     .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK
>> +};
>> +
>> +static struct dentry *balloonfs_mount(struct file_system_type *fs_type,
>> +                      int flags, const char *dev_name, void *data)
>> +{
>> +     struct dentry *root;
>> +     struct inode *inode;
>> +     root = mount_pseudo(fs_type, "balloon:", &balloonfs_ops,
>> +                         &balloonfs_dentry_operations, BALLOONFS_MAGIC);
>> +     inode = root->d_inode;
>> +     inode->i_mapping->a_ops = &balloonfs_aops;
>> +     mapping_set_gfp_mask(inode->i_mapping,
>> +                          (GFP_HIGHUSER | __GFP_NOMEMALLOC));
>> +     inode->i_mapping->backing_dev_info = &balloonfs_backing_dev_info;
>> +     return root;
>> +}
>> +
>> +/* The single mounted skeleton filesystem */
>> +static struct vfsmount *balloon_mnt __read_mostly;
>> +
>> +static struct file_system_type balloon_fs_type = {
>> +     .name =         "balloonfs",
>> +     .mount =        balloonfs_mount,
>> +     .kill_sb =      kill_anon_super,
>> +};
>> +
>> +/* Acknowledges a message from the specified virtqueue. */
>> +static void balloon_ack(struct virtqueue *vq)
>> +{
>> +     struct virtio_balloon *vb;
>> +     unsigned int len;
>> +
>> +     vb = virtqueue_get_buf(vq, &len);
>> +     if (vb)
>> +             complete(&vb->acked);
>> +}
>> +
>> +/*
>> + * Scans the page_tree for the inode's mapping, looking for an offset that is
>> + * currently empty, returning that index (or 0 if it could not fill the
>> + * request).
>> + */
>> +static unsigned long find_available_inode_page(struct virtio_balloon *vb)exactlyexactly
>> +{
>> +     unsigned long radix_index, index, max_scan;
>> +     struct address_space *mapping = the_inode.inode.i_mapping;
>> +
>> +     /*
>> +      * This function is a serialized call (only happens on the free-to-host
>> +      * thread), so no locking is necessary here.
>> +      */
>> +     index = vb->last_scan_page_array;
>> +     max_scan = totalram_pages - vb->last_scan_page_array;
>> +
>> +     /*
>> +      * Scan starting at the last scanned offset, then wrap around if
>> +      * necessary.
>> +      */
>> +     if (index == 0)
>> +             index = 1;
>> +     rcu_read_lock();
>> +     radix_index = radix_tree_next_hole(&mapping->page_tree,
>> +                                        index, max_scan);
>> +     rcu_read_unlock();
>> +     /*
>> +      * If we hit the end of the tree, wrap and search up to the original
>> +      * index.
>> +      */
>> +     if (radix_index - index >= max_scan) {
>> +             if (index != 1) {
>> +                     rcu_read_lock();
>> +                     radix_index = radix_tree_next_hole(&mapping->page_tree,
>> +                                                        1, index);
>> +                     rcu_read_unlock();
>> +                     if (radix_index - 1 >= index)
>> +                             radix_index = 0;
>> +             } else {
>> +                     radix_index = 0;
>> +             }
>> +     }
>> +     vb->last_scan_page_array = radix_index;
>> +
>> +     return radix_index;
>> +}
>> +
>> +/* Notifies the host of pages in the specified virtqueue. */
>> +static int tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>> +{
>> +     int err;
>> +     struct scatterlist sg;
>> +
>> +     sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>> +
>> +     init_completion(&vb->acked);
>> +
>> +     /* We should always be able to add one buffer to an empty queue. */
>> +     err = virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL);
>> +     if (err  < 0)
>> +             return err;
>> +     virtqueue_kick(vq);
>> +
>> +     /* When host has read buffer, this completes via balloon_ack */
>> +     wait_for_completion(&vb->acked);
>> +     return err;
>> +}
>> +
>> +static void fill_balloon(struct virtio_balloon *vb, size_t num)
>> +{
>> +     int err;
>> +
>> +     /* We can only do one array worth at a time. */
>> +     num = min(num, ARRAY_SIZE(vb->pfns));
>> +
>> +     for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
>> +             struct page *page;
>> +             unsigned long inode_pfn = find_available_inode_page(vb);
>> +             /* Should always be able to find a page. */
>> +             BUG_ON(!inode_pfn);
>> +             page = read_mapping_page(the_inode.inode.i_mapping, inode_pfn,
>> +                                      NULL);
>> +             if (IS_ERR(page)) {
>> +                     if (printk_ratelimit())
>> +                             dev_printk(KERN_INFO, &vb->vdev->dev,
>> +                                        "Out of puff! Can't get %zu pages\n",
>> +                                        num);
>> +                     break;
>> +             }
>> +
>> +             /* Set the page to be dirty */
>> +             set_page_dirty(page);
>> +
>> +             vb->pfns[vb->num_pfns] = page_to_pfn(page);
>> +     }
>> +
>> +     /* Didn't get any?  Oh well. */
>> +     if (vb->num_pfns == 0)
>> +             return;
>> +
>> +     /* Notify the host of the pages we just added to the page_tree. */
>> +     err = tell_host(vb, vb->inflate_vq);
>> +
>> +     for (; vb->num_pfns != 0; vb->num_pfns--) {
>> +             struct page *page = pfn_to_page(vb->pfns[vb->num_pfns - 1]);
>> +             /*
>> +              * Release our refcount on the page so that it can be reclaimed
>> +              * when necessary.
>> +              */
>> +             page_cache_release(page);
>> +     }
>> +     __mark_inode_dirty(&the_inode.inode, I_DIRTY_PAGES);
>> +}
>> +
>> +static inline void update_stat(struct virtio_balloon *vb, int idx,
>> +                            u64 val)
>> +{
>> +     BUG_ON(idx >= VIRTIO_BALLOON_S_NR);
>> +     vb->stats[idx].tag = idx;
>> +     vb->stats[idx].val = val;
>> +}
>> +
>> +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
>> +
>> +static inline u32 config_pages(struct virtio_balloon *vb);
>> +static void update_balloon_stats(struct virtio_balloon *vb)
>> +{
>> +     unsigned long events[NR_VM_EVENT_ITEMS];
>> +     struct sysinfo i;
>> +
>> +     all_vm_events(events);
>> +     si_meminfo(&i);
>> +
>> +     update_stat(vb, VIRTIO_BALLOON_S_SWAP_IN,
>> +                 pages_to_bytes(events[PSWPIN]));
>> +     update_stat(vb, VIRTIO_BALLOON_S_SWAP_OUT,
>> +                 pages_to_bytes(events[PSWPOUT]));
>> +     update_stat(vb, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
>> +     update_stat(vb, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
>> +
>> +     /* Total and Free Mem */
>> +     update_stat(vb, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram));
>> +     update_stat(vb, VIRTIO_BALLOON_S_MEMTOT, pages_to_bytes(i.totalram));
>> +}
>> +
>> +static void virtballoon_changed(struct virtio_device *vdev)
>> +{
>> +     struct virtio_balloon *vb = vdev->priv;
>> +
>> +     wake_up(&vb->config_change);
>> +}
>> +
>> +static inline bool config_need_stats(struct virtio_balloon *vb)
>> +{
>> +     u32 v = 0;
>> +
>> +     vb->vdev->config->get(vb->vdev,
>> +                           offsetof(struct virtio_balloon_config,
>> +                                    need_stats),
>> +                           &v, sizeof(v));
>> +     return (v != 0);
>> +}
>> +
>> +static inline u32 config_pages(struct virtio_balloon *vb)
>> +{
>> +     u32 v = 0;
>> +
>> +     vb->vdev->config->get(vb->vdev,
>> +                           offsetof(struct virtio_balloon_config, num_pages),
>> +                           &v, sizeof(v));
>> +     return v;
>> +}
>> +
>> +static inline s64 towards_target(struct virtio_balloon *vb)
>> +{
>> +     struct address_space *mapping = the_inode.inode.i_mapping;
>> +     u32 v = config_pages(vb);
>> +
>> +     return (s64)v - (mapping ? mapping->nrpages : 0);
>> +}
>> +
>> +static void update_balloon_size(struct virtio_balloon *vb)
>> +{
>> +     struct address_space *mapping = the_inode.inode.i_mapping;
>> +     __le32 actual = cpu_to_le32((mapping ? mapping->nrpages : 0));
>> +
>> +     vb->vdev->config->set(vb->vdev,
>> +                           offsetof(struct virtio_balloon_config, actual),
>> +                           &actual, sizeof(actual));
>> +}
>> +
>> +static void update_free_and_total(struct virtio_balloon *vb)
>> +{
>> +     struct sysinfo i;
>> +     u32 value;
>> +
>> +     si_meminfo(&i);
>> +
>> +     update_balloon_stats(vb);
>> +     value = i.totalram;
>> +     vb->vdev->config->set(vb->vdev,
>> +                           offsetof(struct virtio_balloon_config,
>> +                                    pages_total),
>> +                           &value, sizeof(value));
>> +     value = i.freeram;
>> +     vb->vdev->config->set(vb->vdev,
>> +                           offsetof(struct virtio_balloon_config,
>> +                                    pages_free),
>> +                           &value, sizeof(value));
>> +     value = 0;
>> +     vb->vdev->config->set(vb->vdev,
>> +                           offsetof(struct virtio_balloon_config,
>> +                                    need_stats),
>> +                           &value, sizeof(value));
>> +}
>> +
>> +static int balloon(void *_vballoon)
>> +{
>> +     struct virtio_balloon *vb = _vballoon;
>> +
>> +     set_freezable();
>> +     while (!kthread_should_stop()) {
>> +             s64 diff;
>> +             try_to_freeze();
>> +             wait_event_interruptible(vb->config_change,
>> +                                      (diff = towards_target(vb)) > 0
>> +                                      || config_need_stats(vb)
>> +                                      || kthread_should_stop()
>> +                                      || freezing(current));
>> +             if (config_need_stats(vb))
>> +                     update_free_and_total(vb);
>> +             if (diff > 0) {
>> +                     unsigned long reclaim_time = vb->last_reclaim + 2 * HZ;
>> +                     /*
>> +                      * Don't fill the balloon if a page reclaim happened in
>> +                      * the past 2 seconds.
>> +                      */
>> +                     if (time_after_eq(reclaim_time, jiffies)) {
>> +                             /* Inflating too fast--sleep and skip. */
>> +                             msleep(500);
>> +                     } else {
>> +                             fill_balloon(vb, diff);
>> +                     }
>> +             } else if (diff < 0 && config_pages(vb) == 0) {
>> +                     /*
>> +                      * Here we are specifically looking to detect the case
>> +                      * where there are pages in the page cache, but the
>> +                      * device wants us to go to 0.  This is used in save/
>> +                      * restore since the host device doesn't keep track of
>> +                      * PFNs, and must flush the page cache on restore
>> +                      * (which loses the context of the original device
>> +                      * instance).  However, we still suggest syncing the
>> +                      * diff so that we can get within the target range.
>> +                      */
>> +                     s64 nr_to_write =
>> +                             (!config_pages(vb) ? LONG_MAX : -diff);
>> +                     struct writeback_control wbc = {
>> +                             .sync_mode = WB_SYNC_ALL,
>> +                             .nr_to_write = nr_to_write,
>> +                             .range_start = 0,
>> +                             .range_end = LLONG_MAX,
>> +                     };
>> +                     sync_inode(&the_inode.inode, &wbc);
>> +             }
>> +             update_balloon_size(vb);
>> +     }
>> +     return 0;
>> +}
>> +
>> +static ssize_t virtballoon_attr_show(struct device *dev,
>> +                                  struct device_attribute *attr,
>> +                                  char *buf);
>> +
>> +static DEVICE_ATTR(total_memory, 0644,
>> +     virtballoon_attr_show, NULL);
>> +
>> +static DEVICE_ATTR(free_memory, 0644,
>> +     virtballoon_attr_show, NULL);
>> +
>> +static DEVICE_ATTR(target_pages, 0644,
>> +     virtballoon_attr_show, NULL);
>> +
>> +static DEVICE_ATTR(actual_pages, 0644,
>> +     virtballoon_attr_show, NULL);
>> +
>> +static struct attribute *virtballoon_attrs[] = {
>> +     &dev_attr_total_memory.attr,
>> +     &dev_attr_free_memory.attr,
>> +     &dev_attr_target_pages.attr,
>> +     &dev_attr_actual_pages.attr,
>> +     NULL
>> +};
>> +static struct attribute_group virtballoon_attr_group = {
>> +     .name   = "virtballoon",
>> +     .attrs  = virtballoon_attrs,
>> +};
>> +
>> +static ssize_t virtballoon_attr_show(struct device *dev,
>> +                                  struct device_attribute *attr,
>> +                                  char *buf)
>> +{
>> +     struct address_space *mapping = the_inode.inode.i_mapping;
>> +     struct virtio_device *vdev = container_of(dev, struct virtio_device,
>> +                                               dev);
>> +     struct virtio_balloon *vb = vdev->priv;
>> +     unsigned long long value = 0;
>> +     if (attr == &dev_attr_total_memory)
>> +             value = vb->stats[VIRTIO_BALLOON_S_MEMTOT].val;
>> +     else if (attr == &dev_attr_free_memory)
>> +             value = vb->stats[VIRTIO_BALLOON_S_MEMFREE].val;
>> +     else if (attr == &dev_attr_target_pages)
>> +             value = config_pages(vb);
>> +     else if (attr == &dev_attr_actual_pages)
>> +             value = cpu_to_le32((mapping ? mapping->nrpages : 0));
>> +     return sprintf(buf, "%llu\n", value);
>> +}
>> +
>> +static int virtballoon_probe(struct virtio_device *vdev)
>> +{
>> +     struct virtio_balloon *vb;
>> +     struct virtqueue *vq[1];
>> +     vq_callback_t *callback = balloon_ack;
>> +     const char *name = "inflate";
>> +     int err;
>> +
>> +     vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
>> +     if (!vb) {
>> +             err = -ENOMEM;
>> +             goto out;
>> +     }
>> +
>> +     init_waitqueue_head(&vb->config_change);
>> +     vb->vdev = vdev;
>> +
>> +     /* We use one virtqueue: inflate */
>> +     err = vdev->config->find_vqs(vdev, 1, vq, &callback, &name);
>> +     if (err)
>> +             goto out_free_vb;
>> +
>> +     vb->inflate_vq = vq[0];
>> +
>> +     err = sysfs_create_group(&vdev->dev.kobj, &virtballoon_attr_group);
>> +     if (err) {
>> +             pr_err("Failed to create virtballoon sysfs node\n");
>> +             goto out_free_vb;
>> +     }
>> +
>> +     vb->last_scan_page_array = 0;
>> +     vb->last_reclaim = 0;
>> +     the_inode.vb = vb;
>> +
>> +     vb->thread = kthread_run(balloon, vb, "vballoon");
>> +     if (IS_ERR(vb->thread)) {
>> +             err = PTR_ERR(vb->thread);
>> +             goto out_del_vqs;
>> +     }
>> +
>> +     return 0;
>> +
>> +out_del_vqs:
>> +     vdev->config->del_vqs(vdev);
>> +out_free_vb:
>> +     kfree(vb);
>> +out:
>> +     return err;
>> +}
>> +
>> +static void __devexit virtballoon_remove(struct virtio_device *vdev)
>> +{
>> +     struct virtio_balloon *vb = vdev->priv;
>> +
>> +     kthread_stop(vb->thread);
>> +
>> +     sysfs_remove_group(&vdev->dev.kobj, &virtballoon_attr_group);
>> +
>> +     /* Now we reset the device so we can clean up the queues. */
>> +     vdev->config->reset(vdev);
>> +
>> +     vdev->config->del_vqs(vdev);
>> +     kfree(vb);
>> +}
>> +
>> +static struct virtio_driver virtio_balloon_driver = {
>> +     .feature_table          = NULL,
>> +     .feature_table_size     = 0,
>> +     .driver.name            = KBUILD_MODNAME,
>> +     .driver.owner           = THIS_MODULE,
>> +     .id_table               = id_table,
>> +     .probe                  = virtballoon_probe,
>> +     .remove                 = __devexit_p(virtballoon_remove),
>> +     .config_changed         = virtballoon_changed,
>> +};
>> +
>> +static int __init init(void)
>> +{
>> +     int err = register_filesystem(&balloon_fs_type);
>> +     if (err)
>> +             goto out;
>> +
>> +     balloon_mnt = kern_mount(&balloon_fs_type);
>> +     if (IS_ERR(balloon_mnt)) {
>> +             err = PTR_ERR(balloon_mnt);
>> +             goto out_filesystem;
>> +     }
>> +
>> +     err = register_virtio_driver(&virtio_balloon_driver);
>> +     if (err)
>> +             goto out_filesystem;
>> +
>> +     goto out;
>> +
>> +out_filesystem:
>> +     unregister_filesystem(&balloon_fs_type);
>> +
>> +out:
>> +     return err;
>> +}
>> +
>> +static void __exit fini(void)
>> +{
>> +     if (balloon_mnt) {
>> +             unregister_filesystem(&balloon_fs_type);
>> +             balloon_mnt = NULL;
>> +     }
>> +     unregister_virtio_driver(&virtio_balloon_driver);
>> +}
>> +module_init(init);
>> +module_exit(fini);
>> +
>> +MODULE_DEVICE_TABLE(virtio, id_table);
>> +MODULE_DESCRIPTION("Virtio file (page cache-backed) balloon driver");
>> +MODULE_LICENSE("GPL");
>> diff --git a/include/linux/virtio_balloon.h b/include/linux/virtio_balloon.h
>> index 652dc8b..2be9a02 100644
>> --- a/include/linux/virtio_balloon.h
>> +++ b/include/linux/virtio_balloon.h
>> @@ -41,6 +41,15 @@ struct virtio_balloon_config
>>       __le32 num_pages;
>>       /* Number of pages we've actually got in balloon. */
>>       __le32 actual;
>> +#if defined(CONFIG_VIRTIO_FILEBALLOON) ||\
>> +     defined(CONFIG_VIRTIO_FILEBALLOON_MODULE)
>> +     /* Total pages on this system. */
>> +     __le32 pages_total;
>> +     /* Free pages on this system. */
>> +     __le32 pages_free;
>> +     /* If the device needs pages_total/pages_free updated. */
>> +     __le32 need_stats;
>> +#endif
>>  };
>>
>>  #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
>> diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
>> index 7529b85..2f081d7 100644
>> --- a/include/linux/virtio_ids.h
>> +++ b/include/linux/virtio_ids.h
>> @@ -37,5 +37,6 @@
>>  #define VIRTIO_ID_RPMSG              7 /* virtio remote processor messaging */
>>  #define VIRTIO_ID_SCSI               8 /* virtio scsi */
>>  #define VIRTIO_ID_9P         9 /* 9p virtio console */
>> +#define VIRTIO_ID_FILE_BALLOON       10 /* virtio file-backed balloon */
>>
>>  #endif /* _LINUX_VIRTIO_IDS_H */
>> --
>> 1.7.7.3

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10 17:37   ` Mike Waychison
@ 2012-09-10 18:04     ` Rik van Riel
  2012-09-10 18:29       ` Mike Waychison
  2012-09-10 19:59     ` Michael S. Tsirkin
  1 sibling, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2012-09-10 18:04 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Michael S. Tsirkin, Frank Swiderski, Rusty Russell,
	Andrea Arcangeli, virtualization, linux-kernel, kvm

On 09/10/2012 01:37 PM, Mike Waychison wrote:
> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:

>> Also can you pls answer Avi's question?
>> How is overcommit managed?
>
> Overcommit in our deployments is managed using memory cgroups on the
> host.  This allows us to have very directed policies as to how
> competing VMs on a host may overcommit.

How do your memory cgroups lead to guests inflating their balloons?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10 18:04     ` Rik van Riel
@ 2012-09-10 18:29       ` Mike Waychison
  0 siblings, 0 replies; 30+ messages in thread
From: Mike Waychison @ 2012-09-10 18:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michael S. Tsirkin, Frank Swiderski, Rusty Russell,
	Andrea Arcangeli, virtualization, linux-kernel, kvm

On Mon, Sep 10, 2012 at 2:04 PM, Rik van Riel <riel@redhat.com> wrote:
> On 09/10/2012 01:37 PM, Mike Waychison wrote:
>>
>> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin <mst@redhat.com>
>> wrote:
>
>
>>> Also can you pls answer Avi's question?
>>> How is overcommit managed?
>>
>>
>> Overcommit in our deployments is managed using memory cgroups on the
>> host.  This allows us to have very directed policies as to how
>> competing VMs on a host may overcommit.
>
>
> How do your memory cgroups lead to guests inflating their balloons?

The control loop that is driving the cgroup on the host can still move
the balloon target page count causing the balloon in the guest to try
and inflate.  This allows the host to effectively slowly grow the
balloon in the guest, allowing reclaim of guest free memory, followed
by guest page cache (and memory on the host system).  This can then be
compared with the subsequent growth (as this balloon setup allows the
guest to grow as it sees fit), which in effect gives us a memory
pressure indicator on the host, allowing it to back-off shrinking the
guest if the guest balloon quickly deflates.

The net effect is an opportunistic release of memory from the guest
back to the host, and the ability to quickly grow a VM's memory
footprint as the workload within it requires.

This dynamic memory sizing of VMs is much more in line with what we
can expect from native tasks today (and which is what our resource
management systems are designed to handle).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10 17:37   ` Mike Waychison
  2012-09-10 18:04     ` Rik van Riel
@ 2012-09-10 19:59     ` Michael S. Tsirkin
  2012-09-10 20:49       ` Mike Waychison
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-09-10 19:59 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Frank Swiderski, Rusty Russell, Rik van Riel, Andrea Arcangeli,
	virtualization, linux-kernel, kvm

On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> >> This implementation of a virtio balloon driver uses the page cache to
> >> "store" pages that have been released to the host.  The communication
> >> (outside of target counts) is one way--the guest notifies the host when
> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> (via the regular page reclaim).  This means that inflating the balloon
> >> is similar to the existing balloon mechanism, but the deflate is
> >> different--it re-uses existing Linux kernel functionality to
> >> automatically reclaim.
> >>
> >> Signed-off-by: Frank Swiderski <fes@google.com>
> 
> Hi Michael,
> 
> I'm very sorry that Frank and I have been silent on these threads.
> I've been out of the office and Frank has been been swamped :)
> 
> I'll take a stab at answering some of your questions below, and
> hopefully we can end up on the same page.
> 
> > I've been trying to understand this, and I have
> > a question: what exactly is the benefit
> > of this new device?
> 
> The key difference between this device/driver and the pre-existing
> virtio_balloon device/driver is in how the memory pressure loop is
> controlled.
> 
> With the pre-existing balloon device/driver, the control loop for how
> much memory a given VM is allowed to use is controlled completely by
> the host.  This is probably fine if the goal is to pack as much work
> on a given host as possible, but it says nothing about the expected
> performance that any given VM is expecting to have.  Specifically, it
> allows the host to set a target goal for the size of a VM, and the
> driver in the guest does whatever is needed to get to that goal.  This
> is great for systems where one wants to "grow or shrink" a VM from the
> outside.
> 
> 
> This behaviour however doesn't match what applications actually expect
> from a memory control loop however.  In a native setup, an application
> can usually expect to allocate memory from the kernel on an as-needed
> basis, and can in turn return memory back to the system (using a heap
> implementation that actually releases memory that is).  The dynamic
> size of an application is completely controlled by the application,
> and there is very little that cluster management software can do to
> ensure that the application fits some prescribed size.
> 
> We recognized this in the development of our cluster management
> software long ago, so our systems are designed for managing tasks that
> have a dynamic memory footprint.  Overcommit is possible (as most
> applications do not use the full reservation of memory they asked for
> originally), letting us do things like schedule lower priority/lower
> service-classification work using resources that are otherwise
> available in stand-by for high-priority/low-latency workloads.

OK I am not sure I got this right so pls tell me if this summary is
correct (note: this does not talk about what guest does with memory, 
ust what it is that device does):

- existing balloon is told lower limit on target size by host and pulls in at least
  target size. Guest can inflate > target size if it likes
  and then it is OK to deflate back to target size but not less.
- your balloon is told upper limit on target size by host and pulls at most
  target size. Guest can deflate down to 0 at any point.

If so I think both approaches make sense and in fact they
can be useful at the same time for the same guest.
In that case, I see two ways how this can be done:

1. two devices: existing ballon + cache balloon
2. add "upper limit" to existing ballon

A single device looks a bit more natural in that we don't
really care in which balloon a page is as long as we
are between lower and upper limit. Right?
>From implementation POV we could have it use
pagecache for pages above lower limit but that
is a separate question about driver design,
I would like to make sure I understand the high
level design first.





> >
> > Note that users could not care less about how a driver
> > is implemented internally.
> >
> > Is there some workload where you see VM working better with
> > this than regular balloon? Any numbers?
> 
> This device is less about performance as it is about getting the
> memory size of a job (or in this case, a job in a VM) to grow and
> shrink as the application workload sees fit, much like how processes
> today can grow and shrink without external direction.

Still, e.g. swap in host achieves more or less the same functionality.
I am guessing balloon can work better by getting more cooperation
from guest but aren't there any tests showing this is true in practice?


> >
> > Also, can't we just replace existing balloon implementation
> > with this one?
> 
> Perhaps, but as described above, both devices have very different
> characteristics.
> 
> > Why it is so important to deflate silently?
> 
> It may not be so important to deflate silently.  I'm not sure why it
> is important that we deflate "loudly" though either :)  Doing so seems
> like unnecessary guest/host communication IMO, especially if the guest
> is expecting to be able to grow to totalram (and the host isn't able
> to nack any pages reclaimed anyway...).

First, we could add nack easily enough :)
Second, access gets an exit anyway. If you tell
host first you can maybe batch these and actually speed things up.
It remains to be measured but historically we told host
so the onus of proof would be on whoever wants to remove this.

Third, see discussion on ML - we came up with
the idea of locking/unlocking balloon memory
which is useful for an assigned device.
Requires telling host first.

Also knowing how much memory there is in a balloon
would be useful for admin.

There could be other uses.

> > I guess filesystem does not currently get a callback
> > before page is reclaimed but this isan implementation detail -
> > maybe this can be fixed?
> 
> I do not follow this question.

Assume we want to tell host before use.
Can you implement this on top of your patch?

> >
> > Also can you pls answer Avi's question?
> > How is overcommit managed?
> 
> Overcommit in our deployments is managed using memory cgroups on the
> host.  This allows us to have very directed policies as to how
> competing VMs on a host may overcommit.

So you push VM out to swap if it's over allowed memory?
Existing balloon does this better as it is cooperative,
it seems.


> >
> >
> >> ---
> >>  drivers/virtio/Kconfig              |   13 +
> >>  drivers/virtio/Makefile             |    1 +
> >>  drivers/virtio/virtio_fileballoon.c |  636 +++++++++++++++++++++++++++++++++++
> >>  include/linux/virtio_balloon.h      |    9 +
> >>  include/linux/virtio_ids.h          |    1 +
> >>  5 files changed, 660 insertions(+), 0 deletions(-)
> >>  create mode 100644 drivers/virtio/virtio_fileballoon.c
> >>
> >> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> >> index f38b17a..cffa2a7 100644
> >> --- a/drivers/virtio/Kconfig
> >> +++ b/drivers/virtio/Kconfig
> >> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
> >>
> >>        If unsure, say M.
> >>
> >> +config VIRTIO_FILEBALLOON
> >> +     tristate "Virtio page cache-backed balloon driver"
> >> +     select VIRTIO
> >> +     select VIRTIO_RING
> >> +     ---help---
> >> +      This driver supports decreasing and automatically reclaiming the
> >> +      memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
> >> +      tries to maintain a specific target balloon size using the page cache.
> >> +      This allows the guest to implicitly deflate the balloon by flushing
> >> +      pages from the cache and touching the page.
> >> +
> >> +      If unsure, say N.
> >> +
> >>   config VIRTIO_MMIO
> >>       tristate "Platform bus driver for memory mapped virtio devices (EXPERIMENTAL)"
> >>       depends on HAS_IOMEM && EXPERIMENTAL
> >> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> >> index 5a4c63c..7ca0a3f 100644
> >> --- a/drivers/virtio/Makefile
> >> +++ b/drivers/virtio/Makefile
> >> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
> >>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
> >>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
> >>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> >> +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
> >> diff --git a/drivers/virtio/virtio_fileballoon.c b/drivers/virtio/virtio_fileballoon.c
> >> new file mode 100644
> >> index 0000000..ff252ec
> >> --- /dev/null
> >> +++ b/drivers/virtio/virtio_fileballoon.c
> >> @@ -0,0 +1,636 @@
> >> +/* Virtio file (page cache-backed) balloon implementation, inspired by
> >> + * Dor Loar and Marcelo Tosatti's implementations, and based on Rusty Russel's
> >> + * implementation.
> >> + *
> >> + * This implementation of the virtio balloon driver re-uses the page cache to
> >> + * allow memory consumed by inflating the balloon to be reclaimed by linux.  It
> >> + * creates and mounts a bare-bones filesystem containing a single inode.  When
> >> + * the host requests the balloon to inflate, it does so by "reading" pages at
> >> + * offsets into the inode mapping's page_tree.  The host is notified when the
> >> + * pages are added to the page_tree, allowing it (the host) to madvise(2) the
> >> + * corresponding host memory, reducing the RSS of the virtual machine.  In this
> >> + * implementation, the host is only notified when a page is added to the
> >> + * balloon.  Reclaim happens under the existing TTFP logic, which flushes unused
> >> + * pages in the page cache.  If the host used MADV_DONTNEED, then when the guest
> >> + * uses the page, the zero page will be mapped in, allowing automatic (and fast,
> >> + * compared to requiring a host notification via a virtio queue to get memory
> >> + * back) reclaim.
> >> + *
> >> + *  Copyright 2008 Rusty Russell IBM Corporation
> >> + *  Copyright 2011 Frank Swiderski Google Inc
> >> + *
> >> + *  This program is free software; you can redistribute it and/or modify
> >> + *  it under the terms of the GNU General Public License as published by
> >> + *  the Free Software Foundation; either version 2 of the License, or
> >> + *  (at your option) any later version.
> >> + *
> >> + *  This program is distributed in the hope that it will be useful,
> >> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + *  GNU General Public License for more details.
> >> + *
> >> + *  You should have received a copy of the GNU General Public License
> >> + *  along with this program; if not, write to the Free Software
> >> + *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> >> + */
> >> +#include <linux/backing-dev.h>
> >> +#include <linux/delay.h>
> >> +#include <linux/file.h>
> >> +#include <linux/freezer.h>
> >> +#include <linux/fs.h>
> >> +#include <linux/jiffies.h>
> >> +#include <linux/kthread.h>
> >> +#include <linux/module.h>
> >> +#include <linux/mount.h>
> >> +#include <linux/pagemap.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/swap.h>
> >> +#include <linux/virtio.h>
> >> +#include <linux/virtio_balloon.h>
> >> +#include <linux/writeback.h>
> >> +
> >> +#define VIRTBALLOON_PFN_ARRAY_SIZE 256
> >> +
> >> +struct virtio_balloon {
> >> +     struct virtio_device *vdev;
> >> +     struct virtqueue *inflate_vq;
> >> +
> >> +     /* Where the ballooning thread waits for config to change. */
> >> +     wait_queue_head_t config_change;
> >> +
> >> +     /* The thread servicing the balloon. */
> >> +     struct task_struct *thread;
> >> +
> >> +     /* Waiting for host to ack the pages we released. */
> >> +     struct completion acked;
> >> +
> >> +     /* The array of pfns we tell the Host about. */
> >> +     unsigned int num_pfns;
> >> +     u32 pfns[VIRTBALLOON_PFN_ARRAY_SIZE];
> >> +
> >> +     struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> >> +
> >> +     /* The last page offset read into the mapping's page_tree */
> >> +     unsigned long last_scan_page_array;
> >> +
> >> +     /* The last time a page was reclaimed */
> >> +     unsigned long last_reclaim;
> >> +};
> >> +
> >> +/* Magic number used for the skeleton filesystem in the call to mount_pseudo */
> >> +#define BALLOONFS_MAGIC 0x42414c4c
> >> +
> >> +static struct virtio_device_id id_table[] = {
> >> +     { VIRTIO_ID_FILE_BALLOON, VIRTIO_DEV_ANY_ID },
> >> +     { 0 },
> >> +};
> >> +
> >> +/*
> >> + * The skeleton filesystem contains a single inode, held by the structure below.
> >> + * Using the containing structure below allows easy access to the struct
> >> + * virtio_balloon.
> >> + */
> >> +static struct balloon_inode {
> >> +     struct inode inode;
> >> +     struct virtio_balloon *vb;
> >> +} the_inode;
> >> +
> >> +/*
> >> + * balloon_alloc_inode is called when the single inode for the skeleton
> >> + * filesystem is created in init() with the call to new_inode.
> >> + */
> >> +static struct inode *balloon_alloc_inode(struct super_block *sb)
> >> +{
> >> +     static bool already_inited;
> >> +     /* We should only ever be called once! */
> >> +     BUG_ON(already_inited);
> >> +     already_inited = true;
> >> +     inode_init_once(&the_inode.inode);
> >> +     return &the_inode.inode;
> >> +}
> >> +
> >> +/* Noop implementation of destroy_inode.  */
> >> +static void balloon_destroy_inode(struct inode *inode)
> >> +{
> >> +}
> >> +
> >> +static int balloon_sync_fs(struct super_block *sb, int wait)
> >> +{
> >> +     return filemap_write_and_wait(the_inode.inode.i_mapping);
> >> +}
> >> +
> >> +static const struct super_operations balloonfs_ops = {
> >> +     .alloc_inode    = balloon_alloc_inode,
> >> +     .destroy_inode  = balloon_destroy_inode,
> >> +     .sync_fs        = balloon_sync_fs,
> >> +};
> >> +
> >> +static const struct dentry_operations balloonfs_dentry_operations = {
> >> +};
> >> +
> >> +/*
> >> + * balloonfs_writepage is called when linux needs to reclaim memory held using
> >> + * the balloonfs' page cache.exactlyexactlyexactly
> >> + */
> >> +static int balloonfs_writepage(struct page *page, struct writeback_control *wbc)
> >> +{
> >> +     the_inode.vb->last_reclaim = jiffies;
> >> +     SetPageUptodate(page);
> >> +     ClearPageDirty(page);
> >> +     /*
> >> +      * If the page isn't being flushed from the page allocator, go ahead and
> >> +      * drop it from the page cache anyway.
> >> +      */
> >> +     if (!wbc->for_reclaim)
> >> +             delete_from_page_cache(page);
> >> +     unlock_page(page);
> >> +     return 0;
> >> +}
> >> +
> >> +/* Nearly no-op implementation of readpage */
> >> +static int balloonfs_readpage(struct file *file, struct page *page)
> >> +{
> >> +     SetPageUptodate(page);
> >> +     unlock_page(page);
> >> +     return 0;
> >> +}
> >> +
> >> +static const struct address_space_operations balloonfs_aops = {
> >> +     .writepage      = balloonfs_writepage,
> >> +     .readpage       = balloonfs_readpage
> >> +};
> >> +
> >> +static struct backing_dev_info balloonfs_backing_dev_info = {
> >> +     .name           = "balloonfs",
> >> +     .ra_pages       = 0,
> >> +     .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK
> >> +};
> >> +
> >> +static struct dentry *balloonfs_mount(struct file_system_type *fs_type,
> >> +                      int flags, const char *dev_name, void *data)
> >> +{
> >> +     struct dentry *root;
> >> +     struct inode *inode;
> >> +     root = mount_pseudo(fs_type, "balloon:", &balloonfs_ops,
> >> +                         &balloonfs_dentry_operations, BALLOONFS_MAGIC);
> >> +     inode = root->d_inode;
> >> +     inode->i_mapping->a_ops = &balloonfs_aops;
> >> +     mapping_set_gfp_mask(inode->i_mapping,
> >> +                          (GFP_HIGHUSER | __GFP_NOMEMALLOC));
> >> +     inode->i_mapping->backing_dev_info = &balloonfs_backing_dev_info;
> >> +     return root;
> >> +}
> >> +
> >> +/* The single mounted skeleton filesystem */
> >> +static struct vfsmount *balloon_mnt __read_mostly;
> >> +
> >> +static struct file_system_type balloon_fs_type = {
> >> +     .name =         "balloonfs",
> >> +     .mount =        balloonfs_mount,
> >> +     .kill_sb =      kill_anon_super,
> >> +};
> >> +
> >> +/* Acknowledges a message from the specified virtqueue. */
> >> +static void balloon_ack(struct virtqueue *vq)
> >> +{
> >> +     struct virtio_balloon *vb;
> >> +     unsigned int len;
> >> +
> >> +     vb = virtqueue_get_buf(vq, &len);
> >> +     if (vb)
> >> +             complete(&vb->acked);
> >> +}
> >> +
> >> +/*
> >> + * Scans the page_tree for the inode's mapping, looking for an offset that is
> >> + * currently empty, returning that index (or 0 if it could not fill the
> >> + * request).
> >> + */
> >> +static unsigned long find_available_inode_page(struct virtio_balloon *vb)exactlyexactly
> >> +{
> >> +     unsigned long radix_index, index, max_scan;
> >> +     struct address_space *mapping = the_inode.inode.i_mapping;
> >> +
> >> +     /*
> >> +      * This function is a serialized call (only happens on the free-to-host
> >> +      * thread), so no locking is necessary here.
> >> +      */
> >> +     index = vb->last_scan_page_array;
> >> +     max_scan = totalram_pages - vb->last_scan_page_array;
> >> +
> >> +     /*
> >> +      * Scan starting at the last scanned offset, then wrap around if
> >> +      * necessary.
> >> +      */
> >> +     if (index == 0)
> >> +             index = 1;
> >> +     rcu_read_lock();
> >> +     radix_index = radix_tree_next_hole(&mapping->page_tree,
> >> +                                        index, max_scan);
> >> +     rcu_read_unlock();
> >> +     /*
> >> +      * If we hit the end of the tree, wrap and search up to the original
> >> +      * index.
> >> +      */
> >> +     if (radix_index - index >= max_scan) {
> >> +             if (index != 1) {
> >> +                     rcu_read_lock();
> >> +                     radix_index = radix_tree_next_hole(&mapping->page_tree,
> >> +                                                        1, index);
> >> +                     rcu_read_unlock();
> >> +                     if (radix_index - 1 >= index)
> >> +                             radix_index = 0;
> >> +             } else {
> >> +                     radix_index = 0;
> >> +             }
> >> +     }
> >> +     vb->last_scan_page_array = radix_index;
> >> +
> >> +     return radix_index;
> >> +}
> >> +
> >> +/* Notifies the host of pages in the specified virtqueue. */
> >> +static int tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> >> +{
> >> +     int err;
> >> +     struct scatterlist sg;
> >> +
> >> +     sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> >> +
> >> +     init_completion(&vb->acked);
> >> +
> >> +     /* We should always be able to add one buffer to an empty queue. */
> >> +     err = virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL);
> >> +     if (err  < 0)
> >> +             return err;
> >> +     virtqueue_kick(vq);
> >> +
> >> +     /* When host has read buffer, this completes via balloon_ack */
> >> +     wait_for_completion(&vb->acked);
> >> +     return err;
> >> +}
> >> +
> >> +static void fill_balloon(struct virtio_balloon *vb, size_t num)
> >> +{
> >> +     int err;
> >> +
> >> +     /* We can only do one array worth at a time. */
> >> +     num = min(num, ARRAY_SIZE(vb->pfns));
> >> +
> >> +     for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
> >> +             struct page *page;
> >> +             unsigned long inode_pfn = find_available_inode_page(vb);
> >> +             /* Should always be able to find a page. */
> >> +             BUG_ON(!inode_pfn);
> >> +             page = read_mapping_page(the_inode.inode.i_mapping, inode_pfn,
> >> +                                      NULL);
> >> +             if (IS_ERR(page)) {
> >> +                     if (printk_ratelimit())
> >> +                             dev_printk(KERN_INFO, &vb->vdev->dev,
> >> +                                        "Out of puff! Can't get %zu pages\n",
> >> +                                        num);
> >> +                     break;
> >> +             }
> >> +
> >> +             /* Set the page to be dirty */
> >> +             set_page_dirty(page);
> >> +
> >> +             vb->pfns[vb->num_pfns] = page_to_pfn(page);
> >> +     }
> >> +
> >> +     /* Didn't get any?  Oh well. */
> >> +     if (vb->num_pfns == 0)
> >> +             return;
> >> +
> >> +     /* Notify the host of the pages we just added to the page_tree. */
> >> +     err = tell_host(vb, vb->inflate_vq);
> >> +
> >> +     for (; vb->num_pfns != 0; vb->num_pfns--) {
> >> +             struct page *page = pfn_to_page(vb->pfns[vb->num_pfns - 1]);
> >> +             /*
> >> +              * Release our refcount on the page so that it can be reclaimed
> >> +              * when necessary.
> >> +              */
> >> +             page_cache_release(page);
> >> +     }
> >> +     __mark_inode_dirty(&the_inode.inode, I_DIRTY_PAGES);
> >> +}
> >> +
> >> +static inline void update_stat(struct virtio_balloon *vb, int idx,
> >> +                            u64 val)
> >> +{
> >> +     BUG_ON(idx >= VIRTIO_BALLOON_S_NR);
> >> +     vb->stats[idx].tag = idx;
> >> +     vb->stats[idx].val = val;
> >> +}
> >> +
> >> +#define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
> >> +
> >> +static inline u32 config_pages(struct virtio_balloon *vb);
> >> +static void update_balloon_stats(struct virtio_balloon *vb)
> >> +{
> >> +     unsigned long events[NR_VM_EVENT_ITEMS];
> >> +     struct sysinfo i;
> >> +
> >> +     all_vm_events(events);
> >> +     si_meminfo(&i);
> >> +
> >> +     update_stat(vb, VIRTIO_BALLOON_S_SWAP_IN,
> >> +                 pages_to_bytes(events[PSWPIN]));
> >> +     update_stat(vb, VIRTIO_BALLOON_S_SWAP_OUT,
> >> +                 pages_to_bytes(events[PSWPOUT]));
> >> +     update_stat(vb, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
> >> +     update_stat(vb, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
> >> +
> >> +     /* Total and Free Mem */
> >> +     update_stat(vb, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram));
> >> +     update_stat(vb, VIRTIO_BALLOON_S_MEMTOT, pages_to_bytes(i.totalram));
> >> +}
> >> +
> >> +static void virtballoon_changed(struct virtio_device *vdev)
> >> +{
> >> +     struct virtio_balloon *vb = vdev->priv;
> >> +
> >> +     wake_up(&vb->config_change);
> >> +}
> >> +
> >> +static inline bool config_need_stats(struct virtio_balloon *vb)
> >> +{
> >> +     u32 v = 0;
> >> +
> >> +     vb->vdev->config->get(vb->vdev,
> >> +                           offsetof(struct virtio_balloon_config,
> >> +                                    need_stats),
> >> +                           &v, sizeof(v));
> >> +     return (v != 0);
> >> +}
> >> +
> >> +static inline u32 config_pages(struct virtio_balloon *vb)
> >> +{
> >> +     u32 v = 0;
> >> +
> >> +     vb->vdev->config->get(vb->vdev,
> >> +                           offsetof(struct virtio_balloon_config, num_pages),
> >> +                           &v, sizeof(v));
> >> +     return v;
> >> +}
> >> +
> >> +static inline s64 towards_target(struct virtio_balloon *vb)
> >> +{
> >> +     struct address_space *mapping = the_inode.inode.i_mapping;
> >> +     u32 v = config_pages(vb);
> >> +
> >> +     return (s64)v - (mapping ? mapping->nrpages : 0);
> >> +}
> >> +
> >> +static void update_balloon_size(struct virtio_balloon *vb)
> >> +{
> >> +     struct address_space *mapping = the_inode.inode.i_mapping;
> >> +     __le32 actual = cpu_to_le32((mapping ? mapping->nrpages : 0));
> >> +
> >> +     vb->vdev->config->set(vb->vdev,
> >> +                           offsetof(struct virtio_balloon_config, actual),
> >> +                           &actual, sizeof(actual));
> >> +}
> >> +
> >> +static void update_free_and_total(struct virtio_balloon *vb)
> >> +{
> >> +     struct sysinfo i;
> >> +     u32 value;
> >> +
> >> +     si_meminfo(&i);
> >> +
> >> +     update_balloon_stats(vb);
> >> +     value = i.totalram;
> >> +     vb->vdev->config->set(vb->vdev,
> >> +                           offsetof(struct virtio_balloon_config,
> >> +                                    pages_total),
> >> +                           &value, sizeof(value));
> >> +     value = i.freeram;
> >> +     vb->vdev->config->set(vb->vdev,
> >> +                           offsetof(struct virtio_balloon_config,
> >> +                                    pages_free),
> >> +                           &value, sizeof(value));
> >> +     value = 0;
> >> +     vb->vdev->config->set(vb->vdev,
> >> +                           offsetof(struct virtio_balloon_config,
> >> +                                    need_stats),
> >> +                           &value, sizeof(value));
> >> +}
> >> +
> >> +static int balloon(void *_vballoon)
> >> +{
> >> +     struct virtio_balloon *vb = _vballoon;
> >> +
> >> +     set_freezable();
> >> +     while (!kthread_should_stop()) {
> >> +             s64 diff;
> >> +             try_to_freeze();
> >> +             wait_event_interruptible(vb->config_change,
> >> +                                      (diff = towards_target(vb)) > 0
> >> +                                      || config_need_stats(vb)
> >> +                                      || kthread_should_stop()
> >> +                                      || freezing(current));
> >> +             if (config_need_stats(vb))
> >> +                     update_free_and_total(vb);
> >> +             if (diff > 0) {
> >> +                     unsigned long reclaim_time = vb->last_reclaim + 2 * HZ;
> >> +                     /*
> >> +                      * Don't fill the balloon if a page reclaim happened in
> >> +                      * the past 2 seconds.
> >> +                      */
> >> +                     if (time_after_eq(reclaim_time, jiffies)) {
> >> +                             /* Inflating too fast--sleep and skip. */
> >> +                             msleep(500);
> >> +                     } else {
> >> +                             fill_balloon(vb, diff);
> >> +                     }
> >> +             } else if (diff < 0 && config_pages(vb) == 0) {
> >> +                     /*
> >> +                      * Here we are specifically looking to detect the case
> >> +                      * where there are pages in the page cache, but the
> >> +                      * device wants us to go to 0.  This is used in save/
> >> +                      * restore since the host device doesn't keep track of
> >> +                      * PFNs, and must flush the page cache on restore
> >> +                      * (which loses the context of the original device
> >> +                      * instance).  However, we still suggest syncing the
> >> +                      * diff so that we can get within the target range.
> >> +                      */
> >> +                     s64 nr_to_write =
> >> +                             (!config_pages(vb) ? LONG_MAX : -diff);
> >> +                     struct writeback_control wbc = {
> >> +                             .sync_mode = WB_SYNC_ALL,
> >> +                             .nr_to_write = nr_to_write,
> >> +                             .range_start = 0,
> >> +                             .range_end = LLONG_MAX,
> >> +                     };
> >> +                     sync_inode(&the_inode.inode, &wbc);
> >> +             }
> >> +             update_balloon_size(vb);
> >> +     }
> >> +     return 0;
> >> +}
> >> +
> >> +static ssize_t virtballoon_attr_show(struct device *dev,
> >> +                                  struct device_attribute *attr,
> >> +                                  char *buf);
> >> +
> >> +static DEVICE_ATTR(total_memory, 0644,
> >> +     virtballoon_attr_show, NULL);
> >> +
> >> +static DEVICE_ATTR(free_memory, 0644,
> >> +     virtballoon_attr_show, NULL);
> >> +
> >> +static DEVICE_ATTR(target_pages, 0644,
> >> +     virtballoon_attr_show, NULL);
> >> +
> >> +static DEVICE_ATTR(actual_pages, 0644,
> >> +     virtballoon_attr_show, NULL);
> >> +
> >> +static struct attribute *virtballoon_attrs[] = {
> >> +     &dev_attr_total_memory.attr,
> >> +     &dev_attr_free_memory.attr,
> >> +     &dev_attr_target_pages.attr,
> >> +     &dev_attr_actual_pages.attr,
> >> +     NULL
> >> +};
> >> +static struct attribute_group virtballoon_attr_group = {
> >> +     .name   = "virtballoon",
> >> +     .attrs  = virtballoon_attrs,
> >> +};
> >> +
> >> +static ssize_t virtballoon_attr_show(struct device *dev,
> >> +                                  struct device_attribute *attr,
> >> +                                  char *buf)
> >> +{
> >> +     struct address_space *mapping = the_inode.inode.i_mapping;
> >> +     struct virtio_device *vdev = container_of(dev, struct virtio_device,
> >> +                                               dev);
> >> +     struct virtio_balloon *vb = vdev->priv;
> >> +     unsigned long long value = 0;
> >> +     if (attr == &dev_attr_total_memory)
> >> +             value = vb->stats[VIRTIO_BALLOON_S_MEMTOT].val;
> >> +     else if (attr == &dev_attr_free_memory)
> >> +             value = vb->stats[VIRTIO_BALLOON_S_MEMFREE].val;
> >> +     else if (attr == &dev_attr_target_pages)
> >> +             value = config_pages(vb);
> >> +     else if (attr == &dev_attr_actual_pages)
> >> +             value = cpu_to_le32((mapping ? mapping->nrpages : 0));
> >> +     return sprintf(buf, "%llu\n", value);
> >> +}
> >> +
> >> +static int virtballoon_probe(struct virtio_device *vdev)
> >> +{
> >> +     struct virtio_balloon *vb;
> >> +     struct virtqueue *vq[1];
> >> +     vq_callback_t *callback = balloon_ack;
> >> +     const char *name = "inflate";
> >> +     int err;
> >> +
> >> +     vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL);
> >> +     if (!vb) {
> >> +             err = -ENOMEM;
> >> +             goto out;
> >> +     }
> >> +
> >> +     init_waitqueue_head(&vb->config_change);
> >> +     vb->vdev = vdev;
> >> +
> >> +     /* We use one virtqueue: inflate */
> >> +     err = vdev->config->find_vqs(vdev, 1, vq, &callback, &name);
> >> +     if (err)
> >> +             goto out_free_vb;
> >> +
> >> +     vb->inflate_vq = vq[0];
> >> +
> >> +     err = sysfs_create_group(&vdev->dev.kobj, &virtballoon_attr_group);
> >> +     if (err) {
> >> +             pr_err("Failed to create virtballoon sysfs node\n");
> >> +             goto out_free_vb;
> >> +     }
> >> +
> >> +     vb->last_scan_page_array = 0;
> >> +     vb->last_reclaim = 0;
> >> +     the_inode.vb = vb;
> >> +
> >> +     vb->thread = kthread_run(balloon, vb, "vballoon");
> >> +     if (IS_ERR(vb->thread)) {
> >> +             err = PTR_ERR(vb->thread);
> >> +             goto out_del_vqs;
> >> +     }
> >> +
> >> +     return 0;
> >> +
> >> +out_del_vqs:
> >> +     vdev->config->del_vqs(vdev);
> >> +out_free_vb:
> >> +     kfree(vb);
> >> +out:
> >> +     return err;
> >> +}
> >> +
> >> +static void __devexit virtballoon_remove(struct virtio_device *vdev)
> >> +{
> >> +     struct virtio_balloon *vb = vdev->priv;
> >> +
> >> +     kthread_stop(vb->thread);
> >> +
> >> +     sysfs_remove_group(&vdev->dev.kobj, &virtballoon_attr_group);
> >> +
> >> +     /* Now we reset the device so we can clean up the queues. */
> >> +     vdev->config->reset(vdev);
> >> +
> >> +     vdev->config->del_vqs(vdev);
> >> +     kfree(vb);
> >> +}
> >> +
> >> +static struct virtio_driver virtio_balloon_driver = {
> >> +     .feature_table          = NULL,
> >> +     .feature_table_size     = 0,
> >> +     .driver.name            = KBUILD_MODNAME,
> >> +     .driver.owner           = THIS_MODULE,
> >> +     .id_table               = id_table,
> >> +     .probe                  = virtballoon_probe,
> >> +     .remove                 = __devexit_p(virtballoon_remove),
> >> +     .config_changed         = virtballoon_changed,
> >> +};
> >> +
> >> +static int __init init(void)
> >> +{
> >> +     int err = register_filesystem(&balloon_fs_type);
> >> +     if (err)
> >> +             goto out;
> >> +
> >> +     balloon_mnt = kern_mount(&balloon_fs_type);
> >> +     if (IS_ERR(balloon_mnt)) {
> >> +             err = PTR_ERR(balloon_mnt);
> >> +             goto out_filesystem;
> >> +     }
> >> +
> >> +     err = register_virtio_driver(&virtio_balloon_driver);
> >> +     if (err)
> >> +             goto out_filesystem;
> >> +
> >> +     goto out;
> >> +
> >> +out_filesystem:
> >> +     unregister_filesystem(&balloon_fs_type);
> >> +
> >> +out:
> >> +     return err;
> >> +}
> >> +
> >> +static void __exit fini(void)
> >> +{
> >> +     if (balloon_mnt) {
> >> +             unregister_filesystem(&balloon_fs_type);
> >> +             balloon_mnt = NULL;
> >> +     }
> >> +     unregister_virtio_driver(&virtio_balloon_driver);
> >> +}
> >> +module_init(init);
> >> +module_exit(fini);
> >> +
> >> +MODULE_DEVICE_TABLE(virtio, id_table);
> >> +MODULE_DESCRIPTION("Virtio file (page cache-backed) balloon driver");
> >> +MODULE_LICENSE("GPL");
> >> diff --git a/include/linux/virtio_balloon.h b/include/linux/virtio_balloon.h
> >> index 652dc8b..2be9a02 100644
> >> --- a/include/linux/virtio_balloon.h
> >> +++ b/include/linux/virtio_balloon.h
> >> @@ -41,6 +41,15 @@ struct virtio_balloon_config
> >>       __le32 num_pages;
> >>       /* Number of pages we've actually got in balloon. */
> >>       __le32 actual;
> >> +#if defined(CONFIG_VIRTIO_FILEBALLOON) ||\
> >> +     defined(CONFIG_VIRTIO_FILEBALLOON_MODULE)
> >> +     /* Total pages on this system. */
> >> +     __le32 pages_total;
> >> +     /* Free pages on this system. */
> >> +     __le32 pages_free;
> >> +     /* If the device needs pages_total/pages_free updated. */
> >> +     __le32 need_stats;
> >> +#endif
> >>  };
> >>
> >>  #define VIRTIO_BALLOON_S_SWAP_IN  0   /* Amount of memory swapped in */
> >> diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
> >> index 7529b85..2f081d7 100644
> >> --- a/include/linux/virtio_ids.h
> >> +++ b/include/linux/virtio_ids.h
> >> @@ -37,5 +37,6 @@
> >>  #define VIRTIO_ID_RPMSG              7 /* virtio remote processor messaging */
> >>  #define VIRTIO_ID_SCSI               8 /* virtio scsi */
> >>  #define VIRTIO_ID_9P         9 /* 9p virtio console */
> >> +#define VIRTIO_ID_FILE_BALLOON       10 /* virtio file-backed balloon */
> >>
> >>  #endif /* _LINUX_VIRTIO_IDS_H */
> >> --
> >> 1.7.7.3

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10 19:59     ` Michael S. Tsirkin
@ 2012-09-10 20:49       ` Mike Waychison
  2012-09-10 21:10         ` Michael S. Tsirkin
  2012-09-12  5:25         ` Rusty Russell
  0 siblings, 2 replies; 30+ messages in thread
From: Mike Waychison @ 2012-09-10 20:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Frank Swiderski, Rusty Russell, Rik van Riel, Andrea Arcangeli,
	virtualization, linux-kernel, kvm

On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
>> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> >> This implementation of a virtio balloon driver uses the page cache to
>> >> "store" pages that have been released to the host.  The communication
>> >> (outside of target counts) is one way--the guest notifies the host when
>> >> it adds a page to the page cache, allowing the host to madvise(2) with
>> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> >> (via the regular page reclaim).  This means that inflating the balloon
>> >> is similar to the existing balloon mechanism, but the deflate is
>> >> different--it re-uses existing Linux kernel functionality to
>> >> automatically reclaim.
>> >>
>> >> Signed-off-by: Frank Swiderski <fes@google.com>
>>
>> Hi Michael,
>>
>> I'm very sorry that Frank and I have been silent on these threads.
>> I've been out of the office and Frank has been been swamped :)
>>
>> I'll take a stab at answering some of your questions below, and
>> hopefully we can end up on the same page.
>>
>> > I've been trying to understand this, and I have
>> > a question: what exactly is the benefit
>> > of this new device?r balloon is told upper limit on target size by host and pulls
>>
>> The key difference between this device/driver and the pre-existing
>> virtio_balloon device/driver is in how the memory pressure loop is
>> controlled.
>>
>> With the pre-existing balloon device/driver, the control loop for how
>> much memory a given VM is allowed to use is controlled completely by
>> the host.  This is probably fine if the goal is to pack as much work
>> on a given host as possible, but it says nothing about the expected
>> performance that any given VM is expecting to have.  Specifically, it
>> allows the host to set a target goal for the size of a VM, and the
>> driver in the guest does whatever is needed to get to that goal.  This
>> is great for systems where one wants to "grow or shrink" a VM from the
>> outside.
>>
>>
>> This behaviour however doesn't match what applications actually expectr balloon is told upper limit on target size by host and pulls
>> from a memory control loop however.  In a native setup, an application
>> can usually expect to allocate memory from the kernel on an as-needed
>> basis, and can in turn return memory back to the system (using a heap
>> implementation that actually releases memory that is).  The dynamic
>> size of an application is completely controlled by the application,
>> and there is very little that cluster management software can do to
>> ensure that the application fits some prescribed size.
>>
>> We recognized this in the development of our cluster management
>> software long ago, so our systems are designed for managing tasks that
>> have a dynamic memory footprint.  Overcommit is possible (as most
>> applications do not use the full reservation of memory they asked for
>> originally), letting us do things like schedule lower priority/lower
>> service-classification work using resources that are otherwise
>> available in stand-by for high-priority/low-latency workloads.
>
> OK I am not sure I got this right so pls tell me if this summary is
> correct (note: this does not talk about what guest does with memory,
> ust what it is that device does):
>
> - existing balloon is told lower limit on target size by host and pulls in at least
>   target size. Guest can inflate > target size if it likes
>   and then it is OK to deflate back to target size but not less.

Is this true?   I take it nothing is keeping the existing balloon
driver from going over the target, but the same can be said about
either balloon implementation.

> - your balloon is told upper limit on target size by host and pulls at most
>   target size. Guest can deflate down to 0 at any point.
>
> If so I think both approaches make sense and in fact they
> can be useful at the same time for the same guest.
> In that case, I see two ways how this can be done:
>
> 1. two devices: existing ballon + cache balloon the
> 2. add "upper limit" to existing ballon
>
> A single device looks a bit more natural in that we don't
> really care in which balloon a page is as long as we
> are between lower and upper limit. Right?

I agree that this may be better done using a single device if possible.

> From implementation POV we could have it use
> pagecache for pages above lower limit but that
> is a separate question about driver design,
> I would like to make sure I understand the highr balloon is told upper limit on tr balloon is told upper limit on target size by host and pullsarget size by host and pulls
> level design first.

I agree that this is an implementation detail that is separate from
discussions of high and low limits.  That said, there are several
advantages to pushing these pages to the page cache (memory defrag
still works for one).

>> > Note that users could not care less about how a driver
>> > is implemented internally.
>> >
>> > Is there some workload where you see VM working better with
>> > this than regular balloon? Any numbers?
>>
>> This device is less about performance as it is about getting the
>> memory size of a job (or in this case, a job in a VM) to grow and
>> shrink as the application workload sees fit, much like how processes
>> today can grow and shrink without external direction.
>
> Still, e.g. swap in host achieves more or less the same functionality.

Swap comes at the extremely prejudiced cost of latency.  Swap is very
very rarely used in our production environment for this reason.

> I am guessing balloon can work better by getting more cooperation
> from guest but aren't there any tests showing this is true in practice?

There aren't any meaningful test-specific numbers that I can readily
share unfortunately :(  If you have suggestions for specific things we
should try, that may be useful.

The way this change is validated on our end is to ensure that VM
processes on the host "shrink" to a reasonable working set in size
that is near-linear with the expected working set size for the
embedded tasks as if they were running native on the host.  Making
this happen with the current balloon just isn't possible as there
isn't enough visibility on the host as to how much pressure there is
in the guest.

>
>
>> >
>> > Also, can't we just replace existing balloon implementation
>> > with this one?
>>
>> Perhaps, but as described above, both devices have very different
>> characteristics.
>>
>> > Why it is so important to deflate silently?
>>
>> It may not be so important to deflate silently.  I'm not sure why it
>> is important that we deflate "loudly" though either :)  Doing so seems
>> like unnecessary guest/host communication IMO, especially if the guest
>> is expecting to be able to grow to totalram (and the host isn't able
>> to nack any pages reclaimed anyway...).
>
> First, we could add nack easily enough :)

:) Sure.  Not sure how the driver is going to expect to handle that though ! :D

> Second, access gets an exit anyway. If you tell
> host first you can maybe batch these and actually speed things up.
> It remains to be measured but historically we told host
> so the onus of proof would be on whoever wants to remove this.

I'll concede that there isn't a very compelling argument as to why the
balloon should deflate silently.  You are right that it may be better
to deflate in batches (amortizing exit costs). That said, it isn't
totally obvious that queue'ing pfns to the virtio queue is the right
thing to do algorithmically either.  Currently, the file balloon
driver can reclaim memory inline with memory reclaim (via the
->writepage callback). Doing otherwise may cause the LRU shrinking to
queue large numbers of pages to the virtio queue, without any
immediate progress made with regards to actually freeing memory.  I'm
worried that such an enqueue scheme will cause large bursts of pages
to be deflated unnecessarily when we go into reclaim.

On the plus side, having an exit taken here on each page turns out to
be relatively cheap, as the vmexit from the page fault should be
faster to process as it is fully handled within the host kernel.

Perhaps some combination of both methods is required? I'm not sure :\

>
> Third, see discussion on ML - we came up with
> the idea of locking/unlocking balloon memory
> which is useful for an assigned device.
> Requires telling host first.

I just skimmed the other thread (sorry, I'm very much backlogged on
email).  By "locking", does this mean pinning the pages so that they
are not changed?

I'll admit that I'm not familiar with the details for device
assignment.  If a page for a given bus address isn't present in the
IOMMU, does this not result in a serviceable fault?

>
> Also knowing how much memory there is in a balloon
> would be useful for admin.

This is just another counter and should already be exposed.

>
> There could be other uses.
>
>> > I guess filesystem does not currently get a callback
>> > before page is reclaimed but this isan implementation detail -
>> > maybe this can be fixed?
>>
>> I do not follow this question.
>
> Assume we want to tell host before use.
> Can you implement this on top of your patch?

Potentially, yes.  Both drivers are bare-bones at the moment IIRC and
don't support sending multiple outstanding commands to the host, but
this could be conceivably fixed (although one would have to work out
what happens when virtio_add_buf() returns -ENOBUFS).

>
>> >
>> > Also can you pls answer Avi's question?
>> > How is overcommit managed?
>>
>> Overcommit in our deployments is managed using memory cgroups on the
>> host.  This allows us to have very directed policies as to how
>> competing VMs on a host may overcommit.
>
> So you push VM out to swap if it's over allowed memory?

As mentioned above, we don't use swap. If the task is of a lower
service band, it may end up blocking a lot more waiting for host
memory to become available, or may even be killed by the system and
restarted elsewhere.  Tasks that are of the higher service bands will
cause other tasks of lower service band to give up the ram (by will or
by force).

> Existing balloon does this better as it is cooperative,
> it seems.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10 20:49       ` Mike Waychison
@ 2012-09-10 21:10         ` Michael S. Tsirkin
  2012-10-30 15:29           ` Michael S. Tsirkin
  2012-09-12  5:25         ` Rusty Russell
  1 sibling, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-09-10 21:10 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Frank Swiderski, Rusty Russell, Rik van Riel, Andrea Arcangeli,
	virtualization, linux-kernel, kvm

On Mon, Sep 10, 2012 at 04:49:40PM -0400, Mike Waychison wrote:
> On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
> >> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> >> >> This implementation of a virtio balloon driver uses the page cache to
> >> >> "store" pages that have been released to the host.  The communication
> >> >> (outside of target counts) is one way--the guest notifies the host when
> >> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> >> (via the regular page reclaim).  This means that inflating the balloon
> >> >> is similar to the existing balloon mechanism, but the deflate is
> >> >> different--it re-uses existing Linux kernel functionality to
> >> >> automatically reclaim.
> >> >>
> >> >> Signed-off-by: Frank Swiderski <fes@google.com>
> >>
> >> Hi Michael,
> >>
> >> I'm very sorry that Frank and I have been silent on these threads.
> >> I've been out of the office and Frank has been been swamped :)
> >>
> >> I'll take a stab at answering some of your questions below, and
> >> hopefully we can end up on the same page.
> >>
> >> > I've been trying to understand this, and I have
> >> > a question: what exactly is the benefit
> >> > of this new device?r balloon is told upper limit on target size by host and pulls
> >>
> >> The key difference between this device/driver and the pre-existing
> >> virtio_balloon device/driver is in how the memory pressure loop is
> >> controlled.
> >>
> >> With the pre-existing balloon device/driver, the control loop for how
> >> much memory a given VM is allowed to use is controlled completely by
> >> the host.  This is probably fine if the goal is to pack as much work
> >> on a given host as possible, but it says nothing about the expected
> >> performance that any given VM is expecting to have.  Specifically, it
> >> allows the host to set a target goal for the size of a VM, and the
> >> driver in the guest does whatever is needed to get to that goal.  This
> >> is great for systems where one wants to "grow or shrink" a VM from the
> >> outside.
> >>
> >>
> >> This behaviour however doesn't match what applications actually expectr balloon is told upper limit on target size by host and pulls
> >> from a memory control loop however.  In a native setup, an application
> >> can usually expect to allocate memory from the kernel on an as-needed
> >> basis, and can in turn return memory back to the system (using a heap
> >> implementation that actually releases memory that is).  The dynamic
> >> size of an application is completely controlled by the application,
> >> and there is very little that cluster management software can do to
> >> ensure that the application fits some prescribed size.
> >>
> >> We recognized this in the development of our cluster management
> >> software long ago, so our systems are designed for managing tasks that
> >> have a dynamic memory footprint.  Overcommit is possible (as most
> >> applications do not use the full reservation of memory they asked for
> >> originally), letting us do things like schedule lower priority/lower
> >> service-classification work using resources that are otherwise
> >> available in stand-by for high-priority/low-latency workloads.
> >
> > OK I am not sure I got this right so pls tell me if this summary is
> > correct (note: this does not talk about what guest does with memory,
> > ust what it is that device does):
> >
> > - existing balloon is told lower limit on target size by host and pulls in at least
> >   target size. Guest can inflate > target size if it likes
> >   and then it is OK to deflate back to target size but not less.
> 
> Is this true?   I take it nothing is keeping the existing balloon
> driver from going over the target, but the same can be said about
> either balloon implementation.
> 
> > - your balloon is told upper limit on target size by host and pulls at most
> >   target size. Guest can deflate down to 0 at any point.
> >
> > If so I think both approaches make sense and in fact they
> > can be useful at the same time for the same guest.
> > In that case, I see two ways how this can be done:
> >
> > 1. two devices: existing ballon + cache balloon the
> > 2. add "upper limit" to existing ballon
> >
> > A single device looks a bit more natural in that we don't
> > really care in which balloon a page is as long as we
> > are between lower and upper limit. Right?
> 
> I agree that this may be better done using a single device if possible.

I am not sure myself, just asking.

> > From implementation POV we could have it use
> > pagecache for pages above lower limit but that
> > is a separate question about driver design,
> > I would like to make sure I understand the highr balloon is told upper limit on tr balloon is told upper limit on target size by host and pullsarget size by host and pulls
> > level design first.
> 
> I agree that this is an implementation detail that is separate from
> discussions of high and low limits.  That said, there are several
> advantages to pushing these pages to the page cache (memory defrag
> still works for one).

I'm not arguing against it at all.

> >> > Note that users could not care less about how a driver
> >> > is implemented internally.
> >> >
> >> > Is there some workload where you see VM working better with
> >> > this than regular balloon? Any numbers?
> >>
> >> This device is less about performance as it is about getting the
> >> memory size of a job (or in this case, a job in a VM) to grow and
> >> shrink as the application workload sees fit, much like how processes
> >> today can grow and shrink without external direction.
> >
> > Still, e.g. swap in host achieves more or less the same functionality.
> 
> Swap comes at the extremely prejudiced cost of latency.  Swap is very
> very rarely used in our production environment for this reason.
> 
> > I am guessing balloon can work better by getting more cooperation
> > from guest but aren't there any tests showing this is true in practice?
> 
> There aren't any meaningful test-specific numbers that I can readily
> share unfortunately :(  If you have suggestions for specific things we
> should try, that may be useful.
> 
> The way this change is validated on our end is to ensure that VM
> processes on the host "shrink" to a reasonable working set in size
> that is near-linear with the expected working set size for the
> embedded tasks as if they were running native on the host.  Making
> this happen with the current balloon just isn't possible as there
> isn't enough visibility on the host as to how much pressure there is
> in the guest.
> 
> >
> >
> >> >
> >> > Also, can't we just replace existing balloon implementation
> >> > with this one?
> >>
> >> Perhaps, but as described above, both devices have very different
> >> characteristics.
> >>
> >> > Why it is so important to deflate silently?
> >>
> >> It may not be so important to deflate silently.  I'm not sure why it
> >> is important that we deflate "loudly" though either :)  Doing so seems
> >> like unnecessary guest/host communication IMO, especially if the guest
> >> is expecting to be able to grow to totalram (and the host isn't able
> >> to nack any pages reclaimed anyway...).
> >
> > First, we could add nack easily enough :)
> 
> :) Sure.  Not sure how the driver is going to expect to handle that though ! :D

Not sure about pagecache backed - regular one can just hang on
to the page for a while more and try later or with another page.

> > Second, access gets an exit anyway. If you tell
> > host first you can maybe batch these and actually speed things up.
> > It remains to be measured but historically we told host
> > so the onus of proof would be on whoever wants to remove this.
> 
> I'll concede that there isn't a very compelling argument as to why the
> balloon should deflate silently.  You are right that it may be better
> to deflate in batches (amortizing exit costs). That said, it isn't
> totally obvious that queue'ing pfns to the virtio queue is the right
> thing to do algorithmically either.  Currently, the file balloon
> driver can reclaim memory inline with memory reclaim (via the
> ->writepage callback). Doing otherwise may cause the LRU shrinking to
> queue large numbers of pages to the virtio queue, without any
> immediate progress made with regards to actually freeing memory.  I'm
> worried that such an enqueue scheme will cause large bursts of pages
> to be deflated unnecessarily when we go into reclaim.

Yes it would seem writepage is not a good mechanism since
it can try to write pages speculatively.
Maybe add a flag to tell LRU to only write pages when
we really need the memory?

> On the plus side, having an exit taken here on each page turns out to
> be relatively cheap, as the vmexit from the page fault should be
> faster to process as it is fully handled within the host kernel.
> 
> Perhaps some combination of both methods is required? I'm not sure :\

Perhaps some benchmarking is in order :)
Can you try telling host, potentially MADV_WILL_NEED
in that case like qemu does, then run your proprietary test
and see if things work well enough?

> >
> > Third, see discussion on ML - we came up with
> > the idea of locking/unlocking balloon memory
> > which is useful for an assigned device.
> > Requires telling host first.
> 
> I just skimmed the other thread (sorry, I'm very much backlogged on
> email).  By "locking", does this mean pinning the pages so that they
> are not changed?

Yes by get user pages.

> I'll admit that I'm not familiar with the details for device
> assignment.  If a page for a given bus address isn't present in the
> IOMMU, does this not result in a serviceable fault?

Yes.

> >
> > Also knowing how much memory there is in a balloon
> > would be useful for admin.
> 
> This is just another counter and should already be exposed.
> 
> >
> > There could be other uses.
> >
> >> > I guess filesystem does not currently get a callback
> >> > before page is reclaimed but this isan implementation detail -
> >> > maybe this can be fixed?
> >>
> >> I do not follow this question.
> >
> > Assume we want to tell host before use.
> > Can you implement this on top of your patch?
> 
> Potentially, yes.  Both drivers are bare-bones at the moment IIRC and
> don't support sending multiple outstanding commands to the host, but
> this could be conceivably fixed (although one would have to work out
> what happens when virtio_add_buf() returns -ENOBUFS).

It's not enough to add buf. You need to wait for host ack.
Once you got ack you know you can add another buf.

> >
> >> >
> >> > Also can you pls answer Avi's question?
> >> > How is overcommit managed?
> >>
> >> Overcommit in our deployments is managed using memory cgroups on the
> >> host.  This allows us to have very directed policies as to how
> >> competing VMs on a host may overcommit.
> >
> > So you push VM out to swap if it's over allowed memory?
> 
> As mentioned above, we don't use swap. If the task is of a lower
> service band, it may end up blocking a lot more waiting for host
> memory to become available, or may even be killed by the system and
> restarted elsewhere.  Tasks that are of the higher service bands will
> cause other tasks of lower service band to give up the ram (by will or
> by force).

Right. I think the comment below applies.

> > Existing balloon does this better as it is cooperative,
> > it seems.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10 20:49       ` Mike Waychison
  2012-09-10 21:10         ` Michael S. Tsirkin
@ 2012-09-12  5:25         ` Rusty Russell
  1 sibling, 0 replies; 30+ messages in thread
From: Rusty Russell @ 2012-09-12  5:25 UTC (permalink / raw)
  To: Mike Waychison, Michael S. Tsirkin
  Cc: Frank Swiderski, Rik van Riel, Andrea Arcangeli, virtualization,
	linux-kernel, kvm

Mike Waychison <mikew@google.com> writes:
> On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
>>> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>>> >> This implementation of a virtio balloon driver uses the page cache to
>>> >> "store" pages that have been released to the host.  The communication
>>> >> (outside of target counts) is one way--the guest notifies the host when
>>> >> it adds a page to the page cache, allowing the host to madvise(2) with
>>> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>>> >> (via the regular page reclaim).  This means that inflating the balloon
>>> >> is similar to the existing balloon mechanism, but the deflate is
>>> >> different--it re-uses existing Linux kernel functionality to
>>> >> automatically reclaim.
>>> >>
>>> >> Signed-off-by: Frank Swiderski <fes@google.com>
>>>
>>> Hi Michael,
>>>
>>> I'm very sorry that Frank and I have been silent on these threads.
>>> I've been out of the office and Frank has been been swamped :)
>>>
>>> I'll take a stab at answering some of your questions below, and
>>> hopefully we can end up on the same page.

Hi Mike, Frank, Michael,

        Thanks for the explanation and discussion.  I like that this
implementation is more dynamic: the guest can use more pages for a while
(and the balloon kthread will furiously start trying to grab more pages
to give back).  This part is a completely reasonable implementation, and
more sophisticated that what we have.

It doesn't *quite* meet the spec, because we don't notify the host when
we pull a page from the balloon, but I think that is quite possible.  If
this is a performance waster, we should add a "SILENT_DEFLATE" feature
to tell the driver that it doesn't need to, though we should stll
support the !SILENT_DEFLATE case.

And Michael: thanks again for doing the heavy lifting on this!

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-09-10 21:10         ` Michael S. Tsirkin
@ 2012-10-30 15:29           ` Michael S. Tsirkin
  2012-10-30 16:25             ` Mike Waychison
  0 siblings, 1 reply; 30+ messages in thread
From: Michael S. Tsirkin @ 2012-10-30 15:29 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Frank Swiderski, Rusty Russell, Rik van Riel, Andrea Arcangeli,
	virtualization, linux-kernel, kvm

On Tue, Sep 11, 2012 at 12:10:18AM +0300, Michael S. Tsirkin wrote:
> > On the plus side, having an exit taken here on each page turns out to
> > be relatively cheap, as the vmexit from the page fault should be
> > faster to process as it is fully handled within the host kernel.
> > 
> > Perhaps some combination of both methods is required? I'm not sure :\
> 
> Perhaps some benchmarking is in order :)
> Can you try telling host, potentially MADV_WILL_NEED
> in that case like qemu does, then run your proprietary test
> and see if things work well enough?

Ping. Had a chance to try that?

-- 
MST

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] Add a page cache-backed balloon device driver.
  2012-10-30 15:29           ` Michael S. Tsirkin
@ 2012-10-30 16:25             ` Mike Waychison
  0 siblings, 0 replies; 30+ messages in thread
From: Mike Waychison @ 2012-10-30 16:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Frank Swiderski, Rusty Russell, Rik van Riel, Andrea Arcangeli,
	virtualization, linux-kernel, kvm

On Tue, Oct 30, 2012 at 8:29 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Sep 11, 2012 at 12:10:18AM +0300, Michael S. Tsirkin wrote:
>> > On the plus side, having an exit taken here on each page turns out to
>> > be relatively cheap, as the vmexit from the page fault should be
>> > faster to process as it is fully handled within the host kernel.
>> >
>> > Perhaps some combination of both methods is required? I'm not sure :\
>>
>> Perhaps some benchmarking is in order :)
>> Can you try telling host, potentially MADV_WILL_NEED
>> in that case like qemu does, then run your proprietary test
>> and see if things work well enough?
>
> Ping. Had a chance to try that?

Not yet.  We've been focused on other things recently and probably
won't have a chance to return to this for a couple months at best.
Sorry :(

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-10-30 16:26 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-26 20:32 [PATCH] Add a page cache-backed balloon device driver Frank Swiderski
2012-06-26 20:40 ` Rik van Riel
2012-06-26 21:31   ` Frank Swiderski
2012-06-26 21:45     ` Rik van Riel
2012-06-26 23:45       ` Frank Swiderski
2012-06-27  9:04         ` Michael S. Tsirkin
2012-06-26 21:47     ` Michael S. Tsirkin
2012-06-26 23:21       ` Frank Swiderski
2012-06-27  9:02         ` Michael S. Tsirkin
2012-07-02  0:29         ` Rusty Russell
2012-09-03  6:35           ` Paolo Bonzini
2012-09-06  1:35             ` Rusty Russell
2012-06-26 21:41 ` Michael S. Tsirkin
2012-06-27  2:56   ` Rusty Russell
2012-06-27 15:48     ` Frank Swiderski
2012-06-27 16:06       ` Michael S. Tsirkin
2012-06-27 16:08         ` Frank Swiderski
2012-06-27  9:40 ` Amit Shah
2012-08-30  8:57 ` Michael S. Tsirkin
2012-09-03 15:09 ` Avi Kivity
2012-09-10  9:05 ` Michael S. Tsirkin
2012-09-10 17:37   ` Mike Waychison
2012-09-10 18:04     ` Rik van Riel
2012-09-10 18:29       ` Mike Waychison
2012-09-10 19:59     ` Michael S. Tsirkin
2012-09-10 20:49       ` Mike Waychison
2012-09-10 21:10         ` Michael S. Tsirkin
2012-10-30 15:29           ` Michael S. Tsirkin
2012-10-30 16:25             ` Mike Waychison
2012-09-12  5:25         ` Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).