* [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality
@ 2010-08-12 22:22 bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 1/6] Btrfs: Add experimental hot data hash list index bchociej
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: bchociej @ 2010-08-12 22:22 UTC (permalink / raw)
  To: chris.mason, linux-btrfs
  Cc: linux-fsdevel, linux-kernel, cmm, bcchocie, mrlupfer, crscott,
	bchociej, mlupfer, conscott

These patches are a replacement for our previous hot data tracking
patches. They include some bugfixes as well as the previously promised
hot data relocation code for moving frequently accessed data to SSD.
Structurally, the patches are quite similar to the first set, with the
notable addition of new hotdata_relocate.{c,h} files. Matt Lupfer and
Conor Scott have done as much of the coding as I have, if not more. So,
many thanks to those guys, along with Mingming Cao, Steve French, Steve
Pratt, and Chris Mason, without whom this little project would have
been impossible.


INTRODUCTION:

This patch series adds experimental support for relocation of hot data
to SSD in Btrfs. Essentially, this means maintaining some key stats
(like number of reads/writes, last read/write time, frequency of
reads/writes), then distilling those numbers down to a single
"temperature" value that reflects what data is "hot," and using that
temperature to move data to SSDs.
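
For reference, the per-inode and per-range access metrics that feed the
temperature calculation look roughly like this (a simplified sketch of
the btrfs_freq_data struct added in patch 2; the flags and cached
temperature fields are omitted here):

	struct btrfs_freq_data {
		struct timespec last_read_time;  /* time of last read */
		struct timespec last_write_time; /* time of last write */
		u32 nr_reads;             /* reads since mount */
		u32 nr_writes;            /* writes since mount */
		u64 avg_delta_reads;      /* avg time between reads (ns) */
		u64 avg_delta_writes;     /* avg time between writes (ns) */
	};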

The long-term goal of these patches is to allow Btrfs to intelligently
utilize SSDs in a heterogeneous volume. Incidentally, this project has
been motivated by the Project Ideas page on the Btrfs wiki.

Of course, users are warned not to run this code outside of development
environments. These patches are EXPERIMENTAL, and as such they might eat
your data and/or memory. That said, the code should be relatively safe
when the hotdatatrack and hotdatamove mount options are disabled.


MOTIVATION:

The overall goal of enabling hot data relocation to SSD has been
motivated by the Project Ideas page on the Btrfs wiki at
<https://btrfs.wiki.kernel.org/index.php/Project_ideas>. It is hoped
that this initial patchset will eventually mature into a usable hybrid
storage feature set for Btrfs.

This is essentially the traditional cache argument: SSD is fast and
expensive; HDD is cheap but slow. ZFS, for example, can already take
advantage of SSD caching. Btrfs should also be able to take advantage of
hybrid storage without many broad, sweeping changes to existing code.

With Btrfs's COW approach, an external cache (where data is *moved* to
SSD, rather than just cached there) makes a lot of sense. These patches,
in contrast to the previous version, now enable the hot data relocation
functionality. While performance testing so far has been extremely
basic, the code has shown promising results in random read tests
(roughly 5x throughput after adding an SSD with about 20% of the
volume's total capacity).


SUMMARY:

- Hooks in existing Btrfs functions to track data access frequency
  (btrfs_direct_IO, btrfs_readpages, btrfs_writepage(s), and
  extent_write_cache_pages)

- New rbtrees for tracking access frequency of inodes and sub-file
  ranges (hotdata_map.c)

- A hash list for indexing data by its temperature (hotdata_hash.c)

- A debugfs interface for dumping data from the rbtrees (debugfs.c)

- A background kthread for relocating data to faster media based on
  temperature (hotdata_relocate.c)

- Mount options for enabling temperature tracking (-o hotdatatrack,
  -o hotdatamove; move implies track; both default to disabled)

- An ioctl to retrieve the frequency information collected for a certain
  file

- Ioctls to enable/disable frequency tracking and relocation per inode


DIFFSTAT:

$ git diff --stat --summary -M

 fs/btrfs/Makefile           |    3 +-
 fs/btrfs/ctree.h            |   96 ++++
 fs/btrfs/debugfs.c          |  532 ++++++++++++++++++++++
 fs/btrfs/debugfs.h          |   89 ++++
 fs/btrfs/disk-io.c          |   28 ++
 fs/btrfs/extent-tree.c      |   62 +++-
 fs/btrfs/extent_io.c        |   34 ++
 fs/btrfs/extent_io.h        |    7 +
 fs/btrfs/hotdata_hash.c     |  338 ++++++++++++++
 fs/btrfs/hotdata_hash.h     |  155 +++++++
 fs/btrfs/hotdata_map.c      |  804 +++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_map.h      |  167 +++++++
 fs/btrfs/hotdata_relocate.c |  783 ++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_relocate.h |   73 +++
 fs/btrfs/inode.c            |  164 +++++++-
 fs/btrfs/ioctl.c            |  142 ++++++-
 fs/btrfs/ioctl.h            |   23 +
 fs/btrfs/super.c            |   62 +++-
 fs/btrfs/volumes.c          |   38 ++-
 19 files changed, 3580 insertions(+), 20 deletions(-)

 create mode 100644 fs/btrfs/debugfs.c
 create mode 100644 fs/btrfs/debugfs.h
 create mode 100644 fs/btrfs/hotdata_hash.c
 create mode 100644 fs/btrfs/hotdata_hash.h
 create mode 100644 fs/btrfs/hotdata_map.c
 create mode 100644 fs/btrfs/hotdata_map.h
 create mode 100644 fs/btrfs/hotdata_relocate.c
 create mode 100644 fs/btrfs/hotdata_relocate.h


IMPLEMENTATION (in a nutshell):

Hooks have been added to various functions (btrfs_writepage(s),
btrfs_readpages, btrfs_direct_IO, and extent_write_cache_pages) in
order to track data access patterns. Each of these hooks calls a new
function, btrfs_update_freqs, that records each access to an inode,
possibly including some sub-file-level information as well. A data
structure containing various frequency metrics is updated with the
latest access information.
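
As a rough illustration (not the literal hook code; see the later
patches in the series for the real call sites), a read-side hook boils
down to something like:

	/* record a read access covering [start, start + len) */
	btrfs_update_freqs(inode, start, len, 0);

with the write-side hooks passing 1 for the final "create" argument
instead.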

From there, a hash list takes over the job of figuring out a total
"temperature" value for the data and indexing that temperature for fast
lookup in the future. The function that does the temperature
distillation is rather sensitive and can be tuned/tweaked by altering
various #defined values in hotdata_hash.h.
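
As a worked example of those knobs: with the defaults in hotdata_hash.h
(NRR_MULTIPLIER_POWER = 20, NRR_COEFF_POWER = 0, HEAT_HASH_BITS = 8),
the read-count criterion contributes to the temperature roughly as
follows, where sum_of_six_terms is a stand-in for all six criteria
handled the same way:

	nrr_heat = fdata->nr_reads << 20;	/* scale the raw count */
	nrr_heat >>= (3 - 0);			/* apply coefficient weight */
	temp = sum_of_six_terms >> (32 - 8);	/* map into 0..255 */

That is, each criterion is scaled into a u32, weighted by its
*_COEFF_POWER, summed, and normalized down to a hash bucket index
between 0 and HEAT_MAX_VALUE.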

As for the actual data relocation, a kthread runs periodically and uses
the hash list to find data eligible for relocation, either to or from
SSD. It then initiates the transfer of that data by allocating from an
appropriate block group type on the destination media, chosen according
to the temperature of the data and the speed of the media.
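
In pseudocode, one pass of that kthread amounts to something like the
following (an illustrative sketch of the logic just described, not the
literal code from hotdata_relocate.c):

	/* sketch: one iteration of the relocation kthread */
	update the heat threshold based on SSD fullness;
	for each hash bucket hotter than the threshold:
		relocate its ranges to SSD, up to the per-pass limit;
	if the SSDs are past the high-water mark:
		move the coldest SSD-resident ranges back to HDD;
	sleep until the next period;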

Aside from the core functionality, there is a debugfs interface to spit
out some of the data that is collected, and ioctls are also introduced
to manipulate the new functionality on a per-inode basis.


HOW TO USE HOTDATA RELOCATION:

First, format like this:

	# mkfs.btrfs -h <spinning_disk_blockdev> [any_blockdev] ...

Note that a spinning disk must be the first block device listed, or you
will get a warning and possibly unexpected behavior. To use hot data tracking
alone, you only need one block device, and it needn't be an SSD. To use
hot data relocation, you should have at least one spinning disk and at
least one SSD. Then...

	# mount -o hotdatamove <any_blockdev> <mountpoint>

Optionally, view information about hot data from debugfs:

	# cat /sys/kernel/debug/btrfs_data/<blockdev>/inode_data
	# cat /sys/kernel/debug/btrfs_data/<blockdev>/range_data


KNOWN ISSUES:
(When hotdatatrack or hotdatamove mount options are enabled)

- Occasional errors (-EIO) from read/write syscalls.

- Heavy file creation workloads encounter high lock contention,
  significantly impacting performance.


FUTURE GOALS:

- Store more information about data temperature / access frequency
  persistently between mounts.

- Track temperature of and relocate metadata (and inline extents) to
  SSD.


Signed-off-by: Ben Chociej <bchociej@gmail.com>
Signed-off-by: Matt Lupfer <mlupfer@gmail.com>
Signed-off-by: Conor Scott <conscott@vt.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Steve French <smfrench@gmail.com>


* [RFC v2 PATCH 1/6] Btrfs: Add experimental hot data hash list index
  2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
@ 2010-08-12 22:22 ` bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 2/6] Btrfs: Add data structures for hot data tracking bchociej
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: bchociej @ 2010-08-12 22:22 UTC (permalink / raw)
  To: chris.mason, linux-btrfs
  Cc: linux-fsdevel, linux-kernel, cmm, bcchocie, mrlupfer, crscott,
	bchociej, mlupfer, conscott

From: Ben Chociej <bchociej@gmail.com>

Adds a hash table structure to efficiently look up the data temperature
of a file. Also adds a function to calculate that temperature based on
some metrics kept in custom frequency data structs (in the next patch).
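
Conceptually, finding candidates at a given temperature is then just a
walk of a single hash bucket. A minimal sketch using the types added
below (locking shown, error handling omitted):

	struct heat_hashlist_entry *bucket;
	struct heat_hashlist_node *node;
	struct hlist_node *pos;
	int temp = btrfs_get_temp(fdata);	/* 0..HEAT_MAX_VALUE */

	bucket = &root->heat_inode_hl[temp & HEAT_HASH_MASK];
	read_lock(&bucket->rwlock);
	hlist_for_each_entry(node, pos, &bucket->hashhead, hashnode) {
		/* node->freq_data points back at the hot item */
	}
	read_unlock(&bucket->rwlock);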

Signed-off-by: Ben Chociej <bchociej@gmail.com>
Signed-off-by: Matt Lupfer <mlupfer@gmail.com>
Signed-off-by: Conor Scott <conscott@vt.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/btrfs/hotdata_hash.c |  338 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_hash.h |  155 ++++++++++++++++++++++
 2 files changed, 493 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_hash.c
 create mode 100644 fs/btrfs/hotdata_hash.h

diff --git a/fs/btrfs/hotdata_hash.c b/fs/btrfs/hotdata_hash.c
new file mode 100644
index 0000000..b789edd
--- /dev/null
+++ b/fs/btrfs/hotdata_hash.c
@@ -0,0 +1,338 @@
+/*
+ * fs/btrfs/hotdata_hash.c
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/hash.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "hotdata_relocate.h"
+#include "async-thread.h"
+#include "ctree.h"
+
+struct heat_hashlist_node *alloc_heat_hashlist_node(gfp_t mask)
+{
+	struct heat_hashlist_node *node;
+
+	node = kmalloc(sizeof(struct heat_hashlist_node), mask);
+	if (!node)
+		return NULL;
+	INIT_HLIST_NODE(&node->hashnode);
+	node->freq_data = NULL;
+	node->hlist = NULL;
+	node->location = BTRFS_ON_ROTATING;
+	spin_lock_init(&node->lock);
+	spin_lock_init(&node->location_lock);
+	atomic_set(&node->refs, 1);
+
+	return node;
+}
+
+void free_heat_hashlists(struct btrfs_root *root)
+{
+	int i;
+
+	/* Free node/range heat hash lists */
+	for (i = 0; i < HEAT_HASH_SIZE; i++) {
+		struct hlist_node *pos = NULL, *pos2 = NULL;
+		struct heat_hashlist_node *heatnode = NULL;
+
+		hlist_for_each_safe(pos, pos2,
+			&root->heat_inode_hl[i].hashhead) {
+			heatnode = hlist_entry(pos, struct heat_hashlist_node,
+				hashnode);
+			hlist_del(pos);
+			kfree(heatnode);
+		}
+		hlist_for_each_safe(pos, pos2,
+			&root->heat_range_hl[i].hashhead) {
+			heatnode = hlist_entry(pos, struct heat_hashlist_node,
+				hashnode);
+			hlist_del(pos);
+			kfree(heatnode);
+		}
+	}
+}
+
+/*
+ * btrfs_get_temp is responsible for distilling the six heat criteria
+ * (described in detail in hotdata_hash.h) down into a single temperature
+ * value for the data, which is an integer between 0 and HEAT_MAX_VALUE.
+ *
+ * To accomplish this, the raw values from the btrfs_freq_data structure
+ * are shifted various ways in order to make the temperature calculation more
+ * or less sensitive to each value.
+ *
+ * Once this calibration has happened, we do some additional normalization and
+ * make sure that everything fits nicely in a u32. From there, we take a very
+ * rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+ * values act as weights for the average.
+ *
+ * Finally, we use the HEAT_HASH_BITS value, which determines the size of the
+ * heat hash list, to normalize the temperature to the proper granularity.
+ */
+int btrfs_get_temp(struct btrfs_freq_data *fdata)
+{
+	u32 result = 0;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+
+	u32 nrr_heat = fdata->nr_reads << NRR_MULTIPLIER_POWER;
+	u32 nrw_heat = fdata->nr_writes << NRW_MULTIPLIER_POWER;
+
+	u64 ltr_heat = (cur_time - timespec_to_ns(&fdata->last_read_time))
+			>> LTR_DIVIDER_POWER;
+	u64 ltw_heat = (cur_time - timespec_to_ns(&fdata->last_write_time))
+			>> LTW_DIVIDER_POWER;
+
+	u64 avr_heat = (((u64) -1) - fdata->avg_delta_reads)
+			>> AVR_DIVIDER_POWER;
+	u64 avw_heat = (((u64) -1) - fdata->avg_delta_writes)
+			>> AVW_DIVIDER_POWER;
+
+	if (ltr_heat >= ((u64) 1 << 32))
+		ltr_heat = 0;
+	else
+		ltr_heat = ((u64) 1 << 32) - ltr_heat;
+	/* ltr_heat is now guaranteed to be u32 safe */
+
+	if (ltw_heat >= ((u64) 1 << 32))
+		ltw_heat = 0;
+	else
+		ltw_heat = ((u64) 1 << 32) - ltw_heat;
+	/* ltw_heat is now guaranteed to be u32 safe */
+
+	if (avr_heat >= ((u64) 1 << 32))
+		avr_heat = (u32) -1;
+	/* avr_heat is now guaranteed to be u32 safe */
+
+	if (avw_heat >= ((u64) 1 << 32))
+		avw_heat = (u32) -1;
+	/* avw_heat is now guaranteed to be u32 safe */
+
+	nrr_heat = nrr_heat >> (3 - NRR_COEFF_POWER);
+	nrw_heat = nrw_heat >> (3 - NRW_COEFF_POWER);
+	ltr_heat = ltr_heat >> (3 - LTR_COEFF_POWER);
+	ltw_heat = ltw_heat >> (3 - LTW_COEFF_POWER);
+	avr_heat = avr_heat >> (3 - AVR_COEFF_POWER);
+	avw_heat = avw_heat >> (3 - AVW_COEFF_POWER);
+
+	result = nrr_heat + nrw_heat + (u32) ltr_heat +
+		 (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+	return result >> (32 - HEAT_HASH_BITS);
+}
+
+static int is_old(struct btrfs_freq_data *freq_data)
+{
+	int ret = 0;
+	struct timespec ckt = current_kernel_time();
+
+	u64 cur_time = timespec_to_ns(&ckt);
+	u64 last_read_ns = (cur_time -
+			    timespec_to_ns(&freq_data->last_read_time));
+	u64 last_write_ns = (cur_time -
+			     timespec_to_ns(&freq_data->last_write_time));
+	u64 kick_ns = TIME_TO_KICK * (u64)1000000000;
+	if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
+		ret = 1;
+	return ret;
+}
+
+
+/* update temps for each range item for aging purposes */
+static void btrfs_update_range_data(struct hot_inode_item *hot_inode,
+				    struct btrfs_root *root)
+{
+	struct hot_range_tree *inode_range_tree;
+	struct rb_node *node;
+	struct rb_node *old_node;
+	struct hot_range_item *current_range;
+	int location, range_is_old;
+
+	inode_range_tree = &hot_inode->hot_range_tree;
+	write_lock(&inode_range_tree->lock);
+	node = rb_first(&inode_range_tree->map);
+	/* Walk the hot_range_tree for inode */
+	while (node) {
+		current_range = rb_entry(node, struct hot_range_item, rb_node);
+		btrfs_update_heat_index(&current_range->freq_data, root);
+		old_node = node;
+		node = rb_next(node);
+		/* if the range is cold and off ssd, quit keeping track of it */
+		spin_lock(&current_range->heat_node->location_lock);
+		location = current_range->heat_node->location;
+		spin_unlock(&current_range->heat_node->location_lock);
+
+		spin_lock(&current_range->lock);
+		range_is_old = is_old(&current_range->freq_data);
+		spin_unlock(&current_range->lock);
+
+		if (range_is_old && location == BTRFS_ON_ROTATING) {
+			if (atomic_read(&current_range->heat_node->refs) <= 1)
+				btrfs_remove_range_from_heat_index(hot_inode,
+							current_range, root);
+		}
+	}
+	write_unlock(&inode_range_tree->lock);
+}
+
+/* update temps for each hot inode item and hot range item for aging purposes */
+static void iterate_and_update_heat(struct btrfs_root *root)
+{
+	struct btrfs_root *fs_root;
+	struct hot_inode_item *current_hot_inode;
+	unsigned long inode_num;
+
+	fs_root = root->fs_info->fs_root;
+	/* walk the inode tree */
+	current_hot_inode = find_next_hot_inode(fs_root, 0);
+	while (current_hot_inode) {
+		btrfs_update_heat_index(&current_hot_inode->freq_data, root);
+		btrfs_update_range_data(current_hot_inode, fs_root);
+		inode_num = current_hot_inode->i_ino;
+		free_hot_inode_item(current_hot_inode);
+		current_hot_inode = find_next_hot_inode(fs_root,
+				inode_num + 1);
+	}
+}
+
+/*
+ * kthread that iterates over each hot_inode_item and hot_range_item,
+ * updating their temperatures so items can be shifted between heat
+ * hash table buckets for relocation and hot file detection
+ */
+static int update_inode_kthread(void *arg)
+{
+	struct btrfs_root *root = arg;
+	unsigned long delay;
+	do {
+		delay = HZ * HEAT_UPDATE_DELAY;
+		if (mutex_trylock(&root->fs_info->
+			hot_data_update_kthread_mutex)) {
+			iterate_and_update_heat(root);
+			mutex_unlock(&root->fs_info->
+				     hot_data_update_kthread_mutex);
+		}
+		if (freezing(current)) {
+			refrigerator();
+		} else {
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (!kthread_should_stop())
+				schedule_timeout(delay);
+			__set_current_state(TASK_RUNNING);
+		}
+	} while (!kthread_should_stop());
+	return 0;
+}
+
+/* init the kthread to do temp updates */
+void init_hash_list_kthread(struct btrfs_root *root)
+{
+	root->fs_info->hot_data_update_kthread =
+		kthread_run(update_inode_kthread, root,
+			    "update_hot_inode_kthread");
+	if (IS_ERR(root->fs_info->hot_data_update_kthread)) {
+		printk(KERN_ERR
+		       "btrfs: failed to start hot data update kthread\n");
+		root->fs_info->hot_data_update_kthread = NULL;
+	}
+}
+
+/*
+ * Take a hot inode that is now cold, remove it from the indexes, and
+ * clean up any associated memory. This involves removing the hot inode
+ * from the rb tree and the heat hash lists, then freeing all of its
+ * memory, including range memory.
+ */
+void btrfs_remove_inode_from_heat_index(struct hot_inode_item *hot_inode,
+				  struct btrfs_root *root)
+{
+	struct rb_node *node2;
+	struct hot_range_item *hr;
+
+	/* remove hot inode item from rb tree */
+	write_lock(&root->hot_inode_tree.lock);
+	remove_hot_inode_item(&root->hot_inode_tree, hot_inode);
+	write_unlock(&root->hot_inode_tree.lock);
+
+	/* remove the hot inode item from hash table */
+	write_lock(&hot_inode->heat_node->hlist->rwlock);
+	hlist_del(&hot_inode->heat_node->hashnode);
+	write_unlock(&hot_inode->heat_node->hlist->rwlock);
+
+	/* remove ranges in inode from rb-tree and heat table first */
+	write_lock(&hot_inode->hot_range_tree.lock);
+	node2 = rb_first(&hot_inode->hot_range_tree.map);
+	while (node2) {
+		hr = rb_entry(node2, struct hot_range_item,
+			rb_node);
+
+		/* remove range from range tree */
+		remove_hot_range_item(&hot_inode->hot_range_tree, hr);
+
+		/* remove range from hash list */
+		write_lock(&hr->heat_node->hlist->rwlock);
+		hlist_del(&hr->heat_node->hashnode);
+		write_unlock(&hr->heat_node->hlist->rwlock);
+
+		/* free up memory */
+		kfree(hr->heat_node);
+		free_hot_range_item(hr);
+
+		node2 = rb_first(&hot_inode->hot_range_tree.map);
+	}
+	write_unlock(&hot_inode->hot_range_tree.lock);
+
+	/* free up associated inode memory */
+	kfree(hot_inode->heat_node);
+	free_hot_inode_item(hot_inode);
+}
+
+/*
+ * Take a hot range that is now cold, remove it from the indexes, and
+ * clean up any associated memory. This involves removing the hot range
+ * from the rb tree and the heat hash lists, then freeing all of its
+ * memory.
+ */
+void btrfs_remove_range_from_heat_index(struct hot_inode_item *hot_inode,
+					struct hot_range_item *hr,
+					struct btrfs_root *root)
+{
+	/* remove range from rb tree */
+	remove_hot_range_item(&hot_inode->hot_range_tree, hr);
+
+	/* remove range from hash list */
+	spin_lock(&hr->heat_node->lock);
+	write_lock(&hr->heat_node->hlist->rwlock);
+	hlist_del(&hr->heat_node->hashnode);
+	write_unlock(&hr->heat_node->hlist->rwlock);
+	spin_unlock(&hr->heat_node->lock);
+
+	/* free up memory */
+	kfree(hr->heat_node);
+	free_hot_range_item(hr);
+}
diff --git a/fs/btrfs/hotdata_hash.h b/fs/btrfs/hotdata_hash.h
new file mode 100644
index 0000000..10d28ee
--- /dev/null
+++ b/fs/btrfs/hotdata_hash.h
@@ -0,0 +1,155 @@
+/*
+ * fs/btrfs/hotdata_hash.h
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __HOTDATAHASH__
+#define __HOTDATAHASH__
+
+#include <linux/list.h>
+#include <linux/hash.h>
+
+#define HEAT_HASH_BITS 8
+#define HEAT_HASH_SIZE (1 << HEAT_HASH_BITS)
+#define HEAT_HASH_MASK (HEAT_HASH_SIZE - 1)
+#define HEAT_MIN_VALUE 0
+#define HEAT_MAX_VALUE (HEAT_HASH_SIZE - 1)
+#define HEAT_NO_MIGRATE HEAT_HASH_SIZE
+
+/* time after which to stop tracking data (seconds) */
+#define TIME_TO_KICK 400
+
+/* set how often to update temps (seconds) */
+#define HEAT_UPDATE_DELAY 400
+
+/* initial heat threshold temperature */
+#define HEAT_INITIAL_THRESH 150
+
+/*
+ * The following comments explain what exactly comprises a unit of heat.
+ *
+ * Each of six values of heat are calculated and combined in order to form an
+ * overall temperature for the data:
+ *
+ * NRR - number of reads since mount
+ * NRW - number of writes since mount
+ * LTR - time elapsed since last read (ns)
+ * LTW - time elapsed since last write (ns)
+ * AVR - average delta between recent reads (ns)
+ * AVW - average delta between recent writes (ns)
+ *
+ * These values are divided (right-shifted) according to the *_DIVIDER_POWER
+ * values defined below to bring the numbers into a reasonable range. You can
+ * modify these values to fit your needs. However, each heat unit is a u32 and
+ * thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+ * carefully or else they could max out or be stuck at zero quite easily.
+ *
+ * (E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+ * delta would bring the temperature above zero, ever.)
+ *
+ * Finally, each value is added to the overall temperature between 0 and 8
+ * times, depending on its *_COEFF_POWER value. Note that the coefficients are
+ * also actually implemented with shifts, so take care to treat these values
+ * as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+ */
+
+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ *
+ * E.g., data with an average delta between 0 and 2^X ns will have a cold value
+ * of 0, which means a heat value equal to HEAT_MAX_VALUE.
+ */
+#define AVR_DIVIDER_POWER 40
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40
+#define AVW_COEFF_POWER 0
+
+struct btrfs_root;
+
+/* Hash list heads for heat hash table */
+struct heat_hashlist_entry {
+	struct hlist_head hashhead;
+	rwlock_t rwlock;
+	u32 temperature;
+};
+
+/* Nodes stored in each hash list of hash table */
+struct heat_hashlist_node {
+	struct hlist_node hashnode;
+	struct list_head node;
+	struct btrfs_freq_data *freq_data;
+	struct heat_hashlist_entry *hlist;
+
+	/*
+	 * number of references to this node
+	 * equals 1 (hashlist entry) + number
+	 * of private relocation lists it is on
+	 */
+	atomic_t refs;
+
+	spinlock_t lock; /* protects hlist */
+	spinlock_t location_lock; /* protects location */
+	u8 location; /* flag for whether data is on rotating media */
+};
+
+struct heat_hashlist_node *alloc_heat_hashlist_node(gfp_t mask);
+void free_heat_hashlists(struct btrfs_root *root);
+
+/*
+ * Returns a value from 0 to HEAT_MAX_VALUE indicating the temperature of the
+ * file (and consequently its bucket number in hashlist) (see hotdata_hash.c)
+ */
+int btrfs_get_temp(struct btrfs_freq_data *fdata);
+
+/*
+ * initialize kthread for each new mount point that
+ * periodically goes through hot inodes and hot ranges and ages them
+ * based on frequency of access
+ */
+void init_hash_list_kthread(struct btrfs_root *root);
+
+/*
+ * recalculates temperatures for inode or range
+ * and moves around in heat hash table based on temp
+ */
+void btrfs_update_heat_index(struct btrfs_freq_data *fdata,
+			       struct btrfs_root *root);
+
+/* remove from index and clean up all memory associated with hot range */
+void btrfs_remove_range_from_heat_index(struct hot_inode_item *hot_inode,
+					struct hot_range_item *hr,
+					struct btrfs_root *root);
+
+/* remove from index and clean up all memory associated with hot inode */
+void btrfs_remove_inode_from_heat_index(struct hot_inode_item *hot_inode,
+				  struct btrfs_root *root);
+
+#endif /* __HOTDATAHASH__ */
-- 
1.7.1


* [RFC v2 PATCH 2/6] Btrfs: Add data structures for hot data tracking
  2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 1/6] Btrfs: Add experimental hot data hash list index bchociej
@ 2010-08-12 22:22 ` bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 3/6] Btrfs: Add hot data relocation facilities bchociej
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: bchociej @ 2010-08-12 22:22 UTC (permalink / raw)
  To: chris.mason, linux-btrfs
  Cc: linux-fsdevel, linux-kernel, cmm, bcchocie, mrlupfer, crscott,
	bchociej, mlupfer, conscott

From: Ben Chociej <bchociej@gmail.com>

Adds hot_inode_tree and hot_range_tree structs to keep track of
frequently accessed files and ranges within files. Trees contain
hot_{inode,range}_items representing those files and ranges, each of
which contains a btrfs_freq_data struct with its frequency of access
metrics (number of {reads, writes}, last {read,write} time, frequency of
{reads,writes}).

Having these trees means that Btrfs can quickly determine the
temperature of some data by doing some calculations on the
btrfs_freq_data struct that hangs off of the tree item.

Also, since it isn't entirely obvious, the "frequency" of reads or
writes is determined by taking a kind of generalized average of the last
few (2^N for some tunable N) reads or writes.
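
Concretely, with N = BTRFS_FREQ_POWER (4 by default), each update in
btrfs_update_freq folds the newest inter-access delta into the running
average like so:

	new_delta = delta_ns >> N;
	new_avg = ((avg << N) - avg + new_delta) >> N;

The old average keeps weight (2^N - 1)/2^N per update, so for N = 4 a
single new access shifts the average by only 1/16th. The stored value
converges to delta_ns >> N, i.e. the average is effectively kept in
units of 2^N ns.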

Signed-off-by: Ben Chociej <bchociej@gmail.com>
Signed-off-by: Matt Lupfer <mlupfer@gmail.com>
Signed-off-by: Conor Scott <conscott@vt.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/btrfs/hotdata_map.c |  804 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_map.h |  167 ++++++++++
 2 files changed, 971 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_map.c
 create mode 100644 fs/btrfs/hotdata_map.h

diff --git a/fs/btrfs/hotdata_map.c b/fs/btrfs/hotdata_map.c
new file mode 100644
index 0000000..ddae0c4
--- /dev/null
+++ b/fs/btrfs/hotdata_map.c
@@ -0,0 +1,804 @@
+/*
+ * fs/btrfs/hotdata_map.c
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cache;
+static struct kmem_cache *hot_range_item_cache;
+
+static struct hot_inode_item *btrfs_update_inode_freq(struct btrfs_inode *inode,
+					       int create);
+
+static int btrfs_update_range_freq(struct hot_inode_item *he,
+					       u64 off, u64 len, int create,
+					       struct btrfs_root *root);
+
+/* init hot_inode_item kmem cache */
+int __init hot_inode_item_init(void)
+{
+	hot_inode_item_cache = kmem_cache_create("hot_inode_item",
+			sizeof(struct hot_inode_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+	if (!hot_inode_item_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+/* init hot_range_item kmem cache */
+int __init hot_range_item_init(void)
+{
+	hot_range_item_cache = kmem_cache_create("hot_range_item",
+			sizeof(struct hot_range_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+	if (!hot_range_item_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void hot_inode_item_exit(void)
+{
+	if (hot_inode_item_cache)
+		kmem_cache_destroy(hot_inode_item_cache);
+}
+
+void hot_range_item_exit(void)
+{
+	if (hot_range_item_cache)
+		kmem_cache_destroy(hot_range_item_cache);
+}
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+void hot_inode_tree_init(struct hot_inode_tree *tree)
+{
+	tree->map = RB_ROOT;
+	rwlock_init(&tree->lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_range_tree_init(struct hot_range_tree *tree)
+{
+	tree->map = RB_ROOT;
+	rwlock_init(&tree->lock);
+}
+
+/*
+ * Allocate a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_inode_item()
+ */
+struct hot_inode_item *alloc_hot_inode_item(unsigned long ino)
+{
+	struct hot_inode_item *he;
+	he = kmem_cache_alloc(hot_inode_item_cache, GFP_NOFS);
+	if (!he)
+		return NULL;
+
+	atomic_set(&he->refs, 1);
+	he->in_tree = 0;
+	he->i_ino = ino;
+	he->heat_node = alloc_heat_hashlist_node(GFP_NOFS);
+	if (!he->heat_node) {
+		kmem_cache_free(hot_inode_item_cache, he);
+		return NULL;
+	}
+	he->heat_node->freq_data = &he->freq_data;
+	he->freq_data.avg_delta_reads = (u64) -1;
+	he->freq_data.avg_delta_writes = (u64) -1;
+	he->freq_data.nr_reads = 0;
+	he->freq_data.nr_writes = 0;
+	he->freq_data.last_temp = 0;
+	he->freq_data.flags = FREQ_DATA_TYPE_INODE;
+	hot_range_tree_init(&he->hot_range_tree);
+
+	spin_lock_init(&he->lock);
+
+	return he;
+}
+
+/*
+ * Allocate a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_range_item()
+ */
+struct hot_range_item *alloc_hot_range_item(struct hot_inode_item *he,
+					    u64 start, u64 len)
+{
+	struct hot_range_item *hr;
+	hr = kmem_cache_alloc(hot_range_item_cache, GFP_NOFS);
+	if (!hr)
+		return NULL;
+	atomic_set(&hr->refs, 1);
+	hr->in_tree = 0;
+	hr->start = start & RANGE_SIZE_MASK;
+	hr->len = len;
+	hr->hot_inode = he;
+	hr->heat_node = alloc_heat_hashlist_node(GFP_NOFS);
+	if (!hr->heat_node) {
+		kmem_cache_free(hot_range_item_cache, hr);
+		return NULL;
+	}
+	hr->heat_node->freq_data = &hr->freq_data;
+	hr->freq_data.avg_delta_reads = (u64) -1;
+	hr->freq_data.avg_delta_writes = (u64) -1;
+	hr->freq_data.nr_reads = 0;
+	hr->freq_data.nr_writes = 0;
+	hr->freq_data.flags = FREQ_DATA_TYPE_RANGE;
+
+	spin_lock_init(&hr->lock);
+
+	return hr;
+}
+
+/*
+ * Drops the reference out on hot_inode_item by one and free the structure
+ * if the reference count hits zero
+ */
+void free_hot_inode_item(struct hot_inode_item *he)
+{
+	if (!he)
+		return;
+	if (atomic_dec_and_test(&he->refs)) {
+		WARN_ON(he->in_tree);
+		kmem_cache_free(hot_inode_item_cache, he);
+	}
+}
+
+/*
+ * Drops the reference out on hot_range_item by one and free the structure
+ * if the reference count hits zero
+ */
+void free_hot_range_item(struct hot_range_item *hr)
+{
+	if (!hr)
+		return;
+	if (atomic_dec_and_test(&hr->refs)) {
+		WARN_ON(hr->in_tree);
+		kmem_cache_free(hot_range_item_cache, hr);
+	}
+}
+
+/* Frees the entire hot_inode_tree. Called by free_fs_root */
+void free_hot_inode_tree(struct btrfs_root *root)
+{
+	struct rb_node *node, *node2;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+
+	/* Free hot inode and range trees on fs root */
+	node = rb_first(&root->hot_inode_tree.map);
+
+	while (node) {
+		he = rb_entry(node, struct hot_inode_item,
+			rb_node);
+
+		node2 = rb_first(&he->hot_range_tree.map);
+
+		while (node2) {
+			hr = rb_entry(node2, struct hot_range_item,
+				rb_node);
+			remove_hot_range_item(&he->hot_range_tree, hr);
+			free_hot_range_item(hr);
+			node2 = rb_first(&he->hot_range_tree.map);
+		}
+
+		remove_hot_inode_item(&root->hot_inode_tree, he);
+		free_hot_inode_item(he);
+		node = rb_first(&root->hot_inode_tree.map);
+	}
+}
+
+static struct rb_node *tree_insert_inode_item(struct rb_root *root,
+					unsigned long inode_num,
+					struct rb_node *node)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *entry;
+
+	/* walk tree to find insertion point */
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_inode_item, rb_node);
+
+		if (inode_num < entry->i_ino)
+			p = &(*p)->rb_left;
+		else if (inode_num > entry->i_ino)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct hot_inode_item, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
+static u64 range_map_end(struct hot_range_item *hr)
+{
+	if (hr->start + hr->len < hr->start)
+		return (u64)-1;
+	return hr->start + hr->len - 1;
+}
+
+static struct rb_node *tree_insert_range_item(struct rb_root *root,
+					      u64 start,
+					      struct rb_node *node)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct hot_range_item *entry;
+
+
+	/* walk tree to find insertion point */
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_range_item, rb_node);
+
+		if (start < entry->start)
+			p = &(*p)->rb_left;
+		else if (start >= range_map_end(entry))
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct hot_range_item, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
+/*
+ * Add a hot_inode_item to a hot_inode_tree. If the tree already contains
+ * an item with the index given, return -EEXIST
+ */
+int add_hot_inode_item(struct hot_inode_tree *tree,
+		       struct hot_inode_item *he)
+{
+	int ret = 0;
+	struct rb_node *rb;
+	struct hot_inode_item *exist;
+
+	exist = lookup_hot_inode_item(tree, he->i_ino);
+	if (exist) {
+		free_hot_inode_item(exist);
+		ret = -EEXIST;
+		goto out;
+	}
+	rb = tree_insert_inode_item(&tree->map, he->i_ino, &he->rb_node);
+	if (rb) {
+		ret = -EEXIST;
+		goto out;
+	}
+	atomic_inc(&he->refs);
+out:
+	return ret;
+}
+
+/*
+ * Add a hot_range_item to a hot_range_tree. If the tree already contains
+ * an item with the index given, return -EEXIST
+ *
+ * Also optionally aggressively merge ranges (currently disabled)
+ */
+int add_hot_range_item(struct hot_range_tree *tree,
+		       struct hot_range_item *hr)
+{
+	int ret = 0;
+	struct rb_node *rb;
+	struct hot_range_item *exist;
+	/* struct hot_range_item *merge = NULL; */
+
+	exist = lookup_hot_range_item(tree, hr->start);
+	if (exist) {
+		free_hot_range_item(exist);
+		ret = -EEXIST;
+		goto out;
+	}
+	rb = tree_insert_range_item(&tree->map, hr->start, &hr->rb_node);
+	if (rb) {
+		ret = -EEXIST;
+		goto out;
+	}
+
+	atomic_inc(&hr->refs);
+
+out:
+	return ret;
+}
+
+/*
+ * Lookup a hot_inode_item in the hot_inode_tree with the given index
+ * (inode_num)
+ */
+struct hot_inode_item *lookup_hot_inode_item(struct hot_inode_tree *tree,
+					 unsigned long inode_num)
+{
+	struct rb_node **p = &(tree->map.rb_node);
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *entry;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_inode_item, rb_node);
+
+		if (inode_num < entry->i_ino)
+			p = &(*p)->rb_left;
+		else if (inode_num > entry->i_ino)
+			p = &(*p)->rb_right;
+		else {
+			atomic_inc(&entry->refs);
+			return entry;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Lookup a hot_range_item in a hot_range_tree with the given index
+ * (start, offset)
+ */
+struct hot_range_item *lookup_hot_range_item(struct hot_range_tree *tree,
+					     u64 start)
+{
+	struct rb_node **p = &(tree->map.rb_node);
+	struct rb_node *parent = NULL;
+	struct hot_range_item *entry;
+
+	/* ensure start is on a range boundary */
+	start = start & RANGE_SIZE_MASK;
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_range_item, rb_node);
+
+		if (start < entry->start)
+			p = &(*p)->rb_left;
+		else if (start > range_map_end(entry))
+			p = &(*p)->rb_right;
+		else {
+			atomic_inc(&entry->refs);
+			return entry;
+		}
+	}
+	return NULL;
+}
+
+int remove_hot_inode_item(struct hot_inode_tree *tree,
+			  struct hot_inode_item *he)
+{
+	int ret = 0;
+	rb_erase(&he->rb_node, &tree->map);
+	he->in_tree = 0;
+	return ret;
+}
+
+int remove_hot_range_item(struct hot_range_tree *tree,
+			  struct hot_range_item *hr)
+{
+	int ret = 0;
+	rb_erase(&hr->rb_node, &tree->map);
+	hr->in_tree = 0;
+	return ret;
+}
+
+/* Returns the percent of SSD that is full. If no SSD is found, returns 101. */
+inline int __btrfs_ssd_filled(struct btrfs_root *root)
+{
+	struct btrfs_space_info *info;
+	struct btrfs_device *device;
+	struct list_head *head = &root->fs_info->fs_devices->devices;
+	int slot_count = 0;
+	u64 total_ssd_bytes = 0;
+	u64 ssd_bytes_used = 0;
+
+	/*
+	 * iterate through devices. if they're nonrotating, add their bytes
+	 * to the total_ssd_bytes
+	 */
+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+
+	list_for_each_entry(device, head, dev_list) {
+		if (blk_queue_nonrot(bdev_get_queue(device->bdev)))
+			total_ssd_bytes += device->total_bytes;
+	}
+
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+
+	if (total_ssd_bytes == 0)
+		return 101;
+
+	/*
+	 * iterate through space_info. if the SSD data block group is found,
+	 * add the bytes used by that group to ssd_bytes_used
+	 */
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(info, &root->fs_info->space_info, list)
+		slot_count++;
+
+	list_for_each_entry_rcu(info, &root->fs_info->space_info, list) {
+		if (slot_count == 0)
+			break;
+		slot_count--;
+
+		if (info->flags & BTRFS_BLOCK_GROUP_DATA_SSD)
+			ssd_bytes_used += info->bytes_used;
+	}
+
+	rcu_read_unlock();
+
+	/* finish up. return percent of SSD filled. */
+	BUG_ON(ssd_bytes_used > total_ssd_bytes);
+
+	return (int) div64_u64(ssd_bytes_used * 100, total_ssd_bytes);
+}
+
+/*
+ * updates the current temperature threshold for hot data
+ * migration based on how full the SSDs are.
+ */
+int btrfs_update_threshold(struct btrfs_root *root, int update)
+{
+	int threshold = root->heat_threshold;
+	int full = __btrfs_ssd_filled(root);
+	printk(KERN_INFO "btrfs ssd filled %d\n", full);
+
+	/* Sometimes update the global threshold, others not */
+	if (!update && full < HIGH_WATER_LEVEL)
+		return full;
+
+	if (unlikely(full > 100)) {
+		threshold = HEAT_MAX_VALUE + 1;
+	} else {
+
+		WARN_ON(HIGH_WATER_LEVEL > 100 || LOW_WATER_LEVEL < 0);
+
+		if (full >= HIGH_WATER_LEVEL)
+			threshold += THRESH_UP_SPEED;
+		else if (full <= LOW_WATER_LEVEL)
+			threshold -= THRESH_DOWN_SPEED;
+
+		if (threshold > HEAT_MAX_VALUE)
+			threshold = HEAT_MAX_VALUE + 1;
+		else if (threshold < 0)
+			threshold = 0;
+	}
+
+	root->heat_threshold = threshold;
+	return full;
+}
+
+/* main function to update access frequency from read/writepage(s) hooks */
+inline void btrfs_update_freqs(struct inode *inode, u64 start,
+	u64 len, int create)
+{
+	struct hot_inode_item *he;
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
+
+	he = btrfs_update_inode_freq(btrfs_inode, create);
+
+	/*
+	 * this line was moved to __do_relocate_kthread:
+	 *
+	 * __btrfs_update_threshold(btrfs_inode->root);
+	 */
+
+	WARN_ON(!he || IS_ERR(he));
+
+	if (he && !IS_ERR(he)) {
+		btrfs_update_range_freq(he, start, len,
+			create, btrfs_inode->root);
+
+		free_hot_inode_item(he);
+	}
+
+}
+
+/* Update inode frequency struct */
+static struct hot_inode_item *btrfs_update_inode_freq(struct btrfs_inode
+						*inode, int create)
+{
+	struct hot_inode_tree *hitree = &inode->root->hot_inode_tree;
+	struct hot_inode_item *he;
+	struct btrfs_root *root = inode->root;
+
+	read_lock(&hitree->lock);
+	he = lookup_hot_inode_item(hitree, inode->vfs_inode.i_ino);
+	read_unlock(&hitree->lock);
+
+	if (!he) {
+		he = alloc_hot_inode_item(inode->vfs_inode.i_ino);
+
+		if (!he || IS_ERR(he))
+			goto out;
+
+		write_lock(&hitree->lock);
+		add_hot_inode_item(hitree, he);
+		write_unlock(&hitree->lock);
+	}
+
+	if ((!root->fs_info->hot_data_relocate_kthread)
+	    || root->fs_info->hot_data_relocate_kthread->pid != current->pid) {
+		spin_lock(&he->lock);
+		btrfs_update_freq(&he->freq_data, create);
+		spin_unlock(&he->lock);
+		btrfs_update_heat_index(&he->freq_data, root);
+	}
+
+out:
+	return he;
+}
+
+/* Update range frequency struct */
+static int btrfs_update_range_freq(struct hot_inode_item *he,
+					       u64 off, u64 len, int create,
+					       struct btrfs_root *root)
+{
+	struct hot_range_tree *hrtree = &he->hot_range_tree;
+	struct hot_range_item *hr = NULL;
+	u64 start_off = off & RANGE_SIZE_MASK;
+	u64 end_off = (off + len - 1) & RANGE_SIZE_MASK;
+	u64 cur;
+	int ret = 0;
+
+	if (len == 0)
+		return 1;
+
+	/*
+	 * Align ranges on RANGE_SIZE boundary to prevent proliferation
+	 * of range structs
+	 */
+	for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) {
+		read_lock(&hrtree->lock);
+		hr = lookup_hot_range_item(hrtree, cur);
+		read_unlock(&hrtree->lock);
+
+		if (!hr) {
+			hr = alloc_hot_range_item(he, cur, RANGE_SIZE);
+			if (!hr || IS_ERR(hr)) {
+				ret = 1;
+				goto out;
+			}
+
+			write_lock(&hrtree->lock);
+			add_hot_range_item(hrtree, hr);
+			write_unlock(&hrtree->lock);
+		}
+
+		if ((!root->fs_info->hot_data_relocate_kthread)
+		     || root->fs_info->hot_data_relocate_kthread->pid
+		     != current->pid) {
+			spin_lock(&hr->lock);
+			btrfs_update_freq(&hr->freq_data, create);
+			spin_unlock(&hr->lock);
+
+			btrfs_update_heat_index(&hr->freq_data, root);
+		}
+
+		free_hot_range_item(hr);
+
+	}
+out:
+	return ret;
+}
+
+/*
+ * This function does the actual work of updating the frequency numbers,
+ * whatever they turn out to be. BTRFS_FREQ_POWER determines how many atime
+ * deltas we keep track of (as a power of 2). So, setting it to anything above
+ * 16ish is probably overkill. Also, the higher the power, the more bits get
+ * right shifted out of the timestamp, reducing precision, so take note of that
+ * as well.
+ *
+ * The caller should have already locked fdata's parent's spinlock.
+ *
+ * BTRFS_FREQ_POWER, defined immediately below, determines how heavily to weight
+ * the current frequency numbers against the newest access. For example, a value
+ * of 4 means that the new access information will be weighted 1/16th (ie 2^-4)
+ * as heavily as the existing frequency info. In essence, this is a kludged-
+ * together version of a weighted average, since we can't afford to keep all of
+ * the information that it would take to get a _real_ weighted average.
+ */
+#define BTRFS_FREQ_POWER 4
+void btrfs_update_freq(struct btrfs_freq_data *fdata, int create)
+{
+	struct timespec old_atime;
+	struct timespec current_time;
+	struct timespec delta_ts;
+	u64 new_avg;
+	u64 new_delta;
+
+	if (unlikely(create)) {
+		old_atime = fdata->last_write_time;
+		fdata->nr_writes += 1;
+		new_avg = fdata->avg_delta_writes;
+	} else {
+		old_atime = fdata->last_read_time;
+		fdata->nr_reads += 1;
+		new_avg = fdata->avg_delta_reads;
+	}
+
+	current_time = current_kernel_time();
+	delta_ts = timespec_sub(current_time, old_atime);
+	new_delta = timespec_to_ns(&delta_ts) >> BTRFS_FREQ_POWER;
+
+	new_avg = (new_avg << BTRFS_FREQ_POWER) - new_avg + new_delta;
+	new_avg = new_avg >> BTRFS_FREQ_POWER;
+
+	if (unlikely(create)) {
+		fdata->last_write_time = current_time;
+		fdata->avg_delta_writes = new_avg;
+	} else {
+		fdata->last_read_time = current_time;
+		fdata->avg_delta_reads = new_avg;
+	}
+}
+
+/*
+ * Get a new temperature and, if necessary, move the heat_node corresponding
+ * to this inode or range to the proper hashlist with the new temperature
+ */
+void btrfs_update_heat_index(struct btrfs_freq_data *fdata,
+			       struct btrfs_root *root)
+{
+	int temp = 0;
+	int moved = 0;
+	struct heat_hashlist_entry *buckets, *current_bucket = NULL;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+
+	if (fdata->flags & FREQ_DATA_TYPE_INODE) {
+		he = freq_data_get_he(fdata);
+		buckets = root->heat_inode_hl;
+
+		spin_lock(&he->lock);
+		temp = btrfs_get_temp(fdata);
+		fdata->last_temp = temp;
+		spin_unlock(&he->lock);
+
+		spin_lock(&he->heat_node->lock);
+		if (he->heat_node->hlist == NULL) {
+			current_bucket = buckets + temp;
+			moved = 1;
+		} else {
+			write_lock(&he->heat_node->hlist->rwlock);
+			current_bucket = he->heat_node->hlist;
+			if (current_bucket->temperature != temp) {
+				hlist_del(&he->heat_node->hashnode);
+				current_bucket = buckets + temp;
+				moved = 1;
+			}
+			write_unlock(&he->heat_node->hlist->rwlock);
+		}
+
+		if (moved) {
+			write_lock(&current_bucket->rwlock);
+			hlist_add_head(&he->heat_node->hashnode,
+				&current_bucket->hashhead);
+			he->heat_node->hlist = current_bucket;
+			write_unlock(&current_bucket->rwlock);
+		}
+		spin_unlock(&he->heat_node->lock);
+
+	} else if (fdata->flags & FREQ_DATA_TYPE_RANGE) {
+		hr = freq_data_get_hr(fdata);
+		buckets = root->heat_range_hl;
+
+		spin_lock(&hr->lock);
+		temp = btrfs_get_temp(fdata);
+		fdata->last_temp = temp;
+		spin_unlock(&hr->lock);
+
+		spin_lock(&hr->heat_node->lock);
+		if (hr->heat_node->hlist == NULL) {
+			current_bucket = buckets + temp;
+			moved = 1;
+		} else {
+			write_lock(&hr->heat_node->hlist->rwlock);
+			current_bucket = hr->heat_node->hlist;
+			if (current_bucket->temperature != temp) {
+				hlist_del(&hr->heat_node->hashnode);
+				current_bucket = buckets + temp;
+				moved = 1;
+			}
+			write_unlock(&hr->heat_node->hlist->rwlock);
+		}
+
+		if (moved) {
+			write_lock(&current_bucket->rwlock);
+			hlist_add_head(&hr->heat_node->hashnode,
+				&current_bucket->hashhead);
+			hr->heat_node->hlist = current_bucket;
+			write_unlock(&current_bucket->rwlock);
+		}
+		spin_unlock(&hr->heat_node->lock);
+	}
+}
+
+/* Walk the hot_inode_tree, locking as necessary */
+struct hot_inode_item *find_next_hot_inode(struct btrfs_root *root,
+						  u64 objectid)
+{
+	struct rb_node *node;
+	struct rb_node *prev;
+	struct hot_inode_item *entry;
+
+	read_lock(&root->hot_inode_tree.lock);
+
+	node = root->hot_inode_tree.map.rb_node;
+	prev = NULL;
+	while (node) {
+		prev = node;
+		entry = rb_entry(node, struct hot_inode_item, rb_node);
+
+		if (objectid < entry->i_ino)
+			node = node->rb_left;
+		else if (objectid > entry->i_ino)
+			node = node->rb_right;
+		else
+			break;
+	}
+	if (!node) {
+		while (prev) {
+			entry = rb_entry(prev, struct hot_inode_item, rb_node);
+			if (objectid <= entry->i_ino) {
+				node = prev;
+				break;
+			}
+			prev = rb_next(prev);
+		}
+	}
+	if (node) {
+		entry = rb_entry(node, struct hot_inode_item, rb_node);
+		/*
+		 * increase reference count to prevent pruning while
+		 * caller is using the hot_inode_item
+		 */
+		atomic_inc(&entry->refs);
+
+		read_unlock(&root->hot_inode_tree.lock);
+		return entry;
+	}
+
+	read_unlock(&root->hot_inode_tree.lock);
+	return NULL;
+}
+
diff --git a/fs/btrfs/hotdata_map.h b/fs/btrfs/hotdata_map.h
new file mode 100644
index 0000000..d359fce
--- /dev/null
+++ b/fs/btrfs/hotdata_map.h
@@ -0,0 +1,167 @@
+/*
+ * fs/btrfs/hotdata_map.h
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __HOTDATAMAP__
+#define __HOTDATAMAP__
+
+#include <linux/rbtree.h>
+
+/* values for btrfs_freq_data flags */
+#define FREQ_DATA_TYPE_INODE 1		/* freq data struct is for an inode */
+#define FREQ_DATA_TYPE_RANGE (1 << 1)	/* freq data struct is for a range */
+#define FREQ_DATA_HEAT_HOT (1 << 2)	/* freq data struct is for hot data */
+					/* (not implemented) */
+/* size of sub-file ranges */
+#define RANGE_SIZE (1<<20)
+#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1)))
+
+/* macros to wrap container_of()'s for hot data structs */
+#define freq_data_get_he(x) (struct hot_inode_item *) container_of(x, \
+					struct hot_inode_item, freq_data)
+#define freq_data_get_hr(x) (struct hot_range_item *) container_of(x, \
+					struct hot_range_item, freq_data)
+#define rb_node_get_he(x) (struct hot_inode_item *) container_of(x, \
+					struct hot_inode_item, rb_node)
+#define rb_node_get_hr(x) (struct hot_range_item *) container_of(x, \
+					struct hot_range_item, rb_node)
+
+#define HIGH_WATER_LEVEL 75	/* when to raise the threshold */
+#define LOW_WATER_LEVEL 50	/* when to lower the threshold */
+#define THRESH_UP_SPEED 10	/* how much to raise it by */
+#define THRESH_DOWN_SPEED 1	/* how much to lower it by */
+
+/* A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item */
+struct btrfs_freq_data {
+	struct timespec last_read_time;
+	struct timespec last_write_time;
+	u32 nr_reads;
+	u32 nr_writes;
+	u64 avg_delta_reads;
+	u64 avg_delta_writes;
+	u8 flags;
+	u32 last_temp;
+};
+
+/* A tree that sits on the fs_root */
+struct hot_inode_tree {
+	struct rb_root map;
+	rwlock_t lock;
+};
+
+/* A tree of ranges for each inode in the hot_inode_tree */
+struct hot_range_tree {
+	struct rb_root map;
+	rwlock_t lock;
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+	struct rb_node rb_node; /* node for hot_inode_tree rb_tree */
+	struct hot_range_tree hot_range_tree; /* tree of ranges in this
+						 inode */
+	struct btrfs_freq_data freq_data; /* frequency data for this inode */
+	struct heat_hashlist_node *heat_node; /* hashlist node for this
+						 inode */
+	unsigned long i_ino; /* inode number, copied from vfs_inode */
+	spinlock_t lock; /* protects freq_data, i_ino, in_tree */
+	atomic_t refs; /* prevents kfree */
+	u8 in_tree; /* used to check for errors in ref counting */
+};
+
+/*
+ * An item representing a range inside of an inode whose frequency
+ * is being tracked
+ */
+struct hot_range_item {
+	/* node for hot_range_tree rb_tree */
+	struct rb_node rb_node;
+
+	/* frequency data for this range */
+	struct btrfs_freq_data freq_data;
+
+	/* hashlist node for this range */
+	struct heat_hashlist_node *heat_node;
+
+	/* the hot_inode_item associated with this hot_range_item */
+	struct hot_inode_item *hot_inode;
+
+	/* starting offset of this range */
+	u64 start;
+
+	/* length of this range */
+	u64 len;
+
+	/* protects freq_data, start, len, and in_tree */
+	spinlock_t lock;
+
+	/* prevents kfree */
+	atomic_t refs;
+
+	/* used to check for errors in ref counting */
+	u8 in_tree;
+};
+
+struct btrfs_root;
+struct inode;
+
+void hot_inode_tree_init(struct hot_inode_tree *tree);
+void hot_range_tree_init(struct hot_range_tree *tree);
+
+struct hot_range_item *lookup_hot_range_item(struct hot_range_tree *tree,
+					    u64 start);
+
+struct hot_inode_item *lookup_hot_inode_item(struct hot_inode_tree *tree,
+					    unsigned long inode_num);
+
+int add_hot_inode_item(struct hot_inode_tree *tree,
+		       struct hot_inode_item *he);
+int add_hot_range_item(struct hot_range_tree *tree,
+		       struct hot_range_item *hr);
+
+int remove_hot_inode_item(struct hot_inode_tree *tree,
+			struct hot_inode_item *he);
+int remove_hot_range_item(struct hot_range_tree *tree,
+			 struct hot_range_item *hr);
+
+struct hot_inode_item *alloc_hot_inode_item(unsigned long ino);
+struct hot_range_item *alloc_hot_range_item(struct hot_inode_item *he,
+					    u64 start,
+					    u64 len);
+
+void free_hot_inode_item(struct hot_inode_item *he);
+void free_hot_range_item(struct hot_range_item *hr);
+void free_hot_inode_tree(struct btrfs_root *root);
+
+int __init hot_inode_item_init(void);
+int __init hot_range_item_init(void);
+
+void hot_inode_item_exit(void);
+void hot_range_item_exit(void);
+
+struct hot_inode_item *find_next_hot_inode(struct btrfs_root *root,
+						  u64 objectid);
+int btrfs_update_threshold(struct btrfs_root *, int update);
+void btrfs_update_freq(struct btrfs_freq_data *fdata, int create);
+void btrfs_update_freqs(struct inode *inode, u64 start, u64 len,
+	int create);
+
+#endif
-- 
1.7.1



* [RFC v2 PATCH 3/6] Btrfs: Add hot data relocation facilities
  2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 1/6] Btrfs: Add experimental hot data hash list index bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 2/6] Btrfs: Add data structures for hot data tracking bchociej
@ 2010-08-12 22:22 ` bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 4/6] Btrfs: Add debugfs interface for hot data stats bchociej
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: bchociej @ 2010-08-12 22:22 UTC (permalink / raw)
  To: chris.mason, linux-btrfs
  Cc: linux-fsdevel, linux-kernel, cmm, bcchocie, mrlupfer, crscott,
	bchociej, mlupfer, conscott

From: Ben Chociej <bchociej@gmail.com>

The relocation code operates on the heat hash lists to identify hot or
cold data logical file ranges that are candidates for relocation. The
triggering mechanism for relocation is controlled by a global heat
threshold integer value (fs_root->heat_threshold). Ranges are queued for
relocation by the periodically executing relocate kthread, which updates
the global heat threshold and responds to space pressure on the SSDs.

The heat hash lists index logical ranges by heat and provide a
constant-time access path to hot or cold range items. The relocation
kthread uses this path to find hot or cold items to move to/from SSD. To
ensure that the relocation kthread has a chance to sleep, and to prevent
thrashing between SSD and HDD, there is a configurable limit to how many
ranges are moved per iteration of the kthread. This limit may be overrun
in the case where space pressure requires that items be aggressively
moved from SSD back to HDD.

This needs still more resistance to thrashing and stronger (read:
actual) guarantees that relocation operations won't -ENOSPC.

The relocation code has introduced two new btrfs block group types:
BTRFS_BLOCK_GROUP_DATA_SSD and BTRFS_BLOCK_GROUP_METADATA_SSD. The latter
is not currently implemented; to wit, this implementation does not move
any metadata, including inlined extents, to SSD.

When mkfs'ing a volume with the hot data relocation option, initial
block groups are allocated to the proper disks. Runtime block group
allocation only allocates BTRFS_BLOCK_GROUP_DATA
BTRFS_BLOCK_GROUP_METADATA and BTRFS_BLOCK_GROUP_SYSTEM to HDD, and
likewise only allocates BTRFS_BLOCK_GROUP_DATA_SSD and
BTRFS_BLOCK_GROUP_METADATA_SSD to SSD (assuming, critically, the
HOTDATAMOVE option is set at mount time).
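
The allocation decision itself reduces to a choice of block group flags
per target media; an illustrative sketch (the real logic lives in the
extent-tree.c and volumes.c changes elsewhere in the series, and on_ssd
here is a stand-in for the rotating/nonrotating test against the
destination device):

	/* sketch: pick the data block group type for the target media */
	u64 flags = on_ssd ? BTRFS_BLOCK_GROUP_DATA_SSD
			   : BTRFS_BLOCK_GROUP_DATA;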

Signed-off-by: Ben Chociej <bchociej@gmail.com>
Signed-off-by: Matt Lupfer <mlupfer@gmail.com>
Signed-off-by: Conor Scott <conscott@vt.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/btrfs/hotdata_relocate.c |  783 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_relocate.h |   73 ++++
 2 files changed, 856 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_relocate.c
 create mode 100644 fs/btrfs/hotdata_relocate.h

diff --git a/fs/btrfs/hotdata_relocate.c b/fs/btrfs/hotdata_relocate.c
new file mode 100644
index 0000000..c5060c4
--- /dev/null
+++ b/fs/btrfs/hotdata_relocate.c
@@ -0,0 +1,783 @@
+/*
+ * fs/btrfs/hotdata_relocate.c
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/kthread.h>
+#include <linux/list.h>
+#include <linux/freezer.h>
+#include <linux/spinlock.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include "hotdata_map.h"
+#include "hotdata_relocate.h"
+#include "btrfs_inode.h"
+#include "ctree.h"
+#include "volumes.h"
+
+/*
+ * Hot data relocation strategy:
+ *
+ * The relocation code below operates on the heat hash lists to identify
+ * hot or cold data logical file ranges that are candidates for relocation.
+ * The triggering mechanism for relocation is controlled by a global heat
+ * threshold integer value (fs_root->heat_threshold). Ranges are queued
+ * for relocation by the periodically executing relocate kthread, which
+ * updates the global heat threshold and responds to space pressure on the
+ * SSDs.
+ *
+ * The heat hash lists index logical ranges by heat and provide a constant-time
+ * access path to hot or cold range items. The relocation kthread uses this
+ * path to find hot or cold items to move to/from SSD. To ensure that the
+ * relocation kthread has a chance to sleep, and to prevent thrashing between
+ * SSD and HDD, there is a configurable limit to how many ranges are moved per
+ * iteration of the kthread. This limit may be overrun in the case where space
+ * pressure requires that items be aggressively moved from SSD back to HDD.
+ *
+ * This still needs more resistance to thrashing and stronger (read: actual)
+ * guarantees that relocation operations won't -ENOSPC.
+ *
+ * The relocation code introduces two new btrfs block group types:
+ * BTRFS_BLOCK_GROUP_DATA_SSD and BTRFS_BLOCK_GROUP_METADATA_SSD. The latter
+ * is not currently implemented; to wit, this implementation does not move any
+ * metadata *including inlined extents* to SSD.
+ *
+ * When mkfs'ing a volume with the hot data relocation option, initial block
+ * groups are allocated to the proper disks. Runtime block group allocation
+ * only allocates BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA, and
+ * BTRFS_BLOCK_GROUP_SYSTEM to HDD, and likewise only allocates
+ * BTRFS_BLOCK_GROUP_DATA_SSD and BTRFS_BLOCK_GROUP_METADATA_SSD to SSD
+ * (assuming, critically, the HOTDATAMOVE option is set at mount time).
+ */
+
+/*
+ * prepares a hot or cold node to be moved to the specified location;
+ * sets up range args based on whether we move an entire inode or a range
+ */
+static int move_item(struct heat_hashlist_node *heatnode,
+		     struct btrfs_root *fs_root,
+		     int location)
+{
+	struct hot_inode_item *hot_inode_item;
+	struct hot_range_item *hot_range_item;
+	struct btrfs_relocate_range_args range_args;
+	int ret = 0;
+
+	if (heatnode->freq_data->flags & FREQ_DATA_TYPE_INODE) {
+
+		hot_inode_item = container_of(heatnode->freq_data,
+					      struct hot_inode_item,
+					      freq_data);
+		range_args.start = 0;
+		/* (u64)-1 moves the whole inode */
+		range_args.len = (u64)-1;
+		range_args.flags = 0;
+		range_args.extent_thresh = 1;
+		ret = btrfs_relocate_inode(hot_inode_item->i_ino,
+				     &range_args,
+				     fs_root,
+				     location);
+	} else if (heatnode->freq_data->flags & FREQ_DATA_TYPE_RANGE) {
+		hot_range_item = container_of(heatnode->freq_data,
+					      struct hot_range_item,
+					      freq_data);
+		range_args.start = hot_range_item->start;
+		range_args.len = hot_range_item->len;
+		range_args.flags = 0;
+		range_args.extent_thresh = 1;
+		ret = btrfs_relocate_inode(hot_range_item->hot_inode->i_ino,
+				     &range_args,
+				     fs_root,
+				     location);
+		}
+	return ret;
+}
+
+/*
+ * This thread iterates through the heat hash table and finds hot
+ * and cold data to move based on SSD pressure.
+ *
+ * It first iterates through the cold items below the heat
+ * threshold; if an item is on
+ * SSD and is now cold, we queue it up for relocation
+ * back to spinning disk. After scanning these items,
+ * we call the relocation code on all ranges that have been
+ * queued up for moving back to HDD.
+ *
+ * We then iterate through the items above the heat threshold
+ * and, if they are on HDD, queue them up to be moved to
+ * SSD. We then iterate through the queue and move hot ranges
+ * to SSD if they are not already there.
+ */
+static void __do_relocate_kthread(struct btrfs_root *root)
+{
+	int i;
+	int counter;
+	int heat_threshold;
+	int location;
+	int percent_ssd = 0;
+	struct btrfs_root *fs_root;
+	struct list_head *relocate_pos, *relocate_pos2;
+	struct heat_hashlist_node *relocate_heatnode = NULL;
+	struct list_head relocate_queue_to_rot;
+	struct list_head relocate_queue_to_nonrot;
+	static u32 run_count = 1;
+
+	run_count++;
+
+	fs_root = root->fs_info->fs_root;
+	percent_ssd = btrfs_update_threshold(fs_root, !(run_count % 15));
+	heat_threshold = fs_root->heat_threshold;
+
+do_cold:
+	INIT_LIST_HEAD(&relocate_queue_to_rot);
+
+	/* Don't move cold data to HDD unless there's space pressure */
+	if (percent_ssd < HIGH_WATER_LEVEL)
+		goto do_hot;
+
+	counter = 0;
+
+	/*
+	 * Move up to RELOCATE_MAX_ITEMS cold ranges back to spinning.
+	 * First, queue up items to move on the relocate_queue_to_rot.
+	 * Using (heat_threshold - 5) to control relocation hopefully
+	 * prevents some thrashing between SSD and HDD.
+	 */
+	for (i = 0; i <  heat_threshold - 5; i++) {
+		struct hlist_node *pos = NULL, *pos2 = NULL;
+		struct heat_hashlist_node *heatnode = NULL;
+		struct hlist_head *hashhead;
+		rwlock_t *lock;
+
+		hashhead = &fs_root->heat_range_hl[i].hashhead;
+		lock = &fs_root->heat_range_hl[i].rwlock;
+		read_lock(lock);
+
+		hlist_for_each_safe(pos, pos2, hashhead) {
+			heatnode = hlist_entry(pos,
+					struct heat_hashlist_node,
+					hashnode);
+
+			/* queue up on relocate list */
+			spin_lock(&heatnode->location_lock);
+			location = heatnode->location;
+			spin_unlock(&heatnode->location_lock);
+
+			if (location != BTRFS_ON_ROTATING) {
+				atomic_inc(&heatnode->refs);
+				list_add(&heatnode->node,
+					 &relocate_queue_to_rot);
+				counter++;
+			}
+
+			if (counter >= RELOCATE_MAX_ITEMS)
+				break;
+		}
+
+		read_unlock(lock);
+	}
+
+	/* Second, do the relocation */
+	list_for_each_safe(relocate_pos, relocate_pos2,
+		&relocate_queue_to_rot) {
+
+		relocate_heatnode = list_entry(relocate_pos,
+			struct heat_hashlist_node, node);
+
+		spin_lock(&relocate_heatnode->location_lock);
+		location = relocate_heatnode->location;
+		spin_unlock(&relocate_heatnode->location_lock);
+
+		if (location != BTRFS_ON_ROTATING) {
+			move_item(relocate_heatnode, fs_root,
+				BTRFS_ON_ROTATING);
+			relocate_heatnode->location = BTRFS_ON_ROTATING;
+		}
+
+		list_del(relocate_pos);
+		atomic_dec(&relocate_heatnode->refs);
+
+		if (kthread_should_stop())
+			return;
+	}
+
+	/*
+	 * Move up to RELOCATE_MAX_ITEMS ranges to SSD. Periodically check
+	 * for space pressure on SSD and goto do_cold if we've exceeded
+	 * the SSD capacity high water mark.
+	 * First, queue up items to move on relocate_queue_to_nonrot.
+	 */
+do_hot:
+	INIT_LIST_HEAD(&relocate_queue_to_nonrot);
+	counter = 0;
+
+	for (i = HEAT_MAX_VALUE; i >= heat_threshold; i--) {
+		struct hlist_node *pos = NULL, *pos2 = NULL;
+		struct heat_hashlist_node *heatnode = NULL;
+		struct hlist_head *hashhead;
+		rwlock_t *lock;
+
+		/* move hot ranges */
+		hashhead = &fs_root->heat_range_hl[i].hashhead;
+		lock =  &fs_root->heat_range_hl[i].rwlock;
+		read_lock(lock);
+
+		hlist_for_each_safe(pos, pos2, hashhead) {
+			heatnode = hlist_entry(pos,
+					struct heat_hashlist_node,
+					hashnode);
+
+			/* queue up on relocate list */
+			spin_lock(&heatnode->location_lock);
+			location = heatnode->location;
+			spin_unlock(&heatnode->location_lock);
+
+			if (location != BTRFS_ON_NONROTATING) {
+				atomic_inc(&heatnode->refs);
+				list_add(&heatnode->node,
+					 &relocate_queue_to_nonrot);
+				counter++;
+			}
+
+			if (counter >= RELOCATE_MAX_ITEMS)
+				break;
+		}
+
+		read_unlock(lock);
+	}
+
+	counter = 0;
+
+	/* Second, do the relocation */
+	list_for_each_safe(relocate_pos, relocate_pos2,
+		&relocate_queue_to_nonrot) {
+
+		relocate_heatnode = list_entry(relocate_pos,
+			struct heat_hashlist_node, node);
+
+		spin_lock(&relocate_heatnode->location_lock);
+		location = relocate_heatnode->location;
+		spin_unlock(&relocate_heatnode->location_lock);
+
+		if (location != BTRFS_ON_NONROTATING) {
+			move_item(relocate_heatnode, fs_root,
+				BTRFS_ON_NONROTATING);
+			relocate_heatnode->location = BTRFS_ON_NONROTATING;
+		}
+
+		list_del(relocate_pos);
+		atomic_dec(&relocate_heatnode->refs);
+
+		if (kthread_should_stop())
+			return;
+
+		/*
+		 * If we've exceeded the SSD capacity high water mark,
+		 * goto do_cold to relieve the pressure
+		 */
+		if (counter % 50 == 0) {
+			percent_ssd = btrfs_update_threshold(fs_root, 0);
+			heat_threshold = fs_root->heat_threshold;
+
+			if (percent_ssd >= HIGH_WATER_LEVEL)
+				goto do_cold;
+		}
+
+		counter++;
+	}
+}
+
+/* main loop for the relocation thread */
+static int do_relocate_kthread(void *arg)
+{
+	struct btrfs_root *root = arg;
+	unsigned long delay;
+	do {
+		delay = HZ * RELOCATE_TIME_DELAY;
+		if (mutex_trylock(
+			&root->fs_info->hot_data_relocate_kthread_mutex)) {
+			if (btrfs_test_opt(root, HOTDATA_MOVE))
+				__do_relocate_kthread(root);
+			mutex_unlock(
+				&root->fs_info->
+				hot_data_relocate_kthread_mutex);
+		}
+		if (freezing(current)) {
+			refrigerator();
+		} else {
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (!kthread_should_stop())
+				schedule_timeout(delay);
+			__set_current_state(TASK_RUNNING);
+		}
+	} while (!kthread_should_stop());
+	return 0;
+}
+
+/* kick off the relocate kthread */
+void init_hot_data_relocate_kthread(struct btrfs_root *root)
+{
+	root->fs_info->hot_data_relocate_kthread =
+					kthread_run(do_relocate_kthread,
+					root,
+					"hot_data_relocate_kthread");
+	if (IS_ERR(root->fs_info->hot_data_relocate_kthread))
+		kthread_stop(root->fs_info->hot_data_relocate_kthread);
+}
+
+/*
+ * placeholder for function to scan SSDs on startup with HOTDATAMOVE to bring
+ * access frequency structs into memory to allow that data to be eligible for
+ * relocation to spinning disk
+ */
+static inline void __do_ssd_scan(struct btrfs_device *device)
+{
+	return;
+}
+
+static int do_ssd_scan_kthread(void *arg)
+{
+	struct btrfs_root *root = arg;
+	struct btrfs_root *dev_root;
+	struct btrfs_device *device;
+	struct list_head *devices = &root->fs_info->fs_devices->devices;
+	int ret = 0;
+
+	mutex_lock(&root->fs_info->ssd_scan_kthread_mutex);
+
+	if (root->fs_info->sb->s_flags & MS_RDONLY) {
+		ret = -EROFS;
+		goto out;
+	}
+
+	dev_root = root->fs_info->dev_root;
+	mutex_lock(&dev_root->fs_info->volume_mutex);
+
+	list_for_each_entry(device, devices, dev_list) {
+		int device_rotating;
+		if (!device->writeable)
+			continue;
+
+		device_rotating =
+			!blk_queue_nonrot(bdev_get_queue(device->bdev));
+
+		if (!device_rotating)
+			__do_ssd_scan(device);
+
+		if (ret == -ENOSPC)
+			break;
+		BUG_ON(ret);
+
+	}
+	mutex_unlock(&dev_root->fs_info->volume_mutex);
+
+	do {
+		break;
+	} while (!kthread_should_stop());
+
+out:
+	mutex_unlock(&root->fs_info->ssd_scan_kthread_mutex);
+
+	return ret;
+}
+
+void init_ssd_scan_kthread(struct btrfs_root *root)
+{
+	root->fs_info->ssd_scan_kthread =
+					kthread_run(do_ssd_scan_kthread,
+					root,
+					"ssd_scan_kthread");
+	if (IS_ERR(root->fs_info->ssd_scan_kthread))
+		kthread_stop(root->fs_info->ssd_scan_kthread);
+}
+
+/* returns non-zero if any part of the range is on rotating disk */
+int btrfs_range_on_rotating(struct btrfs_root *root,
+			    struct hot_inode_item *hot_inode,
+			    u64 start, u64 len)
+{
+	struct inode *inode;
+	struct btrfs_key key;
+	struct extent_map *em = NULL;
+	struct btrfs_multi_bio *multi_ret = NULL;
+	struct btrfs_inode *btrfs_inode;
+	struct btrfs_bio_stripe *bio_stripe;
+	struct btrfs_multi_bio *multi_bio;
+	struct block_device *bdev;
+	int rotating = 0;
+	int ret_val = 0;
+	u64 length = 0;
+	u64 pos = 0, pos2 = 0;
+	int new = 0;
+	int i;
+	unsigned long inode_size = 0;
+
+	spin_lock(&hot_inode->lock);
+	key.objectid = hot_inode->i_ino;
+	spin_unlock(&hot_inode->lock);
+
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+	inode = btrfs_iget(root->fs_info->sb, &key, root, &new);
+
+	if (IS_ERR(inode)) {
+		ret_val = -ENOENT;
+		goto out;
+	} else if (is_bad_inode(inode)) {
+		iput(inode);
+		ret_val = -ENOENT;
+		goto out;
+	}
+
+	btrfs_inode = BTRFS_I(inode);
+	inode_size = (unsigned long) i_size_read(inode);
+
+	if (start >= inode_size) {
+		iput(inode);
+		ret_val = -ENOENT;
+		goto out;
+	}
+
+	if (len == (u64) -1 || start + len > inode_size)
+		len = inode_size - start;
+	else
+		len = start + len;
+
+	for (pos = start; pos < len - 1; pos += length) {
+		em = btrfs_get_extent(inode, NULL, 0, pos, pos + 1, 0);
+
+		length = em->block_len;
+
+		/* Location of delayed allocation and inline extents
+		 * can't be determined */
+		if (em->block_start == EXTENT_MAP_INLINE ||
+			em->block_start == EXTENT_MAP_DELALLOC ||
+			em->block_start == EXTENT_MAP_HOLE) {
+			ret_val = -1;
+			iput(inode);
+			goto out_free_em;
+		}
+
+		for (pos2 = 0; pos2 < em->block_len; pos2 += length) {
+			btrfs_map_block((struct btrfs_mapping_tree *)
+				&root->fs_info->mapping_tree, READ,
+				em->block_start + pos2,
+				&length, &multi_ret, 0);
+
+			multi_bio = multi_ret;
+
+			/* Each range may have more than one stripe */
+			for (i = 0; i < multi_bio->num_stripes; i++) {
+				bio_stripe = &multi_bio->stripes[i];
+				bdev  = bio_stripe->dev->bdev;
+				if (!blk_queue_nonrot(bdev_get_queue(bdev)))
+					rotating = 1;
+			}
+		}
+		pos += em->block_len;
+		free_extent_map(em);
+	}
+
+	ret_val = rotating;
+	iput(inode);
+	goto out;
+
+out_free_em:
+	free_extent_map(em);
+out:
+	kfree(multi_ret);
+	return ret_val;
+}
+
+static int should_relocate_range(struct inode *inode, u64 start, u64 len,
+			       int thresh, u64 *last_len, u64 *skip,
+			       u64 *relocate_end)
+{
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_map *em = NULL;
+	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	int ret = 1;
+
+
+	if (thresh == 0)
+		thresh = 256 * 1024;
+
+	/*
+	 * make sure that once we start relocating an extent, we keep on
+	 * relocating it
+	 */
+	if (start < *relocate_end)
+		return 1;
+
+	*skip = 0;
+
+	/*
+	 * hopefully we have this extent in the tree already, try without
+	 * the full extent lock
+	 */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, start, len);
+	read_unlock(&em_tree->lock);
+
+	if (!em) {
+		/* get the big lock and read metadata off disk */
+		lock_extent(io_tree, start, start + len - 1, GFP_NOFS);
+		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
+		unlock_extent(io_tree, start, start + len - 1, GFP_NOFS);
+
+		if (IS_ERR(em))
+			return 0;
+	}
+
+	/* this will cover holes, and inline extents */
+	if (em->block_start >= EXTENT_MAP_LAST_BYTE)
+		ret = 0;
+
+	if (ret) {
+		*last_len += len;
+		*relocate_end = extent_map_end(em);
+	} else {
+		*last_len = 0;
+		*skip = extent_map_end(em);
+		*relocate_end = 0;
+	}
+
+	free_extent_map(em);
+	return ret;
+}
+
+/*
+ * take an inode and range args (sub-file range) and
+ * relocate to SSD or spinning disk based on past location.
+ *
+ * loads the range into the page cache and marks the pages dirty;
+ * the range args specify whether this should be
+ * flushed immediately, or whether the btrfs workers should
+ * flush later
+ *
+ * based on the defrag ioctl
+ */
+int btrfs_relocate_inode(unsigned long inode_num,
+			     struct btrfs_relocate_range_args *range,
+			     struct btrfs_root *root,
+			     int location)
+{
+	struct inode *inode;
+	struct extent_io_tree *io_tree;
+	struct btrfs_ordered_extent *ordered;
+	struct page *page;
+	struct btrfs_key key;
+	struct file_ra_state *ra;
+	unsigned long last_index;
+	unsigned long ra_pages = root->fs_info->bdi.ra_pages;
+	unsigned long total_read = 0;
+	u64 page_start;
+	u64 page_end;
+	u64 last_len = 0;
+	u64 skip = 0;
+	u64 relocate_end = 0;
+	unsigned long i;
+	int new = 0;
+	int ret;
+
+	key.objectid = inode_num;
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+
+	inode = btrfs_iget(root->fs_info->sb, &key, root, &new);
+	if (IS_ERR(inode)) {
+		ret = -ENOENT;
+		goto out;
+	} else if (is_bad_inode(inode)) {
+		iput(inode);
+		ret = -ENOENT;
+		goto out;
+	}
+
+	io_tree = &BTRFS_I(inode)->io_tree;
+
+	if (inode->i_size == 0)
+		return 0;
+
+	if (range->start + range->len > range->start) {
+		last_index = min_t(u64, inode->i_size - 1,
+			 range->start + range->len - 1) >> PAGE_CACHE_SHIFT;
+	} else {
+		last_index = (inode->i_size - 1) >> PAGE_CACHE_SHIFT;
+	}
+
+	i = range->start >> PAGE_CACHE_SHIFT;
+	ra  = kzalloc(sizeof(*ra), GFP_NOFS);
+
+	while (i <= last_index) {
+		if (!should_relocate_range(inode, (u64)i << PAGE_CACHE_SHIFT,
+					PAGE_CACHE_SIZE,
+					range->extent_thresh,
+					&last_len, &skip,
+					&relocate_end)) {
+			unsigned long next;
+			/*
+			 * the should_relocate function tells us how much to
+			 * skip;
+			 * bump our counter by the suggested amount
+			 */
+			next = (skip + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+			i = max(i + 1, next);
+			continue;
+		}
+
+		if (total_read % ra_pages == 0) {
+			btrfs_force_ra(inode->i_mapping, ra, NULL, i,
+			min(last_index, i + ra_pages - 1));
+		}
+		total_read++;
+		mutex_lock(&inode->i_mutex);
+		if (range->flags & BTRFS_RELOCATE_RANGE_COMPRESS)
+			BTRFS_I(inode)->force_compress = 1;
+
+		ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+		if (ret)
+			goto err_unlock;
+again:
+		if (inode->i_size == 0 ||
+		    i > ((inode->i_size - 1) >> PAGE_CACHE_SHIFT)) {
+			ret = 0;
+			goto err_reservations;
+		}
+
+		page = grab_cache_page(inode->i_mapping, i);
+		if (!page) {
+			ret = -ENOMEM;
+			goto err_reservations;
+		}
+
+		if (!PageUptodate(page)) {
+			btrfs_readpage(NULL, page);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EIO;
+				goto err_reservations;
+			}
+		}
+
+		if (page->mapping != inode->i_mapping) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto again;
+		}
+
+		wait_on_page_writeback(page);
+
+		if (PageDirty(page)) {
+			btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+			goto loop_unlock;
+		}
+
+		page_start = (u64)page->index << PAGE_CACHE_SHIFT;
+		page_end = page_start + PAGE_CACHE_SIZE - 1;
+		lock_extent(io_tree, page_start, page_end, GFP_NOFS);
+
+		ordered = btrfs_lookup_ordered_extent(inode, page_start);
+		if (ordered) {
+			unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+			unlock_page(page);
+			page_cache_release(page);
+			btrfs_start_ordered_extent(inode, ordered, 1);
+			btrfs_put_ordered_extent(ordered);
+			goto again;
+		}
+		set_page_extent_mapped(page);
+
+		/*
+		 * this makes sure page_mkwrite is called on the
+		 * page if it is dirtied again later
+		 */
+		clear_page_dirty_for_io(page);
+		clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				  page_end, EXTENT_DIRTY | EXTENT_DELALLOC |
+				  EXTENT_DO_ACCOUNTING, GFP_NOFS);
+
+		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
+
+		if (location == BTRFS_ON_NONROTATING) {
+			btrfs_set_extent_prefer_nonrotating(inode, page_start,
+							page_end, NULL);
+			clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				  page_end, EXTENT_PREFER_ROTATING, GFP_NOFS);
+		} else if (location == BTRFS_ON_ROTATING) {
+			btrfs_set_extent_prefer_rotating(inode, page_start,
+							page_end, NULL);
+			clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+				page_end, EXTENT_PREFER_NONROTATING, GFP_NOFS);
+		}
+
+		ClearPageChecked(page);
+		set_page_dirty(page);
+		unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+
+loop_unlock:
+		unlock_page(page);
+		page_cache_release(page);
+		mutex_unlock(&inode->i_mutex);
+
+		balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
+		i++;
+	}
+	kfree(ra);
+
+	if ((range->flags & BTRFS_RELOCATE_RANGE_START_IO))
+		filemap_flush(inode->i_mapping);
+
+	if ((range->flags & BTRFS_RELOCATE_RANGE_COMPRESS)) {
+		/* the filemap_flush will queue IO into the worker threads, but
+		 * we have to make sure the IO is actually started and that
+		 * ordered extents get created before we return
+		 */
+		atomic_inc(&root->fs_info->async_submit_draining);
+		while (atomic_read(&root->fs_info->nr_async_submits) ||
+		      atomic_read(&root->fs_info->async_delalloc_pages)) {
+			wait_event(root->fs_info->async_submit_wait,
+			   (atomic_read(&root->fs_info->
+					nr_async_submits) == 0 &&
+			    atomic_read(&root->fs_info->
+					async_delalloc_pages) == 0));
+		}
+		atomic_dec(&root->fs_info->async_submit_draining);
+
+		mutex_lock(&inode->i_mutex);
+		BTRFS_I(inode)->force_compress = 0;
+		mutex_unlock(&inode->i_mutex);
+	}
+
+	ret = 0;
+	goto put_inode;
+
+err_reservations:
+	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+err_unlock:
+	mutex_unlock(&inode->i_mutex);
+put_inode:
+	iput(inode);
+out:
+	return ret;
+}
+
diff --git a/fs/btrfs/hotdata_relocate.h b/fs/btrfs/hotdata_relocate.h
new file mode 100644
index 0000000..e3235d1
--- /dev/null
+++ b/fs/btrfs/hotdata_relocate.h
@@ -0,0 +1,73 @@
+/*
+ * fs/btrfs/hotdata_relocate.h
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __HOTDATARELOCATE__
+#define __HOTDATARELOCATE__
+
+#include "ctree.h"
+#include "hotdata_map.h"
+
+/* flags for the defrag range ioctl */
+#define BTRFS_RELOCATE_RANGE_COMPRESS 1
+#define BTRFS_RELOCATE_RANGE_START_IO 2
+
+/* where data is located */
+#define BTRFS_ON_ROTATING	0
+#define BTRFS_ON_NONROTATING	1
+#define BTRFS_ON_BOTH		2
+#define BTRFS_ON_UNKNOWN	3
+
+/* run relocation thread every X seconds */
+#define RELOCATE_TIME_DELAY 1
+/* maximum number of ranges to move in relocation thread run */
+#define RELOCATE_MAX_ITEMS 250
+
+struct btrfs_relocate_range_args {
+	/* start of the relocate operation */
+	u64 start;
+	/* number of bytes to relocate, use (u64)-1 to say all */
+	u64 len;
+	/*
+	 * flags for the operation, which can include turning
+	 * on compression for this one relocate
+	 */
+	u64 flags;
+	/*
+	 * Use 1 to say every single extent must be rewritten
+	 */
+	u32 extent_thresh;
+};
+
+struct btrfs_root;
+/*
+ * initialization of relocation kthread,
+ * called if hotdatamove mount option is passed
+ */
+void init_hot_data_relocate_kthread(struct btrfs_root *root);
+void init_ssd_scan_kthread(struct btrfs_root *root);
+/* returns non-zero if any part of the range is on rotating disk (HDD) */
+int btrfs_range_on_rotating(struct btrfs_root *root,
+	struct hot_inode_item *hot_inode, u64 start, u64 len);
+/* relocate inode range to spinning or ssd based on range args */
+int btrfs_relocate_inode(unsigned long inode_num,
+			     struct btrfs_relocate_range_args *range,
+			     struct btrfs_root *root,
+			     int location);
+#endif /* __HOTDATARELOCATE__ */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC v2 PATCH 4/6] Btrfs: Add debugfs interface for hot data stats
  2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
                   ` (2 preceding siblings ...)
  2010-08-12 22:22 ` [RFC v2 PATCH 3/6] Btrfs: Add hot data relocation facilities bchociej
@ 2010-08-12 22:22 ` bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 5/6] Btrfs: 3 new ioctls related to hot data features bchociej
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: bchociej @ 2010-08-12 22:22 UTC (permalink / raw)
  To: chris.mason, linux-btrfs
  Cc: linux-fsdevel, linux-kernel, cmm, bcchocie, mrlupfer, crscott,
	bchociej, mlupfer, conscott

From: Ben Chociej <bchociej@gmail.com>

Add a /sys/kernel/debug/btrfs_data/<device_name>/ directory for each
volume; the directory contains two files. The first, `inode_data',
contains the heat information for inodes that have been brought into
the hot data map structures. The second, `range_data', contains
similar information for sub-file ranges.
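
A minimal userspace reader for these files might look like the
following sketch (the "sda" path component is an example; the actual
directory name is derived from the mounted device):

	#include <stdio.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/sys/kernel/debug/btrfs_data/sda/inode_data", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);	/* one record per tracked inode */
		fclose(f);
		return 0;
	}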

Signed-off-by: Matt Lupfer <mlupfer@gmail.com>
Signed-off-by: Conor Scott <conscott@vt.edu>
Signed-off-by: Ben Chociej <bchociej@gmail.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/btrfs/debugfs.c |  532 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/debugfs.h |   89 +++++++++
 2 files changed, 621 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/debugfs.c
 create mode 100644 fs/btrfs/debugfs.h

diff --git a/fs/btrfs/debugfs.c b/fs/btrfs/debugfs.c
new file mode 100644
index 0000000..c11c0b6
--- /dev/null
+++ b/fs/btrfs/debugfs.c
@@ -0,0 +1,532 @@
+/*
+ * fs/btrfs/debugfs.c
+ *
+ * This file contains the code to interface with debugfs. The
+ * debugfs files output range- and file-level access frequency
+ * statistics for each mounted volume.
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/vmalloc.h>
+#include <linux/limits.h>
+#include "ctree.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "hotdata_relocate.h"
+#include "debugfs.h"
+
+static int copy_msg_to_log(struct debugfs_vol_data *data, char *msg, int len)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	uint new_log_alloc_size;
+	char *new_log;
+
+	if (len >= data->log_alloc_size - debugfs_log->len) {
+		/* Not enough room in the log buffer for the new message. */
+		/* Allocate a bigger buffer. */
+		new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE;
+		new_log = vmalloc(new_log_alloc_size);
+
+		if (new_log) {
+			memcpy(new_log, debugfs_log->str,
+				debugfs_log->len);
+			memset(new_log + debugfs_log->len, 0,
+				new_log_alloc_size - debugfs_log->len);
+			vfree(debugfs_log->str);
+			debugfs_log->str = new_log;
+			data->log_alloc_size = new_log_alloc_size;
+		} else {
+			WARN_ON(1);
+			if (data->log_alloc_size - debugfs_log->len) {
+				#define err_msg "No more memory!\n"
+				strlcpy(debugfs_log->str +
+					debugfs_log->len,
+					err_msg, data->log_alloc_size -
+					debugfs_log->len);
+				debugfs_log->len +=
+					min((typeof(debugfs_log->len))
+					sizeof(err_msg),
+					((typeof(debugfs_log->len))
+					data->log_alloc_size -
+					debugfs_log->len));
+			}
+			return 0;
+		}
+	}
+
+	memcpy(debugfs_log->str + debugfs_log->len,
+		data->log_work_buff, len);
+	debugfs_log->len += (unsigned long) len;
+
+	return len;
+}
+
+/* Returns the number of bytes written to the log. */
+static int debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	va_list args;
+	int len;
+
+	if (debugfs_log->str == NULL)
+		return -1;
+
+	spin_lock(&data->log_lock);
+
+	va_start(args, fmt);
+	len = vsnprintf(data->log_work_buff, sizeof(data->log_work_buff), fmt,
+		args);
+	va_end(args);
+
+	if (len >= sizeof(data->log_work_buff)) {
+		#define truncate_msg "The next message has been truncated.\n"
+		copy_msg_to_log(data, truncate_msg, sizeof(truncate_msg));
+	}
+
+	len = copy_msg_to_log(data, data->log_work_buff, len);
+	spin_unlock(&data->log_lock);
+
+	return len;
+}
+
+/* initialize a log corresponding to a btrfs volume */
+static int debugfs_log_init(struct debugfs_vol_data *data)
+{
+	int err = 0;
+	struct lstring *debugfs_log = data->debugfs_log;
+
+	spin_lock(&data->log_lock);
+	debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE);
+
+	if (debugfs_log->str) {
+		memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE);
+		data->log_alloc_size = INIT_LOG_ALLOC_SIZE;
+	} else {
+		err = -ENOMEM;
+	}
+
+	spin_unlock(&data->log_lock);
+	return err;
+}
+
+/* free a log corresponding to a btrfs volume */
+static void debugfs_log_exit(struct debugfs_vol_data *data)
+{
+	struct lstring *debugfs_log = data->debugfs_log;
+	spin_lock(&data->log_lock);
+	vfree(debugfs_log->str);
+	debugfs_log->str = NULL;
+	debugfs_log->len = 0;
+	spin_unlock(&data->log_lock);
+}
+
+/* fops to override for printing range data */
+static const struct file_operations btrfs_debugfs_range_fops = {
+	.read	= __btrfs_debugfs_range_read,
+	.open	= __btrfs_debugfs_open,
+};
+
+/* fops to override for printing inode data */
+static const struct file_operations btrfs_debugfs_inode_fops = {
+	.read	= __btrfs_debugfs_inode_read,
+	.open	= __btrfs_debugfs_open,
+};
+
+/* initialize debugfs for btrfs at module init */
+int btrfs_init_debugfs(void)
+{
+	debugfs_root_dentry = debugfs_create_dir(DEBUGFS_ROOT_NAME, NULL);
+	/* init the list of per-volume debugfs data */
+	INIT_LIST_HEAD(&debugfs_vol_data_list);
+	/* init the lock protecting that list */
+	spin_lock_init(&data_list_lock);
+	if (!debugfs_root_dentry)
+		goto debugfs_error;
+	return 0;
+
+debugfs_error:
+	return -EIO;
+}
+
+/*
+ * on each volume mount, initialize the debugfs dentries and associated
+ * structures (debugfs_vol_data and debugfs_log)
+ */
+int btrfs_init_debugfs_volume(const char *uuid, struct super_block *sb)
+{
+	struct dentry *debugfs_volume_entry = NULL;
+	struct dentry *debugfs_range_entry = NULL;
+	struct dentry *debugfs_inode_entry = NULL;
+	struct debugfs_vol_data *range_data = NULL;
+	struct debugfs_vol_data *inode_data = NULL;
+	size_t dev_name_length = strlen(uuid);
+	char dev[NAME_MAX];
+
+	if (!debugfs_root_dentry)
+		goto debugfs_error;
+
+	/* create debugfs folder for this volume by mounted dev name */
+	memcpy(dev, uuid + DEV_NAME_CHOP, dev_name_length -
+		DEV_NAME_CHOP + 1);
+	debugfs_volume_entry = debugfs_create_dir(dev, debugfs_root_dentry);
+
+	if (!debugfs_volume_entry)
+		goto debugfs_error;
+
+	/* malloc and initialize debugfs_vol_data for range_data */
+	range_data = kmalloc(sizeof(struct debugfs_vol_data),
+		GFP_KERNEL | GFP_NOFS);
+	memset(range_data, 0, sizeof(struct debugfs_vol_data));
+	range_data->debugfs_log = NULL;
+	range_data->sb = sb;
+	spin_lock_init(&range_data->log_lock);
+	range_data->log_alloc_size = 0;
+
+	/* malloc and initialize debugfs_vol_data for inode_data */
+	inode_data = kmalloc(sizeof(struct debugfs_vol_data),
+		GFP_KERNEL | GFP_NOFS);
+	memset(inode_data, 0, sizeof(struct debugfs_vol_data));
+	inode_data->debugfs_log = NULL;
+	inode_data->sb = sb;
+	spin_lock_init(&inode_data->log_lock);
+	inode_data->log_alloc_size = 0;
+
+	/*
+	 * add debugfs_vol_data for inode data and range data for
+	 * volume to list
+	 */
+	range_data->de = debugfs_volume_entry;
+	inode_data->de = debugfs_volume_entry;
+	spin_lock(&data_list_lock);
+	list_add(&range_data->node, &debugfs_vol_data_list);
+	list_add(&inode_data->node, &debugfs_vol_data_list);
+	spin_unlock(&data_list_lock);
+
+	/* create debugfs range_data file */
+	debugfs_range_entry = debugfs_create_file("range_data",
+			   S_IFREG | S_IRUSR | S_IWUSR |
+			   S_IRUGO,
+			   debugfs_volume_entry,
+			   (void *) range_data,
+			   &btrfs_debugfs_range_fops);
+	if (!debugfs_range_entry)
+		goto debugfs_error;
+
+	/* create debugfs inode_data file */
+	debugfs_inode_entry = debugfs_create_file("inode_data",
+			   S_IFREG | S_IRUSR | S_IWUSR |
+			   S_IRUGO,
+			   debugfs_volume_entry,
+			   (void *) inode_data,
+			   &btrfs_debugfs_inode_fops);
+
+	if (!debugfs_inode_entry)
+		goto debugfs_error;
+
+	return 0;
+
+debugfs_error:
+
+	kfree(range_data);
+	kfree(inode_data);
+
+	return -EIO;
+}
+
+/*
+ * find volume mounted (match by superblock) and remove
+ * debugfs dentry
+ */
+void btrfs_exit_debugfs_volume(struct super_block *sb)
+{
+	struct list_head *head;
+	struct list_head *pos;
+	struct debugfs_vol_data *data;
+	spin_lock(&data_list_lock);
+	head = &debugfs_vol_data_list;
+
+	/* must clean up memory associated with this superblock */
+	list_for_each(pos, head)
+	{
+		data = list_entry(pos, struct debugfs_vol_data, node);
+		if (data->sb == sb) {
+			list_del(pos);
+			debugfs_remove_recursive(data->de);
+			kfree(data);
+			data = NULL;
+			break;
+		}
+	}
+
+	spin_unlock(&data_list_lock);
+}
+
+/* clean up memory and remove dentries for debugfs */
+void btrfs_exit_debugfs(void)
+{
+	/* first iterate through debugfs_vol_data_list and free memory */
+	struct list_head *head;
+	struct list_head *pos;
+	struct list_head *cur;
+	struct debugfs_vol_data *data;
+
+	spin_lock(&data_list_lock);
+	head = &debugfs_vol_data_list;
+	list_for_each_safe(pos, cur, head) {
+		data = list_entry(pos, struct debugfs_vol_data, node);
+		if (data && pos != head)
+			kfree(data);
+	}
+	spin_unlock(&data_list_lock);
+
+	/* remove all debugfs entries recursively from the root */
+	debugfs_remove_recursive(debugfs_root_dentry);
+}
+
+/* debugfs open file override from fops table */
+static int __btrfs_debugfs_open(struct inode *inode, struct file *file)
+{
+	if (inode->i_private)
+		file->private_data = inode->i_private;
+
+	return 0;
+}
+
+/* debugfs read file override from fops table */
+static ssize_t __btrfs_debugfs_range_read(struct file *file, char __user *user,
+			     size_t count, loff_t *ppos)
+{
+	int err = 0;
+	struct super_block *sb;
+	struct btrfs_root *root;
+	struct btrfs_root *fs_root;
+	struct hot_inode_item *current_hot_inode;
+	struct debugfs_vol_data *data;
+	struct lstring *debugfs_log;
+	unsigned long inode_num;
+
+	data = (struct debugfs_vol_data *) file->private_data;
+	sb = data->sb;
+	root = btrfs_sb(sb);
+	fs_root = (struct btrfs_root *) root->fs_info->fs_root;
+
+	if (!data->debugfs_log) {
+		/* initialize debugfs log corresponding to this volume */
+		debugfs_log = kmalloc(sizeof(struct lstring),
+			GFP_KERNEL | GFP_NOFS);
+		debugfs_log->str = NULL;
+		debugfs_log->len = 0;
+		data->debugfs_log = debugfs_log;
+		debugfs_log_init(data);
+	}
+
+	if ((unsigned long) *ppos > 0) {
+		/* caller is continuing a previous read, don't walk tree */
+		if ((unsigned long) *ppos >= data->debugfs_log->len)
+			goto clean_up;
+
+		goto print_to_user;
+	}
+
+	/* walk the inode tree */
+	current_hot_inode = find_next_hot_inode(fs_root, 0);
+
+	while (current_hot_inode) {
+		/* walk ranges, print data to debugfs log */
+		__walk_range_tree(current_hot_inode, data, fs_root);
+		inode_num = current_hot_inode->i_ino;
+		free_hot_inode_item(current_hot_inode);
+		current_hot_inode = find_next_hot_inode(fs_root, inode_num+1);
+	}
+
+print_to_user:
+	if (data->debugfs_log->len) {
+		err = simple_read_from_buffer(user, count, ppos,
+				      data->debugfs_log->str,
+				      data->debugfs_log->len);
+	}
+
+	return err;
+
+clean_up:
+	/* Reader has finished the file, clean up */
+
+	debugfs_log_exit(data);
+	kfree(data->debugfs_log);
+	data->debugfs_log = NULL;
+
+	return 0;
+}
+
+/* debugfs read file override from fops table */
+static ssize_t __btrfs_debugfs_inode_read(struct file *file, char __user *user,
+			     size_t count, loff_t *ppos)
+{
+	int err = 0;
+	struct super_block *sb;
+	struct btrfs_root *root;
+	struct btrfs_root *fs_root;
+	struct hot_inode_item *current_hot_inode;
+	struct debugfs_vol_data *data;
+	struct lstring *debugfs_log;
+	unsigned long inode_num;
+
+	data = (struct debugfs_vol_data *) file->private_data;
+	sb = data->sb;
+	root = btrfs_sb(sb);
+	fs_root = (struct btrfs_root *) root->fs_info->fs_root;
+
+	if (!data->debugfs_log) {
+		/* initialize debugfs log corresponding to this volume */
+		debugfs_log = kmalloc(sizeof(struct lstring),
+			GFP_KERNEL | GFP_NOFS);
+		debugfs_log->str = NULL;
+		debugfs_log->len = 0;
+		data->debugfs_log = debugfs_log;
+		debugfs_log_init(data);
+	}
+
+	if ((unsigned long) *ppos > 0) {
+		/* caller is continuing a previous read, don't walk tree */
+		if ((unsigned long) *ppos >= data->debugfs_log->len)
+			goto clean_up;
+
+		goto print_to_user;
+	}
+
+	/* walk the inode tree */
+	current_hot_inode = find_next_hot_inode(fs_root, 0);
+
+	while (current_hot_inode) {
+		/* print inode frequency data to debugfs log */
+		__print_inode_freq_data(current_hot_inode, data, fs_root);
+		inode_num = current_hot_inode->i_ino;
+		free_hot_inode_item(current_hot_inode);
+		current_hot_inode = find_next_hot_inode(fs_root, inode_num+1);
+	}
+
+print_to_user:
+	if (data->debugfs_log->len) {
+		err = simple_read_from_buffer(user, count, ppos,
+				      data->debugfs_log->str,
+				      data->debugfs_log->len);
+	}
+
+	return err;
+
+clean_up:
+	/* reader has finished the file, clean up */
+	debugfs_log_exit(data);
+	kfree(data->debugfs_log);
+	data->debugfs_log = NULL;
+
+	return 0;
+}
+
+/*
+ * take the inode, find the ranges associated with it,
+ * and print each range's data struct
+ */
+static void __walk_range_tree(struct hot_inode_item *hot_inode,
+		       struct debugfs_vol_data *data,
+		       struct btrfs_root *fs_root)
+{
+	struct hot_range_tree *inode_range_tree;
+	struct rb_node *node;
+	struct hot_range_item *current_range;
+
+	inode_range_tree = &hot_inode->hot_range_tree;
+	read_lock(&inode_range_tree->lock);
+	node = rb_first(&inode_range_tree->map);
+
+	/* Walk the hot_range_tree for inode */
+	while (node) {
+		current_range = rb_entry(node, struct hot_range_item, rb_node);
+		__print_range_freq_data(hot_inode, current_range, data,
+			fs_root);
+		node = rb_next(node);
+	}
+	read_unlock(&inode_range_tree->lock);
+}
+
+/* Print frequency data for each range to log */
+static void __print_range_freq_data(struct hot_inode_item *hot_inode,
+			     struct hot_range_item *hot_range,
+			     struct debugfs_vol_data *data,
+			     struct btrfs_root *fs_root)
+{
+	struct btrfs_freq_data *freq_data;
+	u64 start;
+	u64 len;
+	int on_rotating;
+
+	freq_data = &hot_range->freq_data;
+
+	spin_lock(&hot_range->lock);
+	start = hot_range->start;
+	len = hot_range->len;
+	spin_unlock(&hot_range->lock);
+
+	on_rotating = btrfs_range_on_rotating(fs_root, hot_inode, start,
+						  len);
+	/* Always lock hot_inode_item first */
+	spin_lock(&hot_inode->lock);
+	spin_lock(&hot_range->lock);
+	debugfs_log(data, "inode #%lu, range start "
+			"%llu (range len %llu) reads %u, writes %u, "
+			"avg read time %llu, avg write time %llu, temp %u, "
+			"on_rotating %d\n",
+			hot_inode->i_ino,
+			hot_range->start,
+			hot_range->len,
+			freq_data->nr_reads,
+			freq_data->nr_writes,
+			freq_data->avg_delta_reads,
+			freq_data->avg_delta_writes,
+			freq_data->last_temp,
+			on_rotating);
+	spin_unlock(&hot_range->lock);
+	spin_unlock(&hot_inode->lock);
+}
+
+/* Print an inode's frequency data to the log */
+static void __print_inode_freq_data(struct hot_inode_item *hot_inode,
+			     struct debugfs_vol_data *data,
+			     struct btrfs_root *fs_root)
+{
+	struct btrfs_freq_data *freq_data = &hot_inode->freq_data;
+	int on_rotating = btrfs_range_on_rotating(fs_root, hot_inode, 0,
+						  (u64)-1);
+
+	spin_lock(&hot_inode->lock);
+	debugfs_log(data, "inode #%lu, reads %u, writes %u, "
+			  "avg read time %llu, avg write time %llu, temp %u, "
+			  "on_rotating %d\n",
+			hot_inode->i_ino,
+			freq_data->nr_reads,
+			freq_data->nr_writes,
+			freq_data->avg_delta_reads,
+			freq_data->avg_delta_writes,
+			freq_data->last_temp,
+			on_rotating);
+	spin_unlock(&hot_inode->lock);
+}
diff --git a/fs/btrfs/debugfs.h b/fs/btrfs/debugfs.h
new file mode 100644
index 0000000..492ff8f
--- /dev/null
+++ b/fs/btrfs/debugfs.h
@@ -0,0 +1,89 @@
+/*
+ * fs/btrfs/debugfs.h
+ *
+ * Copyright (C) 2010 International Business Machines Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEBUGFS__
+#define __BTRFS_DEBUGFS__
+
+/* size of log to vmalloc */
+#define INIT_LOG_ALLOC_SIZE (PAGE_SIZE * 10)
+#define LOG_PAGE_SIZE (PAGE_SIZE * 10)
+
+/*
+ * number of chars to chop off the device name when making the debugfs folder
+ * e.g. /dev/sda -> sda
+ *
+ * TODO: use something better for this
+ */
+#define DEV_NAME_CHOP 5
+
+/* list to keep track of each mounted volume's debugfs_vol_data */
+static struct list_head debugfs_vol_data_list;
+
+/* lock for debugfs_vol_data_list */
+static spinlock_t data_list_lock;
+
+/*
+ * Name for BTRFS data in debugfs directory
+ * e.g. /sys/kernel/debug/btrfs_data
+ */
+#define DEBUGFS_ROOT_NAME "btrfs_data"
+
+/* pointer to top level debugfs dentry */
+static struct dentry *debugfs_root_dentry;
+
+/* log to output to userspace in debugfs files */
+struct lstring {
+	char		*str;
+	unsigned long	len;
+};
+
+/* debugfs_vol_data holds the per-volume items passed to the debugfs files */
+struct debugfs_vol_data {
+	struct list_head node; /* protected by data_list_lock */
+	struct lstring *debugfs_log;
+	struct super_block *sb;
+	struct dentry *de;
+	spinlock_t log_lock; /* protects debugfs_log */
+	char log_work_buff[1024];
+	uint log_alloc_size;
+};
+
+static ssize_t __btrfs_debugfs_range_read(struct file *file, char __user *user,
+	size_t size, loff_t *len);
+
+static ssize_t __btrfs_debugfs_inode_read(struct file *file, char __user *user,
+	size_t size, loff_t *len);
+
+static int __btrfs_debugfs_open(struct inode *inode, struct file *file);
+
+static void __walk_range_tree(struct hot_inode_item *hot_inode,
+			struct debugfs_vol_data *data,
+			struct btrfs_root *root);
+
+static void __print_range_freq_data(struct hot_inode_item *hot_inode,
+		       struct hot_range_item *hot_range,
+		       struct debugfs_vol_data *data,
+		       struct btrfs_root *root);
+
+static void __print_inode_freq_data(struct hot_inode_item *hot_inode,
+		       struct debugfs_vol_data *data,
+		       struct btrfs_root *root);
+
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC v2 PATCH 5/6] Btrfs: 3 new ioctls related to hot data features
  2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
                   ` (3 preceding siblings ...)
  2010-08-12 22:22 ` [RFC v2 PATCH 4/6] Btrfs: Add debugfs interface for hot data stats bchociej
@ 2010-08-12 22:22 ` bchociej
  2010-08-12 22:22 ` [RFC v2 PATCH 6/6] Btrfs: Add hooks to enable hot data tracking bchociej
  2010-08-26  2:13 ` [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality Shaohua Li
  6 siblings, 0 replies; 12+ messages in thread
From: bchociej @ 2010-08-12 22:22 UTC (permalink / raw)
  To: chris.mason, linux-btrfs
  Cc: linux-fsdevel, linux-kernel, cmm, bcchocie, mrlupfer, crscott,
	bchociej, mlupfer, conscott

From: Ben Chociej <bchociej@gmail.com>

BTRFS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in btrfs_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally, retrieve
the temperature from the hot data hash list instead of recalculating it.

BTRFS_IOC_GET_HEAT_OPTS: return an integer representing the current
state of hot data tracking and migration:

0 = do nothing
1 = track frequency of access
2 = migrate data to fast media based on temperature (not implemented)

BTRFS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and
migration, as described above.
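
A sketch of a userspace caller follows; it assumes the struct and ioctl
definitions from ioctl.h below are visible to userspace, and the paths
are examples. Error handling is abbreviated:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include "ioctl.h"	/* btrfs_ioctl_heat_info, BTRFS_IOC_* */

	int main(void)
	{
		struct btrfs_ioctl_heat_info hi;
		int state = 1;	/* 1 = track frequency of access */
		int fd = open("/mnt/btrfs/somefile", O_RDONLY);

		if (fd < 0)
			return 1;

		memset(&hi, 0, sizeof(hi));
		strcpy(hi.filename, "/mnt/btrfs/somefile");
		hi.live = 1;	/* recalculate temperature now */

		if (ioctl(fd, BTRFS_IOC_GET_HEAT_INFO, &hi) == 0)
			printf("temp %d, reads %u, writes %u\n",
			       hi.temperature, hi.num_reads, hi.num_writes);

		/* per-inode tracking state applies to the opened file */
		ioctl(fd, BTRFS_IOC_SET_HEAT_OPTS, &state);
		close(fd);
		return 0;
	}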

Signed-off-by: Ben Chociej <bchociej@gmail.com>
Signed-off-by: Matt Lupfer <mlupfer@gmail.com>
Signed-off-by: Conor Scott <conscott@vt.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/btrfs/ioctl.c |  142 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.h |   23 +++++++++
 2 files changed, 164 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4dbaf89..88cd0e7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -49,6 +49,8 @@
 #include "print-tree.h"
 #include "volumes.h"
 #include "locking.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
 
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -1869,7 +1871,7 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
 	return 0;
 }
 
-long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
+static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
 {
 	struct btrfs_ioctl_space_args space_args;
 	struct btrfs_ioctl_space_info space;
@@ -1974,6 +1976,138 @@ long btrfs_ioctl_trans_end(struct file *file)
 	return 0;
 }
 
+/*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by heat_info->live.
+ */
+static long btrfs_ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *mnt_inode = fdentry(file)->d_inode;
+	struct inode *file_inode;
+	struct file *file_filp;
+	struct btrfs_root *root = BTRFS_I(mnt_inode)->root;
+	struct btrfs_ioctl_heat_info *heat_info;
+	struct hot_inode_tree *hitree;
+	struct hot_inode_item *he;
+	int ret;
+
+	heat_info = kmalloc(sizeof(struct btrfs_ioctl_heat_info),
+			GFP_KERNEL | GFP_NOFS);
+
+	if (copy_from_user((void *) heat_info,
+			  argp,
+			  sizeof(struct btrfs_ioctl_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	file_filp = filp_open(heat_info->filename, O_RDONLY, 0);
+	file_inode = file_filp->f_dentry->d_inode;
+	filp_close(file_filp, NULL);
+
+	hitree = &root->hot_inode_tree;
+	read_lock(&hitree->lock);
+	he = lookup_hot_inode_item(hitree, file_inode->i_ino);
+	read_unlock(&hitree->lock);
+
+	if (!he || IS_ERR(he)) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	spin_lock(&he->lock);
+
+	heat_info->avg_delta_reads =
+		(__u64) he->freq_data.avg_delta_reads;
+	heat_info->avg_delta_writes =
+		(__u64) he->freq_data.avg_delta_writes;
+	heat_info->last_read_time =
+		(__u64) timespec_to_ns(&he->freq_data.last_read_time);
+	heat_info->last_write_time =
+		(__u64) timespec_to_ns(&he->freq_data.last_write_time);
+	heat_info->num_reads =
+		(__u32) he->freq_data.nr_reads;
+	heat_info->num_writes =
+		(__u32) he->freq_data.nr_writes;
+
+	if (heat_info->live > 0) {
+		/* got a request for live temperature,
+		 * call btrfs_get_temp to recalculate */
+		heat_info->temperature = btrfs_get_temp(&he->freq_data);
+	} else {
+		/* not live temperature, get it from the hashlist */
+		read_lock(&he->heat_node->hlist->rwlock);
+		heat_info->temperature = he->heat_node->hlist->temperature;
+		read_unlock(&he->heat_node->hlist->rwlock);
+	}
+
+	spin_unlock(&he->lock);
+	free_hot_inode_item(he);
+
+	if (copy_to_user(argp, (void *) heat_info,
+		     sizeof(struct btrfs_ioctl_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	kfree(heat_info);
+	return 0;
+
+err:
+	kfree(heat_info);
+	return ret;
+}
+
+static long btrfs_ioctl_heat_opts(struct file *file, void __user *argp, int set)
+{
+	struct inode *inode = fdentry(file)->d_inode;
+	int arg, ret = 0;
+
+	if (!set) {
+		arg =	((BTRFS_I(inode)->flags &
+				BTRFS_INODE_NO_HOTDATA_TRACK) ?
+			0 : 1) +
+			((BTRFS_I(inode)->flags & BTRFS_INODE_NO_HOTDATA_MOVE) ?
+			0 : 1);
+
+		if (copy_to_user(argp, (void *) &arg, sizeof(int)) != 0)
+			ret = -EFAULT;
+	} else if (copy_from_user((void *) &arg, argp, sizeof(int)) != 0)
+		ret = -EFAULT;
+	else
+		switch (arg) {
+		case 0: /* track nothing, move nothing */
+			/* set both flags */
+			BTRFS_I(inode)->flags |=
+				BTRFS_INODE_NO_HOTDATA_TRACK |
+				BTRFS_INODE_NO_HOTDATA_MOVE;
+			break;
+		case 1: /* do tracking, don't move anything */
+			/* clear NO_HOTDATA_TRACK, set NO_HOTDATA_MOVE */
+			BTRFS_I(inode)->flags &= ~BTRFS_INODE_NO_HOTDATA_TRACK;
+			BTRFS_I(inode)->flags |= BTRFS_INODE_NO_HOTDATA_MOVE;
+			break;
+		case 2: /* track and move */
+			/* clear both flags */
+			BTRFS_I(inode)->flags &=
+				~(BTRFS_INODE_NO_HOTDATA_TRACK |
+				  BTRFS_INODE_NO_HOTDATA_MOVE);
+			break;
+		default:
+			ret = -EINVAL;
+		}
+
+	return ret;
+}
+
+
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -2021,6 +2155,12 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_ino_lookup(file, argp);
 	case BTRFS_IOC_SPACE_INFO:
 		return btrfs_ioctl_space_info(root, argp);
+	case BTRFS_IOC_GET_HEAT_INFO:
+		return btrfs_ioctl_heat_info(file, argp);
+	case BTRFS_IOC_SET_HEAT_OPTS:
+		return btrfs_ioctl_heat_opts(file, argp, 1);
+	case BTRFS_IOC_GET_HEAT_OPTS:
+		return btrfs_ioctl_heat_opts(file, argp, 0);
 	case BTRFS_IOC_SYNC:
 		btrfs_sync_fs(file->f_dentry->d_sb, 1);
 		return 0;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 424694a..7bc2fd4 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -138,6 +138,18 @@ struct btrfs_ioctl_space_args {
 	struct btrfs_ioctl_space_info spaces[0];
 };
 
+struct btrfs_ioctl_heat_info {
+	__u64 avg_delta_reads;
+	__u64 avg_delta_writes;
+	__u64 last_read_time;
+	__u64 last_write_time;
+	__u32 num_reads;
+	__u32 num_writes;
+	char filename[BTRFS_PATH_NAME_MAX + 1];
+	int temperature;
+	__u8 live;
+};
+
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
 				   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
@@ -178,4 +190,15 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64)
 #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
 				    struct btrfs_ioctl_space_args)
+
+/*
+ * Hot data tracking ioctls:
+ * GET_HEAT_INFO - retrieve access frequency info and temperature
+ * GET/SET_HEAT_OPTS - query or set per-inode tracking/migration state
+ */
+#define BTRFS_IOC_GET_HEAT_INFO _IOWR(BTRFS_IOCTL_MAGIC, 21,	\
+				struct btrfs_ioctl_heat_info)
+#define BTRFS_IOC_SET_HEAT_OPTS _IOW(BTRFS_IOCTL_MAGIC, 22, int)
+#define BTRFS_IOC_GET_HEAT_OPTS _IOR(BTRFS_IOCTL_MAGIC, 23, int)
+
 #endif
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC v2 PATCH 6/6] Btrfs: Add hooks to enable hot data tracking
  2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
                   ` (4 preceding siblings ...)
  2010-08-12 22:22 ` [RFC v2 PATCH 5/6] Btrfs: 3 new ioctls related to hot data features bchociej
@ 2010-08-12 22:22 ` bchociej
  2010-08-26  2:13 ` [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality Shaohua Li
  6 siblings, 0 replies; 12+ messages in thread
From: bchociej @ 2010-08-12 22:22 UTC (permalink / raw)
  To: chris.mason, linux-btrfs
  Cc: linux-fsdevel, linux-kernel, cmm, bcchocie, mrlupfer, crscott,
	bchociej, mlupfer, conscott

From: Ben Chociej <bchociej@gmail.com>

Miscellaneous features that implement hot data tracking, enable hot data
migration to faster media, and generally make the hot data functions a
bit more friendly.

ctree.h: Adds the root hot_inode_tree and heat hashlists. Defines
mount options and inode flags for turning all of the hot data
functionality on and off globally and per file, plus guard macros
that enforce them (see the guarded-hook sketch after this paragraph).
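
For illustration, a hook guarded by that macro reduces to something
like this sketch (track_io() is a hypothetical wrapper; the real hooks
live in the files listed below and call btrfs_update_freqs from
hotdata_map.h directly):

	static void track_io(struct btrfs_root *root, struct inode *inode,
			     u64 start, u64 len, int create)
	{
		/* no-op unless mounted with hot data tracking enabled */
		if (BTRFS_TRACKING_HOT_DATA(root))
			btrfs_update_freqs(inode, start, len, create);
	}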

disk-io.c: Initialization and freeing of various structures.

extent-tree.c: Adds the block group types that hold relocated SSD data
and SSD metadata.

extent_io.c: Adds a hook into extent_write_cache_pages to enable hot
data tracking and migration, plus miscellaneous code to set some
extent flags for migration/relocation.

inode.c: Adds hooks into btrfs_direct_IO, btrfs_fiemap,
btrfs_writepage(s), and btrfs_readpages to enable hot data tracking
and relocation functionality.

super.c: Implements the aforementioned mount options and does various
initialization and freeing.

volumes.c: Changes the allocator to direct hot data onto SSD and cold
data to spinning disk.

Signed-off-by: Ben Chociej <bchociej@gmail.com>
Signed-off-by: Matt Lupfer <mlupfer@gmail.com>
Signed-off-by: Conor Scott <conscott@vt.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/btrfs/Makefile      |    3 +-
 fs/btrfs/ctree.h       |   96 ++++++++++++++++++++++++++++
 fs/btrfs/disk-io.c     |   28 ++++++++
 fs/btrfs/extent-tree.c |   60 ++++++++++++++++--
 fs/btrfs/extent_io.c   |   34 ++++++++++
 fs/btrfs/extent_io.h   |    7 ++
 fs/btrfs/inode.c       |  162 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/super.c       |   62 +++++++++++++++++-
 fs/btrfs/volumes.c     |   38 ++++++++++-
 9 files changed, 473 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index a35eb36..46a4613 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,4 +7,5 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o \
-	   compression.o delayed-ref.o relocation.o
+	   compression.o delayed-ref.o relocation.o debugfs.o hotdata_map.o \
+	   hotdata_hash.o hotdata_relocate.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e9bf864..20d6351 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -31,6 +31,8 @@
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
 
 struct btrfs_trans_handle;
 struct btrfs_transaction;
@@ -664,6 +666,17 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID1    (1 << 4)
 #define BTRFS_BLOCK_GROUP_DUP	   (1 << 5)
 #define BTRFS_BLOCK_GROUP_RAID10   (1 << 6)
+/*
+ * New block groups for use with the hot data relocation feature.  When hot
+ * data relocation is on, *_SSD block groups are forced to nonrotating drives
+ * and the plain DATA and METADATA block groups are forced to rotating drives.
+ *
+ * This should be further optimized, e.g. force metadata to SSD or relocate
+ * inode metadata to SSD when any of its subfile ranges are relocated to SSD
+ * so that reads and writes aren't delayed by HDD seeks.
+ */
+#define BTRFS_BLOCK_GROUP_DATA_SSD (1 << 7)
+#define BTRFS_BLOCK_GROUP_METADATA_SSD (1 << 8)
 #define BTRFS_NR_RAID_TYPES	   5
 
 struct btrfs_block_group_item {
@@ -877,6 +890,22 @@ struct btrfs_fs_info {
 	struct mutex cleaner_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
+
+	/* protects hot data items while being iterated and updated */
+	struct mutex hot_data_update_kthread_mutex;
+
+	/*
+	 * protects heat hash list while iterating through it for hot data
+	 * relocation operations
+	 */
+	struct mutex hot_data_relocate_kthread_mutex;
+
+	/*
+	 * will eventually protect ssd scan operations that bring previously
+	 * hot inode and range items into memory after a mount
+	 */
+	struct mutex ssd_scan_kthread_mutex;
+
 	/*
 	 * this protects the ordered operations list only while we are
 	 * processing all of the entries on it.  This way we make
@@ -950,6 +979,13 @@ struct btrfs_fs_info {
 	struct btrfs_workers endio_meta_write_workers;
 	struct btrfs_workers endio_write_workers;
 	struct btrfs_workers submit_workers;
+
+	/*
+	 * Workers to update hot_data_hash and relocate data
+	 */
+	struct btrfs_workers hot_data_update_workers;
+	struct btrfs_workers hot_data_relocate_workers;
+
 	/*
 	 * fixup workers take dirty pages that didn't properly go through
 	 * the cow mechanism and make them safe to write.  It happens
@@ -958,6 +994,10 @@ struct btrfs_fs_info {
 	struct btrfs_workers fixup_workers;
 	struct task_struct *transaction_kthread;
 	struct task_struct *cleaner_kthread;
+	struct task_struct *hot_data_update_kthread;
+	struct task_struct *hot_data_relocate_kthread;
+	struct task_struct *ssd_scan_kthread;
+
 	int thread_pool_size;
 
 	struct kobject super_kobj;
@@ -1009,6 +1049,9 @@ struct btrfs_fs_info {
 	unsigned data_chunk_allocations;
 	unsigned metadata_ratio;
 
+	unsigned data_ssd_chunk_allocations;
+	unsigned metadata_ssd_ratio;
+
 	void *bdev_holder;
 };
 
@@ -1092,6 +1135,20 @@ struct btrfs_root {
 	/* red-black tree that keeps track of in-memory inodes */
 	struct rb_root inode_tree;
 
+	/* red-black tree that keeps track of fs-wide hot data */
+	struct hot_inode_tree hot_inode_tree;
+
+	/* hash map of inode temperature */
+	struct heat_hashlist_entry heat_inode_hl[HEAT_HASH_SIZE];
+
+	/* hash map of range temperature */
+	struct heat_hashlist_entry heat_range_hl[HEAT_HASH_SIZE];
+
+	int heat_threshold;
+
+	struct btrfs_work work_inode;
+
+	struct btrfs_work work_range;
 	/*
 	 * right now this just gets used so that a root has its own devid
 	 * for stat.  It may be used for more later
@@ -1192,6 +1249,12 @@ struct btrfs_root {
 #define BTRFS_MOUNT_NOSSD		(1 << 9)
 #define BTRFS_MOUNT_DISCARD		(1 << 10)
 #define BTRFS_MOUNT_FORCE_COMPRESS      (1 << 11)
+/*
+ * for activating hot data tracking and relocation.
+ * always ensure that HOTDATA_MOVE implies HOTDATA_TRACK.
+ */
+#define BTRFS_MOUNT_HOTDATA_TRACK	(1 << 12)
+#define BTRFS_MOUNT_HOTDATA_MOVE		(1 << 13)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
@@ -1211,6 +1274,28 @@ struct btrfs_root {
 #define BTRFS_INODE_NODUMP		(1 << 8)
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
+/*
+ * same as mount flags, but these turn off tracking/relocation when set
+ * to 1. (not implemented)
+ */
+#define BTRFS_INODE_NO_HOTDATA_TRACK	(1 << 11)
+#define BTRFS_INODE_NO_HOTDATA_MOVE	(1 << 12)
+
+/* Hot data tracking and relocation -- guard macros */
+#define BTRFS_TRACKING_HOT_DATA(btrfs_root)				\
+(btrfs_test_opt(btrfs_root, HOTDATA_TRACK))
+
+#define BTRFS_MOVING_HOT_DATA(btrfs_root)				\
+((btrfs_test_opt(btrfs_root, HOTDATA_MOVE)) &&				\
+!(btrfs_root->fs_info->sb->s_flags & MS_RDONLY))
+
+#define BTRFS_TRACK_THIS_INODE(btrfs_inode)				\
+((BTRFS_TRACKING_HOT_DATA(btrfs_inode->root)) &&			\
+!(btrfs_inode->flags & BTRFS_INODE_NO_HOTDATA_TRACK))
+
+#define BTRFS_MOVE_THIS_INODE(btrfs_inode)				\
+((BTRFS_MOVING_HOT_DATA(btrfs_inode->root)) &&				\
+!(btrfs_inode->flags & BTRFS_INODE_NO_HOTDATA_MOVE))
 
 /* some macros to generate set/get funcs for the struct fields.  This
  * assumes there is a lefoo_to_cpu for every type, so lets make a simple
@@ -2376,6 +2461,10 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
 int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 			      struct extent_state **cached_state);
+int btrfs_set_extent_prefer_nonrotating(struct inode *inode, u64 start, u64 end,
+			      struct extent_state **cached_state);
+int btrfs_set_extent_prefer_rotating(struct inode *inode, u64 start, u64 end,
+			      struct extent_state **cached_state);
 int btrfs_writepages(struct address_space *mapping,
 		     struct writeback_control *wbc);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
@@ -2457,6 +2546,13 @@ int btrfs_sysfs_add_root(struct btrfs_root *root);
 void btrfs_sysfs_del_root(struct btrfs_root *root);
 void btrfs_sysfs_del_super(struct btrfs_fs_info *root);
 
+
+/* debugfs.c */
+int btrfs_init_debugfs(void);
+void btrfs_exit_debugfs(void);
+int btrfs_init_debugfs_volume(const char *, struct super_block *);
+void btrfs_exit_debugfs_volume(struct super_block *);
+
 /* xattr.c */
 ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 34f7c37..1758fa6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -39,6 +39,7 @@
 #include "locking.h"
 #include "tree-log.h"
 #include "free-space-cache.h"
+#include "hotdata_hash.h"
 
 static struct extent_io_ops btree_extent_io_ops;
 static void end_workqueue_fn(struct btrfs_work *work);
@@ -898,6 +899,8 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 			struct btrfs_fs_info *fs_info,
 			u64 objectid)
 {
+	int i;
+
 	root->node = NULL;
 	root->commit_root = NULL;
 	root->sectorsize = sectorsize;
@@ -917,6 +920,7 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->name = NULL;
 	root->in_sysfs = 0;
 	root->inode_tree = RB_ROOT;
+	hot_inode_tree_init(&root->hot_inode_tree);
 	root->block_rsv = NULL;
 	root->orphan_block_rsv = NULL;
 
@@ -938,6 +942,7 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
+	root->heat_threshold = HEAT_INITIAL_THRESH;
 	extent_io_tree_init(&root->dirty_log_pages,
 			     fs_info->btree_inode->i_mapping, GFP_NOFS);
 
@@ -945,6 +950,19 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	memset(&root->root_item, 0, sizeof(root->root_item));
 	memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
 	memset(&root->root_kobj, 0, sizeof(root->root_kobj));
+	memset(&root->heat_inode_hl, 0, sizeof(root->heat_inode_hl));
+	memset(&root->heat_range_hl, 0, sizeof(root->heat_range_hl));
+	for (i = 0; i < HEAT_HASH_SIZE; i++) {
+		INIT_HLIST_HEAD(&root->heat_inode_hl[i].hashhead);
+		INIT_HLIST_HEAD(&root->heat_range_hl[i].hashhead);
+
+		rwlock_init(&root->heat_inode_hl[i].rwlock);
+		rwlock_init(&root->heat_range_hl[i].rwlock);
+
+		root->heat_inode_hl[i].temperature = i;
+		root->heat_range_hl[i].temperature = i;
+	}
+
 	root->defrag_trans_start = fs_info->generation;
 	init_completion(&root->kobj_unregister);
 	root->defrag_running = 0;
@@ -1671,6 +1689,9 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
 	mutex_init(&fs_info->volume_mutex);
+	mutex_init(&fs_info->hot_data_update_kthread_mutex);
+	mutex_init(&fs_info->hot_data_relocate_kthread_mutex);
+	mutex_init(&fs_info->ssd_scan_kthread_mutex);
 	init_rwsem(&fs_info->extent_commit_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
@@ -2324,6 +2345,9 @@ static void free_fs_root(struct btrfs_root *root)
 		down_write(&root->anon_super.s_umount);
 		kill_anon_super(&root->anon_super);
 	}
+
+	free_heat_hashlists(root);
+	free_hot_inode_tree(root);
 	free_extent_buffer(root->node);
 	free_extent_buffer(root->commit_root);
 	kfree(root->name);
@@ -2429,6 +2453,10 @@ int close_ctree(struct btrfs_root *root)
 
 	kthread_stop(root->fs_info->transaction_kthread);
 	kthread_stop(root->fs_info->cleaner_kthread);
+	if (btrfs_test_opt(root, HOTDATA_TRACK)) {
+		kthread_stop(root->fs_info->hot_data_update_kthread);
+		kthread_stop(root->fs_info->hot_data_relocate_kthread);
+	}
 
 	fs_info->closing = 2;
 	smp_mb();
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a46b64d..642a946 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -505,7 +505,8 @@ static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info,
 	struct btrfs_space_info *found;
 
 	flags &= BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM |
-		 BTRFS_BLOCK_GROUP_METADATA;
+		 BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA_SSD |
+		 BTRFS_BLOCK_GROUP_METADATA_SSD;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(found, head, list) {
@@ -2780,7 +2781,9 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	spin_lock_init(&found->lock);
 	found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
 				BTRFS_BLOCK_GROUP_SYSTEM |
-				BTRFS_BLOCK_GROUP_METADATA);
+				BTRFS_BLOCK_GROUP_METADATA |
+				BTRFS_BLOCK_GROUP_DATA_SSD |
+				BTRFS_BLOCK_GROUP_METADATA_SSD);
 	found->total_bytes = total_bytes;
 	found->bytes_used = bytes_used;
 	found->disk_used = bytes_used * factor;
@@ -2854,12 +2857,21 @@ static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
 	return btrfs_reduce_alloc_profile(root, flags);
 }
 
+/*
+ * Turns a chunk_type integer into a set of block group flags (a profile).
+ * Hot data relocation code adds chunk_types 2 and 3 for hot data specific
+ * block group types.
+ */
 static u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
 {
 	u64 flags;
 
-	if (data)
+	if (data == 1)
 		flags = BTRFS_BLOCK_GROUP_DATA;
+	else if (data == 2)
+		flags = BTRFS_BLOCK_GROUP_DATA_SSD;
+	else if (data == 3)
+		flags = BTRFS_BLOCK_GROUP_METADATA_SSD;
 	else if (root == root->fs_info->chunk_root)
 		flags = BTRFS_BLOCK_GROUP_SYSTEM;
 	else
@@ -2998,6 +3010,19 @@ static void force_metadata_allocation(struct btrfs_fs_info *info)
 	rcu_read_unlock();
 }
 
+static void force_metadata_ssd_allocation(struct btrfs_fs_info *info)
+{
+	struct list_head *head = &info->space_info;
+	struct btrfs_space_info *found;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(found, head, list) {
+		if (found->flags & BTRFS_BLOCK_GROUP_METADATA_SSD)
+			found->force_alloc = 1;
+	}
+	rcu_read_unlock();
+}
+
 static int should_alloc_chunk(struct btrfs_space_info *sinfo,
 			      u64 alloc_bytes)
 {
@@ -3060,6 +3085,14 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 			force_metadata_allocation(fs_info);
 	}
 
+	if (flags & BTRFS_BLOCK_GROUP_DATA_SSD &&
+		fs_info->metadata_ssd_ratio) {
+		fs_info->data_ssd_chunk_allocations++;
+		if (!(fs_info->data_ssd_chunk_allocations %
+		      fs_info->metadata_ssd_ratio))
+			force_metadata_ssd_allocation(fs_info);
+	}
+
 	ret = btrfs_alloc_chunk(trans, extent_root, flags);
 	spin_lock(&space_info->lock);
 	if (ret)
@@ -3503,6 +3536,20 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
 	meta_used = sinfo->bytes_used;
 	spin_unlock(&sinfo->lock);
 
+	sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA_SSD);
+	if (sinfo) {
+		spin_lock(&sinfo->lock);
+		data_used += sinfo->bytes_used;
+		spin_unlock(&sinfo->lock);
+	}
+
+	sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA_SSD);
+	if (sinfo) {
+		spin_lock(&sinfo->lock);
+		meta_used += sinfo->bytes_used;
+		spin_unlock(&sinfo->lock);
+	}
+
 	num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) *
 		    csum_size * 2;
 	num_bytes += div64_u64(data_used + meta_used, 50);
@@ -3518,7 +3565,6 @@ static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	struct btrfs_space_info *sinfo = block_rsv->space_info;
 	u64 num_bytes;
-
 	num_bytes = calc_global_metadata_size(fs_info);
 
 	spin_lock(&block_rsv->lock);
@@ -4831,7 +4877,8 @@ checks:
 		BUG_ON(offset > search_start);
 
 		ret = update_reserved_bytes(block_group, num_bytes, 1,
-					    (data & BTRFS_BLOCK_GROUP_DATA));
+					  (data & BTRFS_BLOCK_GROUP_DATA) ||
+					  (data & BTRFS_BLOCK_GROUP_DATA_SSD));
 		if (ret == -EAGAIN) {
 			btrfs_add_free_space(block_group, offset, num_bytes);
 			goto loop;
@@ -4939,7 +4986,8 @@ loop:
 
 	/* we found what we needed */
 	if (ins->objectid) {
-		if (!(data & BTRFS_BLOCK_GROUP_DATA))
+		if (!(data & BTRFS_BLOCK_GROUP_DATA) &&
+		    !(data & BTRFS_BLOCK_GROUP_DATA_SSD))
 			trans->block_group = block_group->key.objectid;
 
 		btrfs_put_block_group(block_group);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a4080c2..d17118a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -961,6 +961,22 @@ int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
 			      0, NULL, cached_state, mask);
 }
 
+int set_extent_prefer_nonrotating(struct extent_io_tree *tree, u64 start,
+				  u64 end, struct extent_state **cached_state,
+				  gfp_t mask)
+{
+	return set_extent_bit(tree, start, end, EXTENT_PREFER_NONROTATING,
+			      0, NULL, cached_state, mask);
+}
+
+int set_extent_prefer_rotating(struct extent_io_tree *tree, u64 start,
+				  u64 end, struct extent_state **cached_state,
+				  gfp_t mask)
+{
+	return set_extent_bit(tree, start, end, EXTENT_PREFER_ROTATING,
+			      0, NULL, cached_state, mask);
+}
+
 int clear_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
 		       gfp_t mask)
 {
@@ -2468,8 +2484,10 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
 	int ret = 0;
 	int done = 0;
 	int nr_to_write_done = 0;
+	int nr_written = 0;
 	struct pagevec pvec;
 	int nr_pages;
+	u64 start;
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
 	int scanned = 0;
@@ -2486,6 +2504,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
 			range_whole = 1;
 		scanned = 1;
 	}
+	start = (u64)index << PAGE_CACHE_SHIFT;
 retry:
 	while (!done && !nr_to_write_done && (index <= end) &&
 	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
@@ -2547,10 +2566,13 @@ retry:
 			 * at any time
 			 */
 			nr_to_write_done = wbc->nr_to_write <= 0;
+			nr_written += 1;
 		}
+
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+
 	if (!scanned && !done) {
 		/*
 		 * We hit the last page and there is more work to be done: wrap
@@ -2560,6 +2582,18 @@ retry:
 		index = 0;
 		goto retry;
 	}
+
+	/*
+	 * Update access frequency statistics.
+	 * i_ino == 1 appears to come from metadata operations; ignore
+	 * those writes.
+	 */
+	if (BTRFS_TRACK_THIS_INODE(BTRFS_I(mapping->host)) &&
+		mapping->host->i_ino > 1 && nr_written > 0) {
+		btrfs_update_freqs(mapping->host, start,
+			nr_written * PAGE_CACHE_SIZE, 1);
+	}
+
 	return ret;
 }
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5691c7b..a51e7c6 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -17,6 +17,8 @@
 #define EXTENT_NODATASUM (1 << 10)
 #define EXTENT_DO_ACCOUNTING (1 << 11)
 #define EXTENT_FIRST_DELALLOC (1 << 12)
+#define EXTENT_PREFER_NONROTATING (1 << 13)
+#define EXTENT_PREFER_ROTATING (1 << 14)
 #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
@@ -205,6 +207,11 @@ int clear_extent_ordered_metadata(struct extent_io_tree *tree, u64 start,
 				  u64 end, gfp_t mask);
 int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
 			struct extent_state **cached_state, gfp_t mask);
+int set_extent_prefer_nonrotating(struct extent_io_tree *tree, u64 start,
+			u64 end, struct extent_state **cached_state,
+			gfp_t mask);
+int set_extent_prefer_rotating(struct extent_io_tree *tree, u64 start, u64 end,
+			struct extent_state **cached_state, gfp_t mask);
 int set_extent_ordered(struct extent_io_tree *tree, u64 start, u64 end,
 		     gfp_t mask);
 int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f08427c..25d2404 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -37,6 +37,7 @@
 #include <linux/posix_acl.h>
 #include <linux/falloc.h>
 #include <linux/slab.h>
+#include <linux/pagevec.h>
 #include "compat.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -50,6 +51,8 @@
 #include "tree-log.h"
 #include "compression.h"
 #include "locking.h"
+#include "hotdata_map.h"
+#include "hotdata_relocate.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -763,6 +766,9 @@ static noinline int cow_file_range(struct inode *inode,
 	struct extent_map *em;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	int ret = 0;
+	int prefer_nonrot;
+	int prefer_rot;
+	int chunk_type = 1;
 
 	trans = btrfs_join_transaction(root, 1);
 	BUG_ON(!trans);
@@ -776,6 +782,79 @@ static noinline int cow_file_range(struct inode *inode,
 	disk_num_bytes = num_bytes;
 	ret = 0;
 
+	/*
+	 * Use COW operations to move hot data to SSD and cold data
+	 * back to rotating disk.  Sets chunk_type to 1 to write to
+	 * BTRFS_BLOCK_GROUP_DATA, or to 2 to write to
+	 * BTRFS_BLOCK_GROUP_DATA_SSD.
+	 */
+	if (BTRFS_MOVE_THIS_INODE(BTRFS_I(inode))) {
+		prefer_nonrot = test_range_bit(&BTRFS_I(inode)->io_tree,
+			start, end, EXTENT_PREFER_NONROTATING, 1, NULL);
+		prefer_rot = test_range_bit(&BTRFS_I(inode)->io_tree,
+			start, end, EXTENT_PREFER_ROTATING, 1, NULL);
+		WARN_ON(prefer_nonrot && prefer_rot);
+
+		if (prefer_nonrot)
+			chunk_type = 2;
+		if (prefer_rot)
+			chunk_type = 1;
+
+		/*
+		 * Although the async thread has not chosen this range
+		 * for relocation to SSD, we're COWing the data anyway
+		 * so let's test the range now. Note that "range" here
+		 * is different from ranges on RANGE_SIZE boundaries.
+		 */
+		if (!(prefer_rot || prefer_nonrot)) {
+			int temperature = 0;
+			struct hot_inode_item *he;
+			struct hot_range_item *hr;
+
+			/* Test just the first proper hotdata range */
+			he = lookup_hot_inode_item(
+				&root->hot_inode_tree, inode->i_ino);
+			if (!he)
+				goto skip_cow_reloc;
+			hr = lookup_hot_range_item(&he->hot_range_tree,
+						   start & RANGE_SIZE_MASK);
+			if (!hr) {
+				free_hot_inode_item(he);
+				goto skip_cow_reloc;
+			}
+
+			spin_lock(&hr->lock);
+			temperature = btrfs_get_temp(&hr->freq_data);
+			spin_unlock(&hr->lock);
+
+			if (temperature >=
+				root->fs_info->fs_root->heat_threshold) {
+				/* This range is hot */
+				chunk_type = 2;
+
+				/*
+				 * Set extent flags and location so future
+				 * operations keep the range on SSD
+				 */
+				btrfs_set_extent_prefer_nonrotating(inode,
+					start, end, NULL);
+				clear_extent_bits(&BTRFS_I(inode)->io_tree,
+					start, end, EXTENT_PREFER_ROTATING,
+					GFP_NOFS);
+				spin_lock(&hr->lock);
+				spin_lock(&hr->heat_node->location_lock);
+				hr->heat_node->location = BTRFS_ON_NONROTATING;
+				spin_unlock(&hr->heat_node->location_lock);
+				spin_unlock(&hr->lock);
+			} else
+				chunk_type = 1;
+
+			free_hot_range_item(hr);
+			free_hot_inode_item(he);
+		}
+	}
+
+skip_cow_reloc:
 	if (start == 0) {
 		/* lets try to make an inline extent */
 		ret = cow_file_range_inline(trans, root, inode,
@@ -811,7 +890,10 @@ static noinline int cow_file_range(struct inode *inode,
 		cur_alloc_size = disk_num_bytes;
 		ret = btrfs_reserve_extent(trans, root, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
-					   (u64)-1, &ins, 1);
+					   (u64)-1, &ins, chunk_type);
+		if (ret)
+			printk(KERN_INFO "btrfs cow_file_range btrfs_reserve"
+				"_extent returned %d\n", ret);
 		BUG_ON(ret);
 
 		em = alloc_extent_map(GFP_NOFS);
@@ -1225,9 +1307,25 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 			      unsigned long *nr_written)
 {
 	int ret;
+	int prefer_rot = 0;
+	int prefer_nonrot = 0;
+
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 
-	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)
+	/*
+	 * Force COW for hot data relocation
+	 */
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW &&
+		BTRFS_MOVE_THIS_INODE(BTRFS_I(inode))) {
+		prefer_nonrot = test_range_bit(&BTRFS_I(inode)->io_tree,
+			start, end, EXTENT_PREFER_NONROTATING, 1, NULL);
+		prefer_rot = test_range_bit(&BTRFS_I(inode)->io_tree,
+			start, end, EXTENT_PREFER_ROTATING, 1, NULL);
+		WARN_ON(prefer_nonrot && prefer_rot);
+	}
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !(prefer_rot ||
+		prefer_nonrot))
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC)
@@ -1480,6 +1578,26 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 				   cached_state, GFP_NOFS);
 }
 
+int btrfs_set_extent_prefer_nonrotating(struct inode *inode, u64 start,
+				     u64 end, struct extent_state
+				     **cached_state)
+{
+	if ((end & (PAGE_CACHE_SIZE - 1)) == 0)
+		WARN_ON(1);
+	return set_extent_prefer_nonrotating(&BTRFS_I(inode)->io_tree, start,
+					  end, cached_state, GFP_NOFS);
+}
+
+int btrfs_set_extent_prefer_rotating(struct inode *inode, u64 start,
+				     u64 end, struct extent_state
+				     **cached_state)
+{
+	if ((end & (PAGE_CACHE_SIZE - 1)) == 0)
+		WARN_ON(1);
+	return set_extent_prefer_rotating(&BTRFS_I(inode)->io_tree, start,
+					  end, cached_state, GFP_NOFS);
+}
+
 /* see btrfs_writepage_start_hook for details on why this is required */
 struct btrfs_writepage_fixup {
 	struct page *page;
@@ -2870,6 +2988,18 @@ static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
 				 dentry->d_name.name, dentry->d_name.len);
 	BUG_ON(ret);
 
+	if (BTRFS_TRACKING_HOT_DATA(root)) {
+		struct hot_inode_item *he;
+
+		he = lookup_hot_inode_item(
+			&root->hot_inode_tree, inode->i_ino);
+
+		if (he) {
+			btrfs_remove_inode_from_heat_index(he, root);
+			free_hot_inode_item(he);
+		}
+	}
+
 	if (inode->i_nlink == 0) {
 		ret = btrfs_orphan_add(trans, inode);
 		BUG_ON(ret);
@@ -5781,6 +5911,11 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 	lockstart = offset;
 	lockend = offset + count - 1;
 
+	/* Update access frequency statistics */
+	if (BTRFS_TRACK_THIS_INODE(BTRFS_I(inode)) && count > 0)
+		btrfs_update_freqs(inode, lockstart, (u64) count,
+			writing);
+
 	if (writing) {
 		ret = btrfs_delalloc_reserve_space(inode, count);
 		if (ret)
@@ -5860,7 +5995,16 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 int btrfs_readpage(struct file *file, struct page *page)
 {
 	struct extent_io_tree *tree;
+	u64 start;
+
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
+	start = (u64) page->index << PAGE_CACHE_SHIFT;
+
+	/* Update access frequency statistics */
+	if (BTRFS_TRACK_THIS_INODE(BTRFS_I(page->mapping->host)))
+		btrfs_update_freqs(page->mapping->host, start,
+			PAGE_CACHE_SIZE, 0);
+
 	return extent_read_full_page(tree, page, btrfs_get_extent);
 }
 
@@ -5868,13 +6012,14 @@ static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
 {
 	struct extent_io_tree *tree;
 
-
 	if (current->flags & PF_MEMALLOC) {
 		redirty_page_for_writepage(wbc, page);
 		unlock_page(page);
 		return 0;
 	}
+
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
+
 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
 }
 
@@ -5884,6 +6029,7 @@ int btrfs_writepages(struct address_space *mapping,
 	struct extent_io_tree *tree;
 
 	tree = &BTRFS_I(mapping->host)->io_tree;
+
 	return extent_writepages(tree, mapping, btrfs_get_extent, wbc);
 }
 
@@ -5892,7 +6038,17 @@ btrfs_readpages(struct file *file, struct address_space *mapping,
 		struct list_head *pages, unsigned nr_pages)
 {
 	struct extent_io_tree *tree;
+	u64 start, len;
+
 	tree = &BTRFS_I(mapping->host)->io_tree;
+	start = (u64) (list_entry(pages->prev, struct page, lru)->index)
+		<< PAGE_CACHE_SHIFT;
+	len = (u64)nr_pages * PAGE_CACHE_SIZE;
+
+	/* Update access frequency statistics */
+	if (len > 0 && BTRFS_TRACK_THIS_INODE(BTRFS_I(mapping->host)))
+		btrfs_update_freqs(mapping->host, start, len, 0);
+
 	return extent_readpages(tree, mapping, pages, nr_pages,
 				btrfs_get_extent);
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 859ddaa..c1c22a0 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -51,6 +51,9 @@
 #include "version.h"
 #include "export.h"
 #include "compression.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "hotdata_relocate.h"
 
 static const struct super_operations btrfs_super_ops;
 
@@ -59,6 +62,11 @@ static void btrfs_put_super(struct super_block *sb)
 	struct btrfs_root *root = btrfs_sb(sb);
 	int ret;
 
+	root->heat_threshold = 0;
+
+	if (btrfs_test_opt(root, HOTDATA_TRACK))
+		btrfs_exit_debugfs_volume(sb);
+
 	ret = close_ctree(root);
 	sb->s_fs_info = NULL;
 }
@@ -68,7 +76,7 @@ enum {
 	Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd,
 	Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress,
 	Opt_compress_force, Opt_notreelog, Opt_ratio, Opt_flushoncommit,
-	Opt_discard, Opt_err,
+	Opt_discard, Opt_hotdatatrack, Opt_hotdatamove, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -92,6 +100,8 @@ static match_table_t tokens = {
 	{Opt_flushoncommit, "flushoncommit"},
 	{Opt_ratio, "metadata_ratio=%d"},
 	{Opt_discard, "discard"},
+	{Opt_hotdatatrack, "hotdatatrack"},
+	{Opt_hotdatamove, "hotdatamove"},
 	{Opt_err, NULL},
 };
 
@@ -235,6 +245,18 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_discard:
 			btrfs_set_opt(info->mount_opt, DISCARD);
 			break;
+		case Opt_hotdatamove:
+			printk(KERN_INFO "btrfs: turning on hot data "
+				"migration\n");
+			printk(KERN_INFO "       (implies hotdatatrack, "
+				"no ssd_spread)\n");
+			btrfs_set_opt(info->mount_opt, HOTDATA_MOVE);
+			btrfs_clear_opt(info->mount_opt, SSD_SPREAD); /* fall through */
+		case Opt_hotdatatrack:
+			printk(KERN_INFO "btrfs: turning on hot data"
+				" tracking\n");
+			btrfs_set_opt(info->mount_opt, HOTDATA_TRACK);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -457,6 +479,17 @@ static int btrfs_fill_super(struct super_block *sb,
 		printk("btrfs: open_ctree failed\n");
 		return PTR_ERR(tree_root);
 	}
+
+	/*
+	 * Initialize the relocation kthreads when HOTDATA_TRACK is set,
+	 * to allow a seamless remount that enables HOTDATA_MOVE.
+	 */
+	if (btrfs_test_opt(tree_root, HOTDATA_TRACK)) {
+		init_hash_list_kthread(tree_root);
+		init_hot_data_relocate_kthread(tree_root);
+		init_ssd_scan_kthread(tree_root);
+	}
+
 	sb->s_fs_info = tree_root;
 	disk_super = &tree_root->fs_info->super_copy;
 
@@ -658,6 +691,8 @@ static int btrfs_get_sb(struct file_system_type *fs_type, int flags,
 
 	mnt->mnt_sb = s;
 	mnt->mnt_root = root;
+	if (btrfs_test_opt(btrfs_sb(s), HOTDATA_TRACK))
+		btrfs_init_debugfs_volume(dev_name, s);
 
 	kfree(subvol_name);
 	return 0;
@@ -846,18 +881,30 @@ static int __init init_btrfs_fs(void)
 	if (err)
 		goto free_sysfs;
 
-	err = extent_io_init();
+	err = btrfs_init_debugfs();
 	if (err)
 		goto free_cachep;
 
+	err = extent_io_init();
+	if (err)
+		goto free_debugfs;
+
 	err = extent_map_init();
 	if (err)
 		goto free_extent_io;
 
-	err = btrfs_interface_init();
+	err = hot_inode_item_init();
 	if (err)
 		goto free_extent_map;
 
+	err = hot_range_item_init();
+	if (err)
+		goto free_hot_inode_item;
+
+	err = btrfs_interface_init();
+	if (err)
+		goto free_hot_range_item;
+
 	err = register_filesystem(&btrfs_fs_type);
 	if (err)
 		goto unregister_ioctl;
@@ -867,10 +914,16 @@ static int __init init_btrfs_fs(void)
 
 unregister_ioctl:
 	btrfs_interface_exit();
+free_hot_range_item:
+	hot_range_item_exit();
+free_hot_inode_item:
+	hot_inode_item_exit();
 free_extent_map:
 	extent_map_exit();
 free_extent_io:
 	extent_io_exit();
+free_debugfs:
+	btrfs_exit_debugfs();
 free_cachep:
 	btrfs_destroy_cachep();
 free_sysfs:
@@ -882,10 +935,13 @@ static void __exit exit_btrfs_fs(void)
 {
 	btrfs_destroy_cachep();
 	extent_map_exit();
+	hot_inode_item_exit();
+	hot_range_item_exit();
 	extent_io_exit();
 	btrfs_interface_exit();
 	unregister_filesystem(&btrfs_fs_type);
 	btrfs_exit_sysfs();
+	btrfs_exit_debugfs();
 	btrfs_cleanup_fs_uuids();
 	btrfs_zlib_exit();
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d6e3af8..62fd1ab 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2210,10 +2210,12 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		min_stripes = 4;
 	}
 
-	if (type & BTRFS_BLOCK_GROUP_DATA) {
+	if (type & BTRFS_BLOCK_GROUP_DATA ||
+	    type & BTRFS_BLOCK_GROUP_DATA_SSD) {
 		max_chunk_size = 10 * calc_size;
 		min_stripe_size = 64 * 1024 * 1024;
-	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
+	} else if (type & BTRFS_BLOCK_GROUP_METADATA ||
+		   type & BTRFS_BLOCK_GROUP_METADATA_SSD) {
 		max_chunk_size = 256 * 1024 * 1024;
 		min_stripe_size = 32 * 1024 * 1024;
 	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
@@ -2274,15 +2276,43 @@ again:
 
 	INIT_LIST_HEAD(&private_devs);
 	while (index < num_stripes) {
+		int dev_rotating;
+		int skip_device = 0;
 		device = list_entry(cur, struct btrfs_device, dev_alloc_list);
 		BUG_ON(!device->writeable);
+		dev_rotating = !blk_queue_nonrot(bdev_get_queue(device->bdev));
+
+		/*
+		 * If HOTDATA_MOVE is set, the chunk type being allocated
+		 * determines which disks the data may be allocated on.
+		 * This can cause problems if, for example, the data alloc
+		 * profile is RAID0 and there are only two devices, 1 SSD +
+		 * 1 HDD.  All allocations to BTRFS_BLOCK_GROUP_DATA_SSD
+		 * in this config will return -ENOSPC as the allocation code
+		 * can't find allowable space for the second stripe.
+		 */
+		if (btrfs_test_opt(extent_root, HOTDATA_MOVE)) {
+			if (type & BTRFS_BLOCK_GROUP_DATA &&
+				!dev_rotating)
+				skip_device = 1;
+			if (type & BTRFS_BLOCK_GROUP_METADATA &&
+				!dev_rotating)
+				skip_device = 1;
+			if (type & BTRFS_BLOCK_GROUP_DATA_SSD &&
+				dev_rotating)
+				skip_device = 1;
+			if (type & BTRFS_BLOCK_GROUP_METADATA_SSD &&
+				dev_rotating)
+				skip_device = 1;
+		}
 		if (device->total_bytes > device->bytes_used)
 			avail = device->total_bytes - device->bytes_used;
 		else
 			avail = 0;
-		cur = cur->next;
 
-		if (device->in_fs_metadata && avail >= min_free) {
+		cur = cur->next;
+		if (!skip_device &&
+			device->in_fs_metadata && avail >= min_free) {
 			ret = find_free_dev_extent(trans, device,
 						   min_free, &dev_offset,
 						   &max_avail);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality
  2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
                   ` (5 preceding siblings ...)
  2010-08-12 22:22 ` [RFC v2 PATCH 6/6] Btrfs: Add hooks to enable hot data tracking bchociej
@ 2010-08-26  2:13 ` Shaohua Li
  2010-08-30  0:42     ` Hubert Kario
  6 siblings, 1 reply; 12+ messages in thread
From: Shaohua Li @ 2010-08-26  2:13 UTC (permalink / raw)
  To: bchociej
  Cc: chris.mason, linux-btrfs, linux-fsdevel, linux-kernel, cmm,
	bcchocie, mrlupfer, crscott, mlupfer, conscott

On Fri, Aug 13, 2010 at 06:22:00AM +0800, bchociej@gmail.com wrote:
> 
> - Hooks in existing Btrfs functions to track data access frequency
>   (btrfs_direct_IO, btrfs_readpages, and extent_write_cache_pages)
> 
> - New rbtrees for tracking access frequency of inodes and sub-file
>   ranges (hotdata_map.c)
> 
> - A hash list for indexing data by its temperature (hotdata_hash.c)
> 
> - A debugfs interface for dumping data from the rbtrees (debugfs.c)
> 
> - A background kthread for relocating data to faster media based on
>   temperature
Hi,
I'm wondering if the temperature info can be exported to userspace so
that a daemon can do the relocation (by ioctl). A userspace daemon is
more flexible.
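
For concreteness, a hypothetical sketch of such a daemon's inner loop;
the record format and the ioctl named here are invented for
illustration and are not part of this patch set:

	/* hypothetical record as a temperature export might emit it */
	struct hot_range_rec { __u64 ino, off, len; int temp; };

	static void relocate_hot_ranges(int fs_fd, FILE *temps, int thresh)
	{
		struct hot_range_rec r;

		while (fread(&r, sizeof(r), 1, temps) == 1) {
			if (r.temp < thresh)
				continue;
			/* e.g. ioctl(fs_fd, BTRFS_IOC_MOVE_RANGE, &r); */
		}
	}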

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality
  2010-08-26  2:13 ` [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality Shaohua Li
@ 2010-08-30  0:42   ` Hubert Kario
  0 siblings, 0 replies; 12+ messages in thread
From: Hubert Kario @ 2010-08-30  0:42 UTC (permalink / raw)
  To: Shaohua Li
  Cc: bchociej, chris.mason, linux-btrfs, linux-fsdevel, linux-kernel,
	cmm, bcchocie, mrlupfer, crscott, mlupfer, conscott

On Thursday 26 of August 2010 04:13:43 Shaohua Li wrote:
> On Fri, Aug 13, 2010 at 06:22:00AM +0800, bchociej@gmail.com wrote:
> > - Hooks in existing Btrfs functions to track data access frequency
> > 
> >   (btrfs_direct_IO, btrfs_readpages, and extent_write_cache_pages)
> > 
> > - New rbtrees for tracking access frequency of inodes and sub-file
> > 
> >   ranges (hotdata_map.c)
> > 
> > - A hash list for indexing data by its temperature (hotdata_hash.c)
> > 
> > - A debugfs interface for dumping data from the rbtrees (debugfs.c)
> > 
> > - A background kthread for relocating data to faster media based on
> > 
> >   temperature
> 
> Hi,
> I'm wondering if the temperature info can be exported to userspace so
> that a daemon can do the relocation (by ioctl). A userspace daemon is
> more flexible.

The flexibility of a userspace daemon is one thing; the ability to let
the admin precisely control which drive data is placed on, which could
be really beneficial in some scenarios, is another.

This would also allow online defragmentation: together with access to
the statistics, that is something which (for quick runs) has a really
good time/performance benefit ratio.
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality
  2010-08-30  0:42     ` Hubert Kario
  (?)
  (?)
@ 2010-08-30  1:05     ` Shaohua Li
  -1 siblings, 0 replies; 12+ messages in thread
From: Shaohua Li @ 2010-08-30  1:05 UTC (permalink / raw)
  To: Hubert Kario
  Cc: bchociej, chris.mason, linux-btrfs, linux-fsdevel, linux-kernel,
	cmm, bcchocie, mrlupfer, crscott, mlupfer, conscott

On Mon, 2010-08-30 at 08:42 +0800, Hubert Kario wrote:
> On Thursday 26 of August 2010 04:13:43 Shaohua Li wrote:
> > On Fri, Aug 13, 2010 at 06:22:00AM +0800, bchociej@gmail.com wrote:
> > > - Hooks in existing Btrfs functions to track data access frequency
> > > 
> > >   (btrfs_direct_IO, btrfs_readpages, and extent_write_cache_pages)
> > > 
> > > - New rbtrees for tracking access frequency of inodes and sub-file
> > > 
> > >   ranges (hotdata_map.c)
> > > 
> > > - A hash list for indexing data by its temperature (hotdata_hash.c)
> > > 
> > > - A debugfs interface for dumping data from the rbtrees (debugfs.c)
> > > 
> > > - A background kthread for relocating data to faster media based on
> > > 
> > >   temperature
> > 
> > Hi,
> > I'm wondering if the temperature info can be exported to userspace so
> > that a daemon can do the relocation (by ioctl). A userspace daemon is
> > more flexible.
> 
> The flexibility of a userspace daemon is one thing; the ability to let
> the admin precisely control which drive data is placed on, which could
> be really beneficial in some scenarios, is another.
> 
> This would also allow online defragmentation: together with access to
> the statistics, that is something which (for quick runs) has a really
> good time/performance benefit ratio.

Agreed, I'm thinking of online defragmentation based on hot access too.
Btrfs usually has more fragmentation.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-08-30  1:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-12 22:22 [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality bchociej
2010-08-12 22:22 ` [RFC v2 PATCH 1/6] Btrfs: Add experimental hot data hash list index bchociej
2010-08-12 22:22 ` [RFC v2 PATCH 2/6] Btrfs: Add data structures for hot data tracking bchociej
2010-08-12 22:22 ` [RFC v2 PATCH 3/6] Btrfs: Add hot data relocation facilities bchociej
2010-08-12 22:22 ` [RFC v2 PATCH 4/6] Btrfs: Add debugfs interface for hot data stats bchociej
2010-08-12 22:22 ` [RFC v2 PATCH 5/6] Btrfs: 3 new ioctls related to hot data features bchociej
2010-08-12 22:22 ` [RFC v2 PATCH 6/6] Btrfs: Add hooks to enable hot data tracking bchociej
2010-08-26  2:13 ` [RFC v2 PATCH 0/6] Btrfs: Add hot data relocation functionality Shaohua Li
2010-08-30  0:42   ` Hubert Kario
2010-08-30  1:05     ` Shaohua Li
