* [RFC 0/5] BTRFS hot relocation support
@ 2013-05-06  8:53 zwu.kernel
  2013-05-06  8:53 ` [RFC 1/5] vfs: add one list_head field zwu.kernel
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: zwu.kernel @ 2013-05-06  8:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, idryomov, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  This patchset is sent out as an RFC mainly to check whether it is
going in the right development direction.

  The patchset introduces hot relocation support for BTRFS. In a
hybrid storage environment, when data on an HDD becomes hot, BTRFS
hot relocation can automatically move it to an SSD; conversely, when
the SSD usage ratio exceeds its upper threshold, data that has become
cold is looked up and relocated back to the HDD first to free up SSD
space, and then the data that has become hot is relocated to the SSD
automatically.

  BTRFS hot relocation works by first reserving block space on the
SSD, then loading the hot data from the HDD into the page cache,
allocating block space on the SSD, and finally writing the data out
to the SSD.

  If you'd like to play with it, please pull the patchset from
my git tree on github:
  https://github.com/wuzhy/kernel.git hot_reloc

For how to use it, please refer to the example below:

root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
^^^ The above command marks /dev/vdc as a non-rotational (SSD) disk
root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
     
WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
     
[ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
[ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
[ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
adding device /dev/vdc id 2
[ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
fs created label (null) on /dev/vdb
nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
Btrfs v0.20-rc1-254-gb0136aa-dirty
root@debian-i386:~# mount -o hot_move /dev/vdb /data2
[ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
[ 144.870444] btrfs: disk space caching is enabled
[ 144.904214] VFS: Turning on hot data tracking
root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.0G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=2.00GB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.19MB
Data_SSD: total=8.00MB, used=0.00
root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
^^^ The above command triggers hot relocation, because the data temperature is currently 109
root@debian-i386:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 16G 13G 2.2G 86% /
tmpfs 4.8G 0 4.8G 0% /lib/init/rw
udev 10M 176K 9.9M 2% /dev
tmpfs 4.8G 0 4.8G 0% /dev/shm
/dev/vdb 15G 2.1G 13G 14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=6.25MB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.26MB
Data_SSD: total=2.01GB, used=2.00GB
root@debian-i386:~# 

Zhi Yong Wu (5):
  vfs: add one list_head field
  btrfs: add one new block group
  btrfs: add one hot relocation kthread
  procfs: add three proc interfaces
  btrfs: add hot relocation support

 fs/btrfs/Makefile            |   3 +-
 fs/btrfs/ctree.h             |  26 +-
 fs/btrfs/extent-tree.c       | 107 +++++-
 fs/btrfs/extent_io.c         |  31 +-
 fs/btrfs/extent_io.h         |   4 +
 fs/btrfs/file.c              |  36 +-
 fs/btrfs/hot_relocate.c      | 802 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hot_relocate.h      |  48 +++
 fs/btrfs/inode-map.c         |  13 +-
 fs/btrfs/inode.c             |  92 ++++-
 fs/btrfs/ioctl.c             |  23 +-
 fs/btrfs/relocation.c        |  14 +-
 fs/btrfs/super.c             |  30 +-
 fs/btrfs/volumes.c           |  28 +-
 fs/hot_tracking.c            |   1 +
 include/linux/btrfs.h        |   4 +
 include/linux/hot_tracking.h |   1 +
 kernel/sysctl.c              |  22 ++
 18 files changed, 1234 insertions(+), 51 deletions(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC 1/5] vfs: add one list_head field
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
@ 2013-05-06  8:53 ` zwu.kernel
  2013-05-06  8:53 ` [RFC 2/5] btrfs: add one new block group zwu.kernel
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: zwu.kernel @ 2013-05-06  8:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, idryomov, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add one list_head field 'reloc_list' to struct hot_comm_item
to support hot relocation.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 1 +
 include/linux/hot_tracking.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 3b0002c..7071ac8 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -41,6 +41,7 @@ static void hot_comm_item_init(struct hot_comm_item *ci, int type)
 	clear_bit(HOT_IN_LIST, &ci->delete_flag);
 	clear_bit(HOT_DELETING, &ci->delete_flag);
 	INIT_LIST_HEAD(&ci->track_list);
+	INIT_LIST_HEAD(&ci->reloc_list);
 	memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
 	ci->hot_freq_data.avg_delta_reads = (u64) -1;
 	ci->hot_freq_data.avg_delta_writes = (u64) -1;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 2272975..49f901c 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -68,6 +68,7 @@ struct hot_comm_item {
 	struct rb_node rb_node;			/* rbtree index */
 	unsigned long delete_flag;
 	struct list_head track_list;		/* link to *_map[] */
+	struct list_head reloc_list;		/* used in hot relocation*/
 };
 
 /* An item representing an inode and its access frequency */
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 2/5] btrfs: add one new block group
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
  2013-05-06  8:53 ` [RFC 1/5] vfs: add one list_head field zwu.kernel
@ 2013-05-06  8:53 ` zwu.kernel
  2013-05-06  8:53 ` [RFC 3/5] btrfs: add one hot relocation kthread zwu.kernel
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: zwu.kernel @ 2013-05-06  8:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, idryomov, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Introduce one new block group type, BTRFS_BLOCK_GROUP_DATA_SSD,
which is used to distinguish whether block space is reserved and
allocated from an HDD or an SSD.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/Makefile       |   3 +-
 fs/btrfs/ctree.h        |  24 ++++++++++-
 fs/btrfs/extent-tree.c  | 107 +++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.c    |  31 ++++++++++++--
 fs/btrfs/extent_io.h    |   4 ++
 fs/btrfs/file.c         |  36 +++++++++++++---
 fs/btrfs/hot_relocate.c |  78 +++++++++++++++++++++++++++++++++++
 fs/btrfs/hot_relocate.h |  31 ++++++++++++++
 fs/btrfs/inode-map.c    |  13 +++++-
 fs/btrfs/inode.c        |  92 +++++++++++++++++++++++++++++++++--------
 fs/btrfs/ioctl.c        |  23 +++++++++--
 fs/btrfs/relocation.c   |  14 ++++++-
 fs/btrfs/super.c        |   3 +-
 fs/btrfs/volumes.c      |  28 ++++++++++++-
 14 files changed, 439 insertions(+), 48 deletions(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 3932224..94f1ea5 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o
+	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
+	   hot_relocate.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 701dec5..f4c4419 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -961,6 +961,16 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5    (1 << 7)
 #define BTRFS_BLOCK_GROUP_RAID6    (1 << 8)
+/*
+ * New block groups for use with hot data relocation feature. When hot data
+ * relocation is on, *_SSD block groups are forced to nonrotating drives and
+ * the plain DATA and METADATA block groups are forced to rotating drives.
+ *
+ * This should be further optimized, i.e. force metadata to SSD or relocate
+ * inode metadata to SSD when any of its subfile ranges are relocated to SSD
+ * so that reads and writes aren't delayed by HDD seeks.
+ */
+#define BTRFS_BLOCK_GROUP_DATA_SSD	(1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RESERVED	BTRFS_AVAIL_ALLOC_BIT_SINGLE
 
 enum btrfs_raid_types {
@@ -976,7 +986,8 @@ enum btrfs_raid_types {
 
 #define BTRFS_BLOCK_GROUP_TYPE_MASK	(BTRFS_BLOCK_GROUP_DATA |    \
 					 BTRFS_BLOCK_GROUP_SYSTEM |  \
-					 BTRFS_BLOCK_GROUP_METADATA)
+					 BTRFS_BLOCK_GROUP_METADATA | \
+					 BTRFS_BLOCK_GROUP_DATA_SSD)
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
@@ -1508,6 +1519,7 @@ struct btrfs_fs_info {
 	struct list_head space_info;
 
 	struct btrfs_space_info *data_sinfo;
+	struct btrfs_space_info *hot_data_sinfo;
 
 	struct reloc_control *reloc_ctl;
 
@@ -1532,6 +1544,7 @@ struct btrfs_fs_info {
 	u64 avail_data_alloc_bits;
 	u64 avail_metadata_alloc_bits;
 	u64 avail_system_alloc_bits;
+	u64 avail_data_ssd_alloc_bits;
 
 	/* restriper state */
 	spinlock_t balance_lock;
@@ -1544,6 +1557,7 @@ struct btrfs_fs_info {
 
 	unsigned data_chunk_allocations;
 	unsigned metadata_ratio;
+	unsigned data_ssd_chunk_allocations;
 
 	void *bdev_holder;
 
@@ -1901,6 +1915,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
 #define BTRFS_MOUNT_HOT_TRACK		(1 << 23)
+#define BTRFS_MOUNT_HOT_MOVE		(1 << 24)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
@@ -1922,6 +1937,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
 #define BTRFS_INODE_COMPRESS		(1 << 11)
+#define BTRFS_INODE_HOT			(1 << 12)
 
 #define BTRFS_INODE_ROOT_ITEM_INIT	(1 << 31)
 
@@ -3014,6 +3030,8 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_root *root,
 int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
 			  struct btrfs_root *root,
 			  u64 objectid, u64 offset, u64 bytenr);
+struct btrfs_block_group_cache *btrfs_lookup_first_block_group(
+				struct btrfs_fs_info *info, u64 bytenr);
 struct btrfs_block_group_cache *btrfs_lookup_block_group(
 						 struct btrfs_fs_info *info,
 						 u64 bytenr);
@@ -3070,6 +3088,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 			 struct btrfs_root *root,
 			 u64 bytenr, u64 num_bytes, u64 parent,
 			 u64 root_objectid, u64 owner, u64 offset, int for_cow);
+struct btrfs_block_group_cache *next_block_group(struct btrfs_root *root,
+			 struct btrfs_block_group_cache *cache);
 
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
 				    struct btrfs_root *root);
@@ -3102,6 +3122,7 @@ enum btrfs_reserve_flush_enum {
 
 int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
+void btrfs_free_reserved_ssd_data_space(struct inode *inode, u64 bytes);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root);
 int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
@@ -3118,6 +3139,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
 int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
+void btrfs_delalloc_release_ssd_space(struct inode *inode, u64 num_bytes);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
 					      unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d55123..676b08e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -598,7 +598,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 /*
  * return the block group that starts at or after bytenr
  */
-static struct btrfs_block_group_cache *
+struct btrfs_block_group_cache *
 btrfs_lookup_first_block_group(struct btrfs_fs_info *info, u64 bytenr)
 {
 	struct btrfs_block_group_cache *cache;
@@ -2961,7 +2961,7 @@ fail:
 
 }
 
-static struct btrfs_block_group_cache *
+struct btrfs_block_group_cache *
 next_block_group(struct btrfs_root *root,
 		 struct btrfs_block_group_cache *cache)
 {
@@ -3082,7 +3082,12 @@ again:
 					      &alloc_hint);
 	if (!ret)
 		dcs = BTRFS_DC_SETUP;
-	btrfs_free_reserved_data_space(inode, num_pages);
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+		btrfs_free_reserved_ssd_data_space(inode, num_pages);
+	} else
+		btrfs_free_reserved_data_space(inode, num_pages);
 
 out_put:
 	iput(inode);
@@ -3284,6 +3289,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	list_add_rcu(&found->list, &info->space_info);
 	if (flags & BTRFS_BLOCK_GROUP_DATA)
 		info->data_sinfo = found;
+	else if (flags & BTRFS_BLOCK_GROUP_DATA_SSD)
+		info->hot_data_sinfo = found;
 	return 0;
 }
 
@@ -3299,6 +3306,8 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 		fs_info->avail_metadata_alloc_bits |= extra_flags;
 	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		fs_info->avail_system_alloc_bits |= extra_flags;
+	if (flags & BTRFS_BLOCK_GROUP_DATA_SSD)
+		fs_info->avail_data_ssd_alloc_bits |= extra_flags;
 	write_sequnlock(&fs_info->profiles_lock);
 }
 
@@ -3405,18 +3414,27 @@ static u64 get_alloc_profile(struct btrfs_root *root, u64 flags)
 			flags |= root->fs_info->avail_system_alloc_bits;
 		else if (flags & BTRFS_BLOCK_GROUP_METADATA)
 			flags |= root->fs_info->avail_metadata_alloc_bits;
+		else if (flags & BTRFS_BLOCK_GROUP_DATA_SSD)
+			flags |= root->fs_info->avail_data_ssd_alloc_bits;
 	} while (read_seqretry(&root->fs_info->profiles_lock, seq));
 
 	return btrfs_reduce_alloc_profile(root, flags);
 }
 
+/*
+ * Turns a chunk_type integer into set of block group flags (a profile).
+ * Hot data relocation code adds chunk_types 2 and 3 for hot data specific
+ * block group types.
+ */
 u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
 {
 	u64 flags;
 	u64 ret;
 
-	if (data)
+	if (data == 1)
 		flags = BTRFS_BLOCK_GROUP_DATA;
+	else if (data == 2)
+		flags = BTRFS_BLOCK_GROUP_DATA_SSD;
 	else if (root == root->fs_info->chunk_root)
 		flags = BTRFS_BLOCK_GROUP_SYSTEM;
 	else
@@ -3437,6 +3455,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 used;
 	int ret = 0, committed = 0, alloc_chunk = 1;
+	int data, tried = 0;
 
 	/* make sure bytes are sectorsize aligned */
 	bytes = ALIGN(bytes, root->sectorsize);
@@ -3447,7 +3466,15 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes)
 		committed = 1;
 	}
 
-	data_sinfo = fs_info->data_sinfo;
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+try_hot:
+		data = 2;
+		data_sinfo = fs_info->hot_data_sinfo;
+	} else {
+		data = 1;
+		data_sinfo = fs_info->data_sinfo;
+	}
+
 	if (!data_sinfo)
 		goto alloc;
 
@@ -3465,13 +3492,22 @@ again:
 		 * if we don't have enough free bytes in this space then we need
 		 * to alloc a new chunk.
 		 */
-		if (!data_sinfo->full && alloc_chunk) {
+		if (alloc_chunk) {
 			u64 alloc_target;
 
+			if (data_sinfo->full) {
+				if (!tried) {
+					tried = 1;
+					spin_unlock(&data_sinfo->lock);
+					goto try_hot;
+				} else
+					goto non_alloc;
+			}
+
 			data_sinfo->force_alloc = CHUNK_ALLOC_FORCE;
 			spin_unlock(&data_sinfo->lock);
 alloc:
-			alloc_target = btrfs_get_alloc_profile(root, 1);
+			alloc_target = btrfs_get_alloc_profile(root, data);
 			trans = btrfs_join_transaction(root);
 			if (IS_ERR(trans))
 				return PTR_ERR(trans);
@@ -3488,11 +3524,13 @@ alloc:
 			}
 
 			if (!data_sinfo)
-				data_sinfo = fs_info->data_sinfo;
+				data_sinfo = (data == 1) ? fs_info->data_sinfo :
+						fs_info->hot_data_sinfo;
 
 			goto again;
 		}
 
+non_alloc:
 		/*
 		 * If we have less pinned bytes than we want to allocate then
 		 * don't bother committing the transaction, it won't help us.
@@ -3503,7 +3541,7 @@ alloc:
 
 		/* commit the current transaction and try again */
 commit_trans:
-		if (!committed &&
+		if (!committed && data_sinfo &&
 		    !atomic_read(&root->fs_info->open_ioctl_trans)) {
 			committed = 1;
 			trans = btrfs_join_transaction(root);
@@ -3517,6 +3555,10 @@ commit_trans:
 
 		return -ENOSPC;
 	}
+
+	if (tried)
+		BTRFS_I(inode)->flags |= BTRFS_INODE_HOT;
+
 	data_sinfo->bytes_may_use += bytes;
 	trace_btrfs_space_reservation(root->fs_info, "space_info",
 				      data_sinfo->flags, bytes, 1);
@@ -3544,6 +3586,22 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
 	spin_unlock(&data_sinfo->lock);
 }
 
+void btrfs_free_reserved_ssd_data_space(struct inode *inode, u64 bytes)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_space_info *data_sinfo;
+
+	/* make sure bytes are sectorsize aligned */
+	bytes = ALIGN(bytes, root->sectorsize);
+
+	data_sinfo = root->fs_info->hot_data_sinfo;
+	spin_lock(&data_sinfo->lock);
+	data_sinfo->bytes_may_use -= bytes;
+	trace_btrfs_space_reservation(root->fs_info, "space_info",
+				      data_sinfo->flags, bytes, 0);
+	spin_unlock(&data_sinfo->lock);
+}
+
 static void force_metadata_allocation(struct btrfs_fs_info *info)
 {
 	struct list_head *head = &info->space_info;
@@ -3715,6 +3773,13 @@ again:
 			force_metadata_allocation(fs_info);
 	}
 
+	if (flags & BTRFS_BLOCK_GROUP_DATA_SSD && fs_info->metadata_ratio) {
+		fs_info->data_ssd_chunk_allocations++;
+		if (!(fs_info->data_ssd_chunk_allocations %
+			fs_info->metadata_ratio))
+				force_metadata_allocation(fs_info);
+	}
+
 	/*
 	 * Check if we have enough space in SYSTEM chunk because we may need
 	 * to update devices.
@@ -4422,6 +4487,13 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
 	meta_used = sinfo->bytes_used;
 	spin_unlock(&sinfo->lock);
 
+	sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA_SSD);
+	if (sinfo) {
+		spin_lock(&sinfo->lock);
+		data_used += sinfo->bytes_used;
+		spin_unlock(&sinfo->lock);
+	}
+
 	num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) *
 		    csum_size * 2;
 	num_bytes += div64_u64(data_used + meta_used, 50);
@@ -4916,7 +4988,11 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
 
 	ret = btrfs_delalloc_reserve_metadata(inode, num_bytes);
 	if (ret) {
-		btrfs_free_reserved_data_space(inode, num_bytes);
+		if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+			BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+			btrfs_free_reserved_ssd_data_space(inode, num_bytes);
+		} else
+			btrfs_free_reserved_data_space(inode, num_bytes);
 		return ret;
 	}
 
@@ -4942,6 +5018,12 @@ void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes)
 	btrfs_free_reserved_data_space(inode, num_bytes);
 }
 
+void btrfs_delalloc_release_ssd_space(struct inode *inode, u64 num_bytes)
+{
+	btrfs_delalloc_release_metadata(inode, num_bytes);
+	btrfs_free_reserved_ssd_data_space(inode, num_bytes);
+}
+
 static int update_block_group(struct btrfs_root *root,
 			      u64 bytenr, u64 num_bytes, int alloc)
 {
@@ -5770,7 +5852,8 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_space_info *space_info;
 	int loop = 0;
 	int index = __get_raid_index(data);
-	int alloc_type = (data & BTRFS_BLOCK_GROUP_DATA) ?
+	int alloc_type = ((data & BTRFS_BLOCK_GROUP_DATA)
+		|| (data & BTRFS_BLOCK_GROUP_DATA_SSD)) ?
 		RESERVE_ALLOC_NO_ACCOUNT : RESERVE_ALLOC;
 	bool found_uncached_bg = false;
 	bool failed_cluster_refill = false;
@@ -8189,6 +8272,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 		fs_info->avail_metadata_alloc_bits &= ~extra_flags;
 	if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
 		fs_info->avail_system_alloc_bits &= ~extra_flags;
+	if (flags & BTRFS_BLOCK_GROUP_DATA_SSD)
+		fs_info->avail_data_ssd_alloc_bits &= ~extra_flags;
 	write_sequnlock(&fs_info->profiles_lock);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cdee391..608b7a8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1400,9 +1400,11 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 {
 	struct rb_node *node;
 	struct extent_state *state;
+	struct btrfs_root *root;
 	u64 cur_start = *start;
 	u64 found = 0;
 	u64 total_bytes = 0;
+	int flag = EXTENT_DELALLOC;
 
 	spin_lock(&tree->lock);
 
@@ -1417,13 +1419,27 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 		goto out;
 	}
 
+	root = BTRFS_I(tree->mapping->host)->root;
 	while (1) {
 		state = rb_entry(node, struct extent_state, rb_node);
 		if (found && (state->start != cur_start ||
 			      (state->state & EXTENT_BOUNDARY))) {
 			goto out;
 		}
-		if (!(state->state & EXTENT_DELALLOC)) {
+		if (btrfs_test_opt(root, HOT_MOVE)) {
+			if (!(state->state & EXTENT_DELALLOC) ||
+				(!(state->state & EXTENT_HOT) &&
+				!(state->state & EXTENT_COLD))) {
+				if (!found)
+					*end = state->end;
+				goto out;
+			} else {
+				if (!found)
+					flag = (state->state & EXTENT_HOT) ?
+						EXTENT_HOT : EXTENT_COLD;
+			}
+		}
+		if (!(state->state & flag)) {
 			if (!found)
 				*end = state->end;
 			goto out;
@@ -1610,7 +1626,13 @@ again:
 	lock_extent_bits(tree, delalloc_start, delalloc_end, 0, &cached_state);
 
 	/* then test to make sure it is all still delalloc */
-	ret = test_range_bit(tree, delalloc_start, delalloc_end,
+	if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE)) {
+		ret = test_range_bit(tree, delalloc_start, delalloc_end,
+			     EXTENT_DELALLOC | EXTENT_HOT, 1, cached_state);
+		ret |= test_range_bit(tree, delalloc_start, delalloc_end,
+			     EXTENT_DELALLOC | EXTENT_COLD, 1, cached_state);
+	} else
+		ret = test_range_bit(tree, delalloc_start, delalloc_end,
 			     EXTENT_DELALLOC, 1, cached_state);
 	if (!ret) {
 		unlock_extent_cached(tree, delalloc_start, delalloc_end,
@@ -1644,7 +1666,10 @@ int extent_clear_unlock_delalloc(struct inode *inode,
 		clear_bits |= EXTENT_LOCKED;
 	if (op & EXTENT_CLEAR_DIRTY)
 		clear_bits |= EXTENT_DIRTY;
-
+	if (op & EXTENT_CLEAR_HOT)
+		clear_bits |= EXTENT_HOT;
+	if (op & EXTENT_CLEAR_COLD)
+		clear_bits |= EXTENT_COLD;
 	if (op & EXTENT_CLEAR_DELALLOC)
 		clear_bits |= EXTENT_DELALLOC;
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 258c921..35e155f 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -19,6 +19,8 @@
 #define EXTENT_FIRST_DELALLOC (1 << 12)
 #define EXTENT_NEED_WAIT (1 << 13)
 #define EXTENT_DAMAGED (1 << 14)
+#define EXTENT_HOT (1 << 15)
+#define EXTENT_COLD (1 << 16)
 #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
@@ -51,6 +53,8 @@
 #define EXTENT_END_WRITEBACK	 0x20
 #define EXTENT_SET_PRIVATE2	 0x40
 #define EXTENT_CLEAR_ACCOUNTING  0x80
+#define EXTENT_CLEAR_HOT	 0x100
+#define EXTENT_CLEAR_COLD	 0x200
 
 /*
  * page->private values.  Every page that is controlled by the extent
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ade03e6..941b50e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -40,6 +40,7 @@
 #include "locking.h"
 #include "compat.h"
 #include "volumes.h"
+#include "hot_relocate.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -513,6 +514,10 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize);
 
 	end_of_last_block = start_pos + num_bytes - 1;
+
+	if (btrfs_test_opt(root, HOT_MOVE))
+		hot_set_extent(inode, start_pos, end_of_last_block, cached, 1);
+
 	err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
 					cached);
 	if (err)
@@ -1372,7 +1377,12 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 				    pos, first_index, write_bytes,
 				    force_page_uptodate);
 		if (ret) {
-			btrfs_delalloc_release_space(inode,
+			if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+				BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+				btrfs_delalloc_release_ssd_space(inode,
+					num_pages << PAGE_CACHE_SHIFT);
+			} else
+				btrfs_delalloc_release_space(inode,
 					num_pages << PAGE_CACHE_SHIFT);
 			break;
 		}
@@ -1410,7 +1420,12 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 				BTRFS_I(inode)->outstanding_extents++;
 				spin_unlock(&BTRFS_I(inode)->lock);
 			}
-			btrfs_delalloc_release_space(inode,
+			if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT)
+				btrfs_delalloc_release_ssd_space(inode,
+					(num_pages - dirty_pages) <<
+					PAGE_CACHE_SHIFT);
+			else
+				btrfs_delalloc_release_space(inode,
 					(num_pages - dirty_pages) <<
 					PAGE_CACHE_SHIFT);
 		}
@@ -1420,8 +1435,13 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 						dirty_pages, pos, copied,
 						NULL);
 			if (ret) {
-				btrfs_delalloc_release_space(inode,
-					dirty_pages << PAGE_CACHE_SHIFT);
+				if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+					BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+					btrfs_delalloc_release_ssd_space(inode,
+						dirty_pages << PAGE_CACHE_SHIFT);
+				} else
+					btrfs_delalloc_release_space(inode,
+						dirty_pages << PAGE_CACHE_SHIFT);
 				btrfs_drop_pages(pages, num_pages);
 				break;
 			}
@@ -2282,7 +2302,13 @@ out:
 		btrfs_qgroup_free(root, alloc_end - alloc_start);
 out_reserve_fail:
 	/* Let go of our reservation. */
-	btrfs_free_reserved_data_space(inode, alloc_end - alloc_start);
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+		btrfs_free_reserved_ssd_data_space(inode,
+					alloc_end - alloc_start);
+	} else
+		btrfs_free_reserved_data_space(inode,
+					alloc_end - alloc_start);
 	return ret;
 }
 
diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
new file mode 100644
index 0000000..1effd14
--- /dev/null
+++ b/fs/btrfs/hot_relocate.c
@@ -0,0 +1,78 @@
+/*
+ * fs/btrfs/hot_relocate.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *            Ben Chociej <bchociej@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include "hot_relocate.h"
+
+static void hot_set_extent_bits(struct extent_io_tree *tree, u64 start,
+		u64 end, struct extent_state **cached_state,
+		gfp_t mask, int storage_type, int flag)
+{
+	int set_bits = 0, clear_bits = 0;
+
+	if (flag) {
+		set_bits = EXTENT_DELALLOC | EXTENT_UPTODATE;
+		clear_bits = EXTENT_DIRTY | EXTENT_DELALLOC |
+				EXTENT_DO_ACCOUNTING;
+	}
+
+	if (storage_type == ON_ROT_DISK) {
+		set_bits |= EXTENT_COLD;
+		clear_bits |= EXTENT_HOT;
+	} else if (storage_type == ON_NONROT_DISK) {
+		set_bits |= EXTENT_HOT;
+		clear_bits |= EXTENT_COLD;
+	}
+
+	clear_extent_bit(tree, start, end, clear_bits,
+			0, 0, cached_state, mask);
+	set_extent_bit(tree, start, end, set_bits, NULL,
+			cached_state, mask);
+}
+
+void hot_set_extent(struct inode *inode, u64 start, u64 end,
+		struct extent_state **cached_state, int flag)
+{
+	int storage_type;
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+		if (flag)
+			BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+		storage_type = TYPE_NONROT;
+	} else
+		storage_type = TYPE_ROT;
+
+	hot_set_extent_bits(&BTRFS_I(inode)->io_tree, start,
+			end, cached_state, GFP_NOFS, storage_type, 0);
+}
+
+int hot_get_chunk_type(struct inode *inode, u64 start, u64 end)
+{
+	int hot, cold, ret = 1;
+
+	hot = test_range_bit(&BTRFS_I(inode)->io_tree,
+				start, end, EXTENT_HOT, 1, NULL);
+	cold = test_range_bit(&BTRFS_I(inode)->io_tree,
+				start, end, EXTENT_COLD, 1, NULL);
+
+	WARN_ON(hot && cold);
+
+	if (hot)
+		ret = 2;
+	else if (cold)
+		ret = 1;
+	else
+		WARN_ON(1);
+
+	return ret;
+}
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
new file mode 100644
index 0000000..b8427ba
--- /dev/null
+++ b/fs/btrfs/hot_relocate.h
@@ -0,0 +1,31 @@
+/*
+ * fs/btrfs/hot_relocate.h
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *	      Ben Chociej <bchociej@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_RELOCATE__
+#define __HOT_RELOCATE__
+
+#include <linux/hot_tracking.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+
+enum {
+	TYPE_ROT,       /* rot -> rotating */
+	TYPE_NONROT,    /* nonrot -> nonrotating */
+	MAX_RELOC_TYPES
+};
+
+void hot_set_extent(struct inode *inode, u64 start, u64 end,
+		struct extent_state **cached_state, int flag);
+int hot_get_chunk_type(struct inode *inode, u64 start, u64 end);
+
+#endif /* __HOT_RELOCATE__ */
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index d26f67a..a720135 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -497,10 +497,19 @@ again:
 	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
 					      prealloc, prealloc, &alloc_hint);
 	if (ret) {
-		btrfs_delalloc_release_space(inode, prealloc);
+		if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+			BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+			btrfs_delalloc_release_ssd_space(inode, prealloc);
+		} else
+			btrfs_delalloc_release_space(inode, prealloc);
 		goto out_put;
 	}
-	btrfs_free_reserved_data_space(inode, prealloc);
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+		btrfs_free_reserved_ssd_data_space(inode, prealloc);
+	} else
+		btrfs_free_reserved_data_space(inode, prealloc);
 
 	ret = btrfs_write_out_ino_cache(root, trans, path);
 out_put:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 09c58a3..77eda44 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -56,6 +56,7 @@
 #include "free-space-cache.h"
 #include "inode-map.h"
 #include "backref.h"
+#include "hot_relocate.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -857,13 +858,14 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
 {
 	u64 alloc_hint = 0;
 	u64 num_bytes;
-	unsigned long ram_size;
+	unsigned long ram_size, hot_flag = 0;
 	u64 disk_num_bytes;
 	u64 cur_alloc_size;
 	u64 blocksize = root->sectorsize;
 	struct btrfs_key ins;
 	struct extent_map *em;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	int chunk_type = 1;
 	int ret = 0;
 
 	BUG_ON(btrfs_is_free_space_inode(inode));
@@ -871,6 +873,7 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
 	num_bytes = ALIGN(end - start + 1, blocksize);
 	num_bytes = max(blocksize,  num_bytes);
 	disk_num_bytes = num_bytes;
+	ret = 0;
 
 	/* if this is a small write inside eof, kick off defrag */
 	if (num_bytes < 64 * 1024 &&
@@ -890,7 +893,8 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
 				     EXTENT_CLEAR_DELALLOC |
 				     EXTENT_CLEAR_DIRTY |
 				     EXTENT_SET_WRITEBACK |
-				     EXTENT_END_WRITEBACK);
+				     EXTENT_END_WRITEBACK |
+				     hot_flag);
 
 			*nr_written = *nr_written +
 			     (end - start + PAGE_CACHE_SIZE) / PAGE_CACHE_SIZE;
@@ -912,9 +916,25 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
 		unsigned long op;
 
 		cur_alloc_size = disk_num_bytes;
+
+		/*
+		 * Use COW operations to move hot data to SSD and cold data
+		 * back to rotating disk. chunk_type is set to 1 to allocate
+		 * from BTRFS_BLOCK_GROUP_DATA, or to 2 to allocate from
+		 * BTRFS_BLOCK_GROUP_DATA_SSD.
+		 */
+		if (btrfs_test_opt(root, HOT_MOVE)) {
+			chunk_type = hot_get_chunk_type(inode, start,
+						start + cur_alloc_size - 1);
+			if (chunk_type == 1)
+				hot_flag = EXTENT_CLEAR_COLD;
+			if (chunk_type == 2)
+				hot_flag = EXTENT_CLEAR_HOT;
+		}
+
 		ret = btrfs_reserve_extent(trans, root, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
-					   &ins, 1);
+					   &ins, chunk_type);
 		if (ret < 0) {
 			btrfs_abort_transaction(trans, root, ret);
 			goto out_unlock;
@@ -978,7 +998,7 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans,
 		 */
 		op = unlock ? EXTENT_CLEAR_UNLOCK_PAGE : 0;
 		op |= EXTENT_CLEAR_UNLOCK | EXTENT_CLEAR_DELALLOC |
-			EXTENT_SET_PRIVATE2;
+			EXTENT_SET_PRIVATE2 | hot_flag;
 
 		extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
 					     start, start + ram_size - 1,
@@ -1000,7 +1020,8 @@ out_unlock:
 		     EXTENT_CLEAR_DELALLOC |
 		     EXTENT_CLEAR_DIRTY |
 		     EXTENT_SET_WRITEBACK |
-		     EXTENT_END_WRITEBACK);
+		     EXTENT_END_WRITEBACK |
+		     hot_flag);
 
 	goto out;
 }
@@ -1593,8 +1614,12 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 			btrfs_delalloc_release_metadata(inode, len);
 
 		if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
-		    && do_list)
-			btrfs_free_reserved_data_space(inode, len);
+		    && do_list) {
+			if ((state->state & EXTENT_HOT) && (*bits & EXTENT_HOT))
+				btrfs_free_reserved_ssd_data_space(inode, len);
+			else
+				btrfs_free_reserved_data_space(inode, len);
+		}
 
 		__percpu_counter_add(&root->fs_info->delalloc_bytes, -len,
 				     root->fs_info->delalloc_batch);
@@ -1828,6 +1853,9 @@ again:
 		goto out;
 	 }
 
+	if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE))
+		hot_set_extent(inode, page_start, page_end, &cached_state, 1);
+
 	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
 	ClearPageChecked(page);
 	set_page_dirty(page);
@@ -4282,7 +4310,12 @@ int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
 again:
 	page = find_or_create_page(mapping, index, mask);
 	if (!page) {
-		btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+		if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+			BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+			btrfs_delalloc_release_ssd_space(inode,
+							PAGE_CACHE_SIZE);
+		} else
+			btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -4324,6 +4357,9 @@ again:
 			  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
 			  0, 0, &cached_state, GFP_NOFS);
 
+	if (btrfs_test_opt(root, HOT_MOVE))
+		hot_set_extent(inode, page_start, page_end, &cached_state, 0);
+
 	ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
 					&cached_state);
 	if (ret) {
@@ -4332,6 +4368,8 @@ again:
 		goto out_unlock;
 	}
 
+	BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+
 	if (offset != PAGE_CACHE_SIZE) {
 		if (!len)
 			len = PAGE_CACHE_SIZE - offset;
@@ -4349,8 +4387,14 @@ again:
 			     GFP_NOFS);
 
 out_unlock:
-	if (ret)
-		btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+	if (ret) {
+		if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+			BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+			btrfs_delalloc_release_ssd_space(inode,
+							PAGE_CACHE_SIZE);
+		} else
+			btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+	}
 	unlock_page(page);
 	page_cache_release(page);
 out:
@@ -7373,12 +7417,21 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
 			iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
 			btrfs_submit_direct, flags);
 	if (rw & WRITE) {
-		if (ret < 0 && ret != -EIOCBQUEUED)
-			btrfs_delalloc_release_space(inode, count);
-		else if (ret >= 0 && (size_t)ret < count)
-			btrfs_delalloc_release_space(inode,
+		if (ret < 0 && ret != -EIOCBQUEUED) {
+			if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+				BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+				btrfs_delalloc_release_ssd_space(inode, count);
+			} else
+				btrfs_delalloc_release_space(inode, count);
+		} else if (ret >= 0 && (size_t)ret < count) {
+			if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+				BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+				btrfs_delalloc_release_ssd_space(inode,
 						     count - (size_t)ret);
-		else
+			} else
+				btrfs_delalloc_release_space(inode,
+						     count - (size_t)ret);
+		} else
 			btrfs_delalloc_release_metadata(inode, 0);
 	}
 out:
@@ -7618,6 +7671,9 @@ again:
 			  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
 			  0, 0, &cached_state, GFP_NOFS);
 
+	if (btrfs_test_opt(root, HOT_MOVE))
+		hot_set_extent(inode, page_start, page_end, &cached_state, 0);
+
 	ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
 					&cached_state);
 	if (ret) {
@@ -7657,7 +7713,11 @@ out_unlock:
 	}
 	unlock_page(page);
 out:
-	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+		btrfs_delalloc_release_ssd_space(inode, PAGE_CACHE_SIZE);
+	} else
+		btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
 	sb_end_pagefault(inode->i_sb);
 	return ret;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2c02310..b9925fd 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -56,6 +56,7 @@
 #include "rcu-string.h"
 #include "send.h"
 #include "dev-replace.h"
+#include "hot_relocate.h"
 
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -1098,10 +1099,17 @@ again:
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->outstanding_extents++;
 		spin_unlock(&BTRFS_I(inode)->lock);
-		btrfs_delalloc_release_space(inode,
+		if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+			btrfs_delalloc_release_ssd_space(inode,
+				     (page_cnt - i_done) << PAGE_CACHE_SHIFT);
+		} else
+			btrfs_delalloc_release_space(inode,
 				     (page_cnt - i_done) << PAGE_CACHE_SHIFT);
 	}
 
+	if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE))
+		hot_set_extent(inode, page_start,
+				page_end - 1, &cached_state, 1);
 
 	set_extent_defrag(&BTRFS_I(inode)->io_tree, page_start, page_end - 1,
 			  &cached_state, GFP_NOFS);
@@ -1124,7 +1132,13 @@ out:
 		unlock_page(pages[i]);
 		page_cache_release(pages[i]);
 	}
-	btrfs_delalloc_release_space(inode, page_cnt << PAGE_CACHE_SHIFT);
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+		btrfs_delalloc_release_ssd_space(inode,
+				page_cnt << PAGE_CACHE_SHIFT);
+	} else
+		btrfs_delalloc_release_space(inode,
+				page_cnt << PAGE_CACHE_SHIFT);
 	return ret;
 
 }
@@ -3014,8 +3028,9 @@ long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
 	u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
 		       BTRFS_BLOCK_GROUP_SYSTEM,
 		       BTRFS_BLOCK_GROUP_METADATA,
-		       BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
-	int num_types = 4;
+		       BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA,
+		       BTRFS_BLOCK_GROUP_DATA_SSD};
+	int num_types = 5;
 	int alloc_size;
 	int ret = 0;
 	u64 slot_count = 0;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b67171e..5d44488 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -31,6 +31,7 @@
 #include "async-thread.h"
 #include "free-space-cache.h"
 #include "inode-map.h"
+#include "hot_relocate.h"
 
 /*
  * backref_node, mapping_node and tree_block start with this
@@ -2935,8 +2936,14 @@ int prealloc_file_extent_cluster(struct inode *inode,
 			break;
 		nr++;
 	}
-	btrfs_free_reserved_data_space(inode, cluster->end +
-				       1 - cluster->start);
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_HOT) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+		btrfs_free_reserved_ssd_data_space(inode,
+				cluster->end + 1 - cluster->start);
+	} else
+		btrfs_free_reserved_data_space(inode,
+				cluster->end + 1 - cluster->start);
 out:
 	mutex_unlock(&inode->i_mutex);
 	return ret;
@@ -3065,6 +3072,9 @@ static int relocate_file_extent_cluster(struct inode *inode,
 			nr++;
 		}
 
+		if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE))
+			hot_set_extent(inode, page_start, page_end, NULL, 1);
+
 		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
 		set_page_dirty(page);
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b1bab1c..bdd8850 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1527,7 +1527,8 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	mutex_lock(&fs_info->chunk_mutex);
 	rcu_read_lock();
 	list_for_each_entry_rcu(found, head, list) {
-		if (found->flags & BTRFS_BLOCK_GROUP_DATA) {
+		if ((found->flags & BTRFS_BLOCK_GROUP_DATA) ||
+			(found->flags & BTRFS_BLOCK_GROUP_DATA_SSD)) {
 			total_free_data += found->disk_total - found->disk_used;
 			total_free_data -=
 				btrfs_account_ro_block_groups_free_space(found);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2854c82..d516557 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1450,6 +1450,8 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		all_avail = root->fs_info->avail_data_alloc_bits |
 			    root->fs_info->avail_system_alloc_bits |
 			    root->fs_info->avail_metadata_alloc_bits;
+		if (btrfs_test_opt(root, HOT_MOVE))
+			all_avail |= root->fs_info->avail_data_ssd_alloc_bits;
 	} while (read_seqretry(&root->fs_info->profiles_lock, seq));
 
 	num_devices = root->fs_info->fs_devices->num_devices;
@@ -3736,7 +3738,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	devs_increment = btrfs_raid_array[index].devs_increment;
 	ncopies = btrfs_raid_array[index].ncopies;
 
-	if (type & BTRFS_BLOCK_GROUP_DATA) {
+	if (type & BTRFS_BLOCK_GROUP_DATA ||
+		type & BTRFS_BLOCK_GROUP_DATA_SSD) {
 		max_stripe_size = 1024 * 1024 * 1024;
 		max_chunk_size = 10 * max_stripe_size;
 	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
@@ -3775,9 +3778,30 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		struct btrfs_device *device;
 		u64 max_avail;
 		u64 dev_offset;
+		int dev_rot;
+		int skip = 0;
 
 		device = list_entry(cur, struct btrfs_device, dev_alloc_list);
 
+		/*
+		 * If HOT_MOVE is set, the chunk type being allocated
+		 * determines which disks the data may be allocated on.
+		 * This can cause problems if, for example, the data alloc
+		 * profile is RAID0 and there are only two devices, 1 SSD +
+		 * 1 HDD. All allocations to BTRFS_BLOCK_GROUP_DATA_SSD
+		 * in this config will return -ENOSPC as the allocation code
+		 * can't find allowable space for the second stripe.
+		 */
+		dev_rot = !blk_queue_nonrot(bdev_get_queue(device->bdev));
+		if (btrfs_test_opt(extent_root, HOT_MOVE)) {
+			int ret1 = type & (BTRFS_BLOCK_GROUP_DATA |
+				BTRFS_BLOCK_GROUP_METADATA |
+				BTRFS_BLOCK_GROUP_SYSTEM) && !dev_rot;
+			int ret2 = type & BTRFS_BLOCK_GROUP_DATA_SSD && dev_rot;
+			if (ret1 || ret2)
+				skip = 1;
+		}
+
 		cur = cur->next;
 
 		if (!device->writeable) {
@@ -3786,7 +3810,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
-		if (!device->in_fs_metadata ||
+		if (skip || !device->in_fs_metadata ||
 		    device->is_tgtdev_for_dev_replace)
 			continue;
 
-- 
1.7.11.7



* [RFC 3/5] btrfs: add one hot relocation kthread
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
  2013-05-06  8:53 ` [RFC 1/5] vfs: add one list_head field zwu.kernel
  2013-05-06  8:53 ` [RFC 2/5] btrfs: add one new block group zwu.kernel
@ 2013-05-06  8:53 ` zwu.kernel
  2013-05-06  8:53 ` [RFC 4/5] procfs: add three proc interfaces zwu.kernel
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: zwu.kernel @ 2013-05-06  8:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, idryomov, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add one private kthread for hot relocation. On each run, it
first checks whether any extents are hotter than the threshold
and queues them; if none are found, it returns and waits for
its next turn. Otherwise, it checks whether the SSD usage ratio
exceeds its upper threshold. If not, it directly relocates the
queued hot extents from the HDD disk to the SSD disk; if so, it
first finds and queues the extents with low temperature, then
relocates those cold extents back to the HDD disk to free space,
and finally relocates the hot extents from the HDD disk to the
SSD disk.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h        |   2 +
 fs/btrfs/hot_relocate.c | 720 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/hot_relocate.h |  21 ++
 fs/btrfs/super.c        |   1 +
 4 files changed, 742 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f4c4419..77d9b1c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1619,6 +1619,8 @@ struct btrfs_fs_info {
 	struct btrfs_dev_replace dev_replace;
 
 	atomic_t mutually_exclusive_operation_running;
+
+	void *hot_reloc;
 };
 
 /*
diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
index 1effd14..683e154 100644
--- a/fs/btrfs/hot_relocate.c
+++ b/fs/btrfs/hot_relocate.c
@@ -12,8 +12,46 @@
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
 #include "hot_relocate.h"
 
+/*
+ * Hot relocation strategy:
+ *
+ * The relocation code below operates on the heat map lists to identify
+ * hot or cold data logical file ranges that are candidates for relocation.
+ * The triggering mechanism for relocation is controlled by a global heat
+ * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * queued for relocation by the periodically executing relocate kthread,
+ * which updates the global heat threshold and responds to space pressure
+ * on the SSDs.
+ *
+ * The heat map lists index logical ranges by heat and provide a constant-time
+ * access path to hot or cold range items. The relocation kthread uses this
+ * path to find hot or cold items to move to/from SSD. To ensure that the
+ * relocation kthread has a chance to sleep, and to prevent thrashing between
+ * SSD and HDD, there is a configurable limit to how many ranges are moved per
+ * iteration of the kthread. This limit may be overrun in the case where space
+ * pressure requires that items be aggressively moved from SSD back to HDD.
+ *
+ * This still needs more resistance to thrashing and stronger (read: actual)
+ * guarantees that relocation operations won't -ENOSPC.
+ *
+ * The relocation code has introduced one new btrfs block group type:
+ * BTRFS_BLOCK_GROUP_DATA_SSD.
+ *
+ * When mkfs'ing a volume with the hot data relocation option, initial block
+ * groups are allocated to the proper disks. Runtime block group allocation
+ * only allocates BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_METADATA, and
+ * BTRFS_BLOCK_GROUP_SYSTEM to HDD, and likewise only allocates
+ * BTRFS_BLOCK_GROUP_DATA_SSD to SSD.
+ * (assuming, critically, the HOT_MOVE option is set at mount time).
+ */
+
 static void hot_set_extent_bits(struct extent_io_tree *tree, u64 start,
 		u64 end, struct extent_state **cached_state,
 		gfp_t mask, int storage_type, int flag)
@@ -26,10 +64,10 @@ static void hot_set_extent_bits(struct extent_io_tree *tree, u64 start,
 				EXTENT_DO_ACCOUNTING;
 	}
 
-	if (storage_type == ON_ROT_DISK) {
+	if (storage_type == TYPE_ROT) {
 		set_bits |= EXTENT_COLD;
 		clear_bits |= EXTENT_HOT;
-	} else if (storage_type == ON_NONROT_DISK) {
+	} else if (storage_type == TYPE_NONROT) {
 		set_bits |= EXTENT_HOT;
 		clear_bits |= EXTENT_COLD;
 	}
@@ -76,3 +114,681 @@ int hot_get_chunk_type(struct inode *inode, u64 start, u64 end)
 
 	return ret;
 }
+
+/*
+ * Returns SSD ratio that is full.
+ * If no SSD is found, returns THRESH_MAX_VALUE + 1.
+ */
+static int hot_calc_ssd_ratio(struct hot_reloc *hot_reloc)
+{
+	struct btrfs_space_info *info;
+	struct btrfs_device *device, *next;
+	struct btrfs_fs_info *fs_info = hot_reloc->fs_info;
+	u64 total_bytes = 0, bytes_used = 0;
+
+	/*
+	 * Iterate through the devices; for each nonrot device,
+	 * add its bytes to total_bytes.
+	 */
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	list_for_each_entry_safe(device, next,
+		&fs_info->fs_devices->devices, dev_list) {
+		if (blk_queue_nonrot(bdev_get_queue(device->bdev)))
+			total_bytes += device->total_bytes;
+	}
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+
+	if (total_bytes == 0)
+		return THRESH_MAX_VALUE + 1;
+
+	/*
+	 * Iterate through space_info; if the SSD data block group
+	 * is found, add the bytes used by that group to bytes_used.
+	 */
+	rcu_read_lock();
+	list_for_each_entry_rcu(info, &fs_info->space_info, list) {
+		if (info->flags & BTRFS_BLOCK_GROUP_DATA_SSD)
+			bytes_used += info->bytes_used;
+	}
+	rcu_read_unlock();
+
+	/* Finish up, return ratio of SSD filled. */
+	BUG_ON(bytes_used >= total_bytes);
+
+	return (int) div64_u64(bytes_used * 100, total_bytes);
+}
+
+/*
+ * Update heat threshold for hot relocation
+ * based on how full SSD drives are.
+ */
+static int hot_update_threshold(struct hot_reloc *hot_reloc,
+				int update)
+{
+	int thresh = hot_reloc->thresh;
+	int ratio = hot_calc_ssd_ratio(hot_reloc);
+
+	/* Update the global threshold only when requested or under pressure */
+	if (!update && ratio < HIGH_WATER_LEVEL)
+		return ratio;
+
+	if (unlikely(ratio > THRESH_MAX_VALUE))
+		thresh = HEAT_MAX_VALUE + 1;
+	else {
+		WARN_ON(HIGH_WATER_LEVEL > THRESH_MAX_VALUE
+			|| LOW_WATER_LEVEL < 0);
+
+		if (ratio >= HIGH_WATER_LEVEL)
+			thresh += THRESH_UP_SPEED;
+		else if (ratio <= LOW_WATER_LEVEL)
+			thresh -= THRESH_DOWN_SPEED;
+
+		if (thresh > HEAT_MAX_VALUE)
+			thresh = HEAT_MAX_VALUE + 1;
+		else if (thresh < 0)
+			thresh = 0;
+	}
+
+	hot_reloc->thresh = thresh;
+	return ratio;
+}
+
+static bool hot_can_relocate(struct inode *inode, u64 start,
+			u64 len, u64 *skip, u64 *end)
+{
+	struct extent_map *em = NULL;
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+	bool ret = true;
+
+	/*
+	 * Make sure that once we start relocating an extent,
+	 * we keep on relocating it
+	 */
+	if (start < *end)
+		return true;
+
+	*skip = 0;
+
+	/*
+	 * Hopefully we have this extent in the tree already,
+	 * try without the full extent lock
+	 */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, start, len);
+	read_unlock(&em_tree->lock);
+	if (!em) {
+		/* Get the big lock and read metadata off disk */
+		lock_extent(io_tree, start, start + len - 1);
+		em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
+		unlock_extent(io_tree, start, start + len - 1);
+		if (IS_ERR(em))
+			return false;
+	}
+
+	/* This will cover holes, and inline extents */
+	if (em->block_start >= EXTENT_MAP_LAST_BYTE)
+		ret = false;
+
+	if (ret) {
+		*end = extent_map_end(em);
+	} else {
+		*skip = extent_map_end(em);
+		*end = 0;
+	}
+
+	free_extent_map(em);
+	return ret;
+}
+
+static void hot_cleanup_relocq(struct list_head *bucket) {
+	struct hot_range_item *hr;
+	struct hot_comm_item *ci, *ci_next;
+
+	list_for_each_entry_safe(ci, ci_next, bucket, reloc_list) {
+		hr = container_of(ci, struct hot_range_item, hot_range);
+		list_del_init(&hr->hot_range.reloc_list);
+		hot_comm_item_put(ci);
+	}
+}
+
+static int hot_queue_extent(struct hot_reloc *hot_reloc,
+			struct list_head *bucket,
+			u64 *counter, int storage_type)
+{
+	struct hot_comm_item *ci;
+	struct hot_range_item *hr;
+	struct hot_inode_item *he;
+	int st, ret = 0;
+
+	/* Queue hot_ranges */
+	list_for_each_entry_rcu(ci, bucket, track_list) {
+		hot_comm_item_get(ci);
+		hr = container_of(ci, struct hot_range_item, hot_range);
+		he = hr->hot_inode;
+
+		/* Queue up on relocate list */
+		st = hr->storage_type;
+		if (st != storage_type) {
+			list_add_tail(&ci->reloc_list,
+				&hot_reloc->hot_relocq[storage_type]);
+			hot_comm_item_get(ci);
+			*counter = *counter + 1;
+		}
+
+		spin_lock(&he->i_lock);
+		hot_comm_item_put(ci);
+		spin_unlock(&he->i_lock);
+
+		if (*counter >= HOT_RELOC_MAX_ITEMS)
+			break;
+
+		if (kthread_should_stop()) {
+			ret = 1;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static u64 hot_search_extent(struct hot_reloc *hot_reloc,
+			int thresh, int storage_type)
+{
+	struct hot_info *root;
+	u64 counter = 0;
+	int i, ret = 0;
+
+	root = hot_reloc->fs_info->sb->s_hot_root;
+	for (i = HEAT_MAX_VALUE; i >= thresh; i--) {
+		rcu_read_lock();
+		if (!list_empty(&root->hot_map[TYPE_RANGE][i]))
+			ret = hot_queue_extent(hot_reloc,
+					&root->hot_map[TYPE_RANGE][i],
+					&counter, storage_type);
+		rcu_read_unlock();
+		if (ret) {
+			counter = 0;
+			break;
+		}
+	}
+
+	if (ret)
+		hot_cleanup_relocq(&hot_reloc->hot_relocq[storage_type]);
+
+	return counter;
+}
+
+static int hot_load_file_extent(struct inode *inode,
+			    struct page **pages,
+			    unsigned long start_index,
+			    int num_pages, int storage_type)
+{
+	unsigned long file_end;
+	int ret, i, i_done;
+	u64 isize = i_size_read(inode), page_start, page_end, page_cnt;
+	struct btrfs_ordered_extent *ordered;
+	struct extent_state *cached_state = NULL;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+
+	file_end = (isize - 1) >> PAGE_CACHE_SHIFT;
+	if (!isize || start_index > file_end)
+		return 0;
+
+	page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1);
+
+	if (storage_type == TYPE_NONROT)
+		BTRFS_I(inode)->flags |= BTRFS_INODE_HOT;
+	ret = btrfs_delalloc_reserve_space(inode, page_cnt << PAGE_CACHE_SHIFT);
+	if (storage_type == TYPE_NONROT)
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_HOT;
+	if (ret)
+		return ret;
+
+	i_done = 0;
+	/* step one, lock all the pages */
+	for (i = 0; i < page_cnt; i++) {
+		struct page *page;
+again:
+		page = find_or_create_page(inode->i_mapping,
+					   start_index + i, mask);
+		if (!page)
+			break;
+
+		page_start = page_offset(page);
+		page_end = page_start + PAGE_CACHE_SIZE - 1;
+		while (1) {
+			lock_extent(tree, page_start, page_end);
+			ordered = btrfs_lookup_ordered_extent(inode,
+							page_start);
+			unlock_extent(tree, page_start, page_end);
+			if (!ordered)
+				break;
+
+			unlock_page(page);
+			btrfs_start_ordered_extent(inode, ordered, 1);
+			btrfs_put_ordered_extent(ordered);
+			lock_page(page);
+			/*
+			 * we unlocked the page above, so we need to check if
+			 * it was released or not.
+			 */
+			if (page->mapping != inode->i_mapping) {
+				unlock_page(page);
+				page_cache_release(page);
+				goto again;
+			}
+		}
+
+		if (!PageUptodate(page)) {
+			btrfs_readpage(NULL, page);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EIO;
+				break;
+			}
+		}
+
+		if (page->mapping != inode->i_mapping) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto again;
+		}
+
+		pages[i] = page;
+		i_done++;
+	}
+	if (!i_done || ret)
+		goto out;
+
+	if (!(inode->i_sb->s_flags & MS_ACTIVE))
+		goto out;
+
+	page_start = page_offset(pages[0]);
+	page_end = page_offset(pages[i_done - 1]) + PAGE_CACHE_SIZE - 1;
+
+	lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
+
+	if (i_done != page_cnt) {
+		spin_lock(&BTRFS_I(inode)->lock);
+		BTRFS_I(inode)->outstanding_extents++;
+		spin_unlock(&BTRFS_I(inode)->lock);
+
+		if (storage_type == TYPE_NONROT)
+			btrfs_delalloc_release_ssd_space(inode,
+				(page_cnt - i_done) << PAGE_CACHE_SHIFT);
+		else if (storage_type == TYPE_ROT)
+			btrfs_delalloc_release_space(inode,
+				(page_cnt - i_done) << PAGE_CACHE_SHIFT);
+	}
+
+	hot_set_extent_bits(tree, page_start, page_end,
+			&cached_state, GFP_NOFS, storage_type, 1);
+	unlock_extent_cached(tree, page_start, page_end,
+			&cached_state, GFP_NOFS);
+
+	for (i = 0; i < i_done; i++) {
+		clear_page_dirty_for_io(pages[i]);
+		ClearPageChecked(pages[i]);
+		set_page_extent_mapped(pages[i]);
+		set_page_dirty(pages[i]);
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+
+	/*
+	 * so now we have a nice long stream of locked
+	 * and up to date pages, let's wait on them
+	 */
+	for (i = 0; i < i_done; i++)
+		wait_on_page_writeback(pages[i]);
+
+	return i_done;
+out:
+	for (i = 0; i < i_done; i++) {
+		unlock_page(pages[i]);
+		page_cache_release(pages[i]);
+	}
+
+	if (storage_type == TYPE_NONROT)
+		btrfs_delalloc_release_ssd_space(inode,
+				page_cnt << PAGE_CACHE_SHIFT);
+	else if (storage_type == TYPE_ROT)
+		btrfs_delalloc_release_space(inode,
+				page_cnt << PAGE_CACHE_SHIFT);
+
+	return ret;
+}
+
+/*
+ * Relocate data to SSD or spinning drive based on past location
+ * and load the file into page cache and marks pages as dirty.
+ *
+ * based on defrag ioctl
+ */
+static int hot_relocate_extent(struct hot_range_item *hr,
+			struct hot_reloc *hot_reloc,
+			int storage_type)
+{
+	struct hot_inode_item *he = hr->hot_inode;
+	struct btrfs_root *root = hot_reloc->fs_info->fs_root;
+	struct inode *inode;
+	struct file_ra_state *ra = NULL;
+	struct btrfs_key key;
+	u64 isize, last_len = 0, skip = 0, end = 0;
+	unsigned long i, last, ra_index = 0;
+	int ret = -ENOENT, count = 0, new = 0;
+	int max_cluster = (256 * 1024) >> PAGE_CACHE_SHIFT;
+	int cluster = max_cluster;
+	struct page **pages = NULL;
+
+	hot_comm_item_get(&hr->hot_range);
+
+	key.objectid = hr->hot_inode->i_ino;
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+	inode = btrfs_iget(root->fs_info->sb, &key, root, &new);
+	if (IS_ERR(inode))
+		goto out;
+	else if (is_bad_inode(inode))
+		goto out_inode;
+
+	isize = i_size_read(inode);
+	if (isize == 0) {
+		ret = 0;
+		goto out_inode;
+	}
+
+	ra = kzalloc(sizeof(*ra), GFP_NOFS);
+	if (!ra) {
+		ret = -ENOMEM;
+		goto out_inode;
+	} else {
+		file_ra_state_init(ra, inode->i_mapping);
+	}
+
+	pages = kmalloc(sizeof(struct page *) * max_cluster,
+			GFP_NOFS);
+	if (!pages) {
+		ret = -ENOMEM;
+		goto out_ra;
+	}
+
+	/* find the last page */
+	if (hr->start + hr->len > hr->start) {
+		last = min_t(u64, isize - 1,
+			 hr->start + hr->len - 1) >> PAGE_CACHE_SHIFT;
+	} else {
+		last = (isize - 1) >> PAGE_CACHE_SHIFT;
+	}
+
+	i = hr->start >> PAGE_CACHE_SHIFT;
+
+	/*
+	 * make writeback start from i, so the range can be
+	 * written sequentially.
+	 */
+	if (i < inode->i_mapping->writeback_index)
+		inode->i_mapping->writeback_index = i;
+
+	while (i <= last && count < last + 1 &&
+	       (i < (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+		PAGE_CACHE_SHIFT)) {
+		/*
+		 * make sure we stop running if someone unmounts
+		 * the FS
+		 */
+		if (!(inode->i_sb->s_flags & MS_ACTIVE))
+			break;
+
+		if (signal_pending(current)) {
+			printk(KERN_DEBUG "btrfs: hot relocation cancelled\n");
+			break;
+		}
+
+		if (!hot_can_relocate(inode, (u64)i << PAGE_CACHE_SHIFT,
+				 PAGE_CACHE_SIZE, &skip, &end)) {
+			unsigned long next;
+			/*
+			 * the function tells us how much to skip;
+			 * bump our counter by the suggested amount.
+			 */
+			next = (skip + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+			i = max(i + 1, next);
+			continue;
+		}
+
+		cluster = (PAGE_CACHE_ALIGN(end) >> PAGE_CACHE_SHIFT) - i;
+		cluster = min(cluster, max_cluster);
+
+		if (i + cluster > ra_index) {
+			ra_index = max(i, ra_index);
+			btrfs_force_ra(inode->i_mapping, ra, NULL, ra_index,
+				       cluster);
+			ra_index += max_cluster;
+		}
+
+		mutex_lock(&inode->i_mutex);
+		ret = hot_load_file_extent(inode, pages,
+					i, cluster, storage_type);
+		if (ret < 0) {
+			mutex_unlock(&inode->i_mutex);
+			goto out_ra;
+		}
+
+		count += ret;
+		balance_dirty_pages_ratelimited(inode->i_mapping);
+		mutex_unlock(&inode->i_mutex);
+
+		if (ret > 0) {
+			i += ret;
+			last_len += ret << PAGE_CACHE_SHIFT;
+		} else {
+			i++;
+			last_len = 0;
+		}
+	}
+
+	ret = count;
+	if (ret > 0)
+		hr->storage_type = storage_type;
+
+out_ra:
+	kfree(ra);
+	kfree(pages);
+out_inode:
+	iput(inode);
+out:
+	spin_lock(&he->i_lock);
+	hot_comm_item_put(&hr->hot_range);
+	spin_unlock(&he->i_lock);
+
+	list_del_init(&hr->hot_range.reloc_list);
+
+	spin_lock(&he->i_lock);
+	hot_comm_item_put(&hr->hot_range);
+	spin_unlock(&he->i_lock);
+
+	return ret;
+}
+
+/*
+ * Main function iterates through heat map table and
+ * finds hot and cold data to move based on SSD pressure.
+ *
+ * First iterates through cold items below the heat
+ * threshold, if the item is on SSD and is now cold,
+ * we queue it up for relocation back to spinning disk.
+ * After scanning these items, we call relocation code
+ * on all ranges that have been queued up for moving
+ * to HDD.
+ *
+ * We then iterate through items above the heat threshold
+ * and if they are on HDD we queue them up to be moved to
+ * SSD. We then iterate through queue and move hot ranges
+ * to SSD if they are not already.
+ */
+void hot_do_relocate(struct hot_reloc *hot_reloc)
+{
+	struct hot_info *root;
+	struct hot_range_item *hr;
+	struct hot_comm_item *ci, *ci_next;
+	int i, ret = 0, thresh, ratio = 0;
+	u64 count, count_to_cold, count_to_hot;
+	static u32 run = 1;
+
+	run++;
+	ratio = hot_update_threshold(hot_reloc, !(run % 15));
+	thresh = hot_reloc->thresh;
+
+	INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
+
+	/* Check and queue hot extents */
+	count_to_hot = hot_search_extent(hot_reloc,
+					thresh, TYPE_NONROT);
+	if (count_to_hot == 0)
+		return;
+
+	count_to_cold = HOT_RELOC_MAX_ITEMS;
+
+	/* Don't move cold data to HDD unless there's space pressure */
+	if (ratio < HIGH_WATER_LEVEL)
+		goto do_hot_reloc;
+
+	INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_ROT]);
+
+	/*
+	 * Move up to HOT_RELOC_MAX_ITEMS cold ranges back to spinning
+	 * disk. First, queue up items to move on the hot_relocq[TYPE_ROT].
+	 */
+	root = hot_reloc->fs_info->sb->s_hot_root;
+	for (count = 0, count_to_cold = 0; (count < thresh) &&
+		(count_to_cold < count_to_hot); count++) {
+		rcu_read_lock();
+		if (!list_empty(&root->hot_map[TYPE_RANGE][count]))
+			ret = hot_queue_extent(hot_reloc,
+					&root->hot_map[TYPE_RANGE][count],
+					&count_to_cold, TYPE_ROT);
+		rcu_read_unlock();
+		if (ret)
+			goto relocq_clean;
+	}
+
+	/* Do the hot -> cold relocation */
+	count_to_cold = 0;
+	list_for_each_entry_safe(ci, ci_next,
+			&hot_reloc->hot_relocq[TYPE_ROT], reloc_list) {
+		hr = container_of(ci, struct hot_range_item, hot_range);
+		ret = hot_relocate_extent(hr, hot_reloc, TYPE_ROT);
+		if ((ret == -ENOSPC) || kthread_should_stop())
+			goto relocq_clean;
+		else if (ret > 0)
+			count_to_cold++;
+	}
+
+	/*
+	 * Move up to HOT_RELOC_MAX_ITEMS ranges to SSD. Periodically check
+	 * for space pressure on SSD and directly return if we've exceeded
+	 * the SSD capacity high water mark.
+	 * First, queue up items to move on hot_relocq[TYPE_NONROT].
+	 */
+do_hot_reloc:
+	/* Do the cold -> hot relocation */
+	count_to_hot = 0;
+	list_for_each_entry_safe(ci, ci_next,
+			&hot_reloc->hot_relocq[TYPE_NONROT], reloc_list) {
+		hr = container_of(ci, struct hot_range_item, hot_range);
+		ret = hot_relocate_extent(hr, hot_reloc, TYPE_NONROT);
+		if ((ret == -ENOSPC) || (count_to_hot >= count_to_cold) ||
+			kthread_should_stop())
+			goto relocq_clean;
+		else if (ret > 0)
+			count_to_hot++;
+
+		/*
+		 * If we've exceeded the SSD capacity high water mark,
+		 * directly return.
+		 */
+		if ((count_to_hot != 0) && count_to_hot % 30 == 0) {
+			ratio = hot_update_threshold(hot_reloc, 1);
+			if (ratio >= HIGH_WATER_LEVEL)
+				goto relocq_clean;
+		}
+	}
+
+	return;
+
+relocq_clean:
+	for (i = 0; i < MAX_RELOC_TYPES; i++)
+		hot_cleanup_relocq(&hot_reloc->hot_relocq[i]);
+}
+
+/* Main loop for the relocation kthread */
+static int hot_relocate_kthread(void *arg)
+{
+	struct hot_reloc *hot_reloc = arg;
+	unsigned long delay;
+
+	do {
+		delay = HZ * HOT_RELOC_INTERVAL;
+		if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
+			hot_do_relocate(hot_reloc);
+			mutex_unlock(&hot_reloc->hot_reloc_mutex);
+		}
+
+		if (!try_to_freeze()) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			if (!kthread_should_stop())
+				schedule_timeout(delay);
+			__set_current_state(TASK_RUNNING);
+		}
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
+/* Kick off the relocation kthread */
+int hot_relocate_init(struct btrfs_fs_info *fs_info)
+{
+	int i, ret = 0;
+	struct hot_reloc *hot_reloc;
+
+	hot_reloc = kzalloc(sizeof(*hot_reloc), GFP_NOFS);
+	if (!hot_reloc) {
+		printk(KERN_ERR "%s: Failed to allocate memory for "
+				"hot_reloc\n", __func__);
+		return -ENOMEM;
+	}
+
+	fs_info->hot_reloc = hot_reloc;
+	hot_reloc->fs_info = fs_info;
+	hot_reloc->thresh = HOT_RELOC_THRESHOLD;
+	for (i = 0; i < MAX_RELOC_TYPES; i++)
+		INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
+	mutex_init(&hot_reloc->hot_reloc_mutex);
+
+	hot_reloc->hot_reloc_kthread = kthread_run(hot_relocate_kthread,
+				hot_reloc, "hot_relocate_kthread");
+	if (IS_ERR(hot_reloc->hot_reloc_kthread)) {
+		ret = PTR_ERR(hot_reloc->hot_reloc_kthread);
+		fs_info->hot_reloc = NULL;
+		kfree(hot_reloc);
+	}
+
+	return ret;
+}
+
+void hot_relocate_exit(struct btrfs_fs_info *fs_info)
+{
+	struct hot_reloc *hot_reloc = fs_info->hot_reloc;
+
+	if (hot_reloc->hot_reloc_kthread)
+		kthread_stop(hot_reloc->hot_reloc_kthread);
+
+	kfree(hot_reloc);
+	fs_info->hot_reloc = NULL;
+}
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
index b8427ba..077d9b3 100644
--- a/fs/btrfs/hot_relocate.h
+++ b/fs/btrfs/hot_relocate.h
@@ -24,8 +24,29 @@ enum {
 	MAX_RELOC_TYPES
 };
 
+#define HOT_RELOC_INTERVAL  120
+#define HOT_RELOC_THRESHOLD 150
+#define HOT_RELOC_MAX_ITEMS 250
+
+#define HEAT_MAX_VALUE    (MAP_SIZE - 1)
+#define HIGH_WATER_LEVEL  75 /* when to raise the threshold */
+#define LOW_WATER_LEVEL   50 /* when to lower the threshold */
+#define THRESH_UP_SPEED   10 /* how much to raise it by */
+#define THRESH_DOWN_SPEED 1  /* how much to lower it by */
+#define THRESH_MAX_VALUE  100
+
+struct hot_reloc {
+	struct btrfs_fs_info *fs_info;
+	struct list_head hot_relocq[MAX_RELOC_TYPES];
+	int thresh;
+	struct task_struct *hot_reloc_kthread;
+	struct mutex hot_reloc_mutex;
+};
+
 void hot_set_extent(struct inode *inode, u64 start, u64 end,
 		struct extent_state **cached_state, int flag);
 int hot_get_chunk_type(struct inode *inode, u64 start, u64 end);
+int hot_relocate_init(struct btrfs_fs_info *fs_info);
+void hot_relocate_exit(struct btrfs_fs_info *fs_info);
 
 #endif /* __HOT_RELOCATE__ */
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index bdd8850..4cbd0de 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -57,6 +57,7 @@
 #include "compression.h"
 #include "rcu-string.h"
 #include "dev-replace.h"
+#include "hot_relocate.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 4/5] procfs: add three proc interfaces
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
                   ` (2 preceding siblings ...)
  2013-05-06  8:53 ` [RFC 3/5] btrfs: add one hot relocation kthread zwu.kernel
@ 2013-05-06  8:53 ` zwu.kernel
  2013-05-06  8:53 ` [RFC 5/5] btrfs: add hot relocation support zwu.kernel
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: zwu.kernel @ 2013-05-06  8:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, idryomov, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add three proc interfaces, hot-reloc-interval, hot-reloc-threshold,
and hot-reloc-max-items, under /proc/sys/fs/ in order to make
HOT_RELOC_INTERVAL, HOT_RELOC_THRESHOLD, and HOT_RELOC_MAX_ITEMS
tunable at runtime.
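On a kernel with this patch applied, the knobs can then be inspected and changed at runtime, for example (values are illustrative):

```shell
# Show the current heat threshold (default 150)
cat /proc/sys/fs/hot-reloc-threshold

# Run the relocation pass every 30 seconds instead of the default 120
echo 30 > /proc/sys/fs/hot-reloc-interval

# Queue at most 100 ranges per relocation pass
echo 100 > /proc/sys/fs/hot-reloc-max-items
```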

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/hot_relocate.c | 26 +++++++++++++++++---------
 fs/btrfs/hot_relocate.h |  4 ----
 include/linux/btrfs.h   |  4 ++++
 kernel/sysctl.c         | 22 ++++++++++++++++++++++
 4 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
index 683e154..aa8c9f0 100644
--- a/fs/btrfs/hot_relocate.c
+++ b/fs/btrfs/hot_relocate.c
@@ -25,7 +25,7 @@
  * The relocation code below operates on the heat map lists to identify
  * hot or cold data logical file ranges that are candidates for relocation.
  * The triggering mechanism for relocation is controlled by a global heat
- * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * threshold integer value (sysctl_hot_reloc_threshold). Ranges are
  * queued for relocation by the periodically executing relocate kthread,
  * which updates the global heat threshold and responds to space pressure
  * on the SSDs.
@@ -52,6 +52,15 @@
  * (assuming, critically, the HOT_MOVE option is set at mount time).
  */
 
+int sysctl_hot_reloc_threshold = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_threshold);
+
+int sysctl_hot_reloc_interval __read_mostly = 120;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_interval);
+
+int sysctl_hot_reloc_max_items __read_mostly = 250;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_max_items);
+
 static void hot_set_extent_bits(struct extent_io_tree *tree, u64 start,
 		u64 end, struct extent_state **cached_state,
 		gfp_t mask, int storage_type, int flag)
@@ -165,7 +174,7 @@ static int hot_calc_ssd_ratio(struct hot_reloc *hot_reloc)
 static int hot_update_threshold(struct hot_reloc *hot_reloc,
 				int update)
 {
-	int thresh = hot_reloc->thresh;
+	int thresh = sysctl_hot_reloc_threshold;
 	int ratio = hot_calc_ssd_ratio(hot_reloc);
 
 	/* Sometimes update global threshold, others not */
@@ -189,7 +198,7 @@ static int hot_update_threshold(struct hot_reloc *hot_reloc,
 			thresh = 0;
 	}
 
-	hot_reloc->thresh = thresh;
+	sysctl_hot_reloc_threshold = thresh;
 	return ratio;
 }
 
@@ -280,7 +289,7 @@ static int hot_queue_extent(struct hot_reloc *hot_reloc,
 		hot_comm_item_put(ci);
 		spin_unlock(&he->i_lock);
 
-		if (*counter >= HOT_RELOC_MAX_ITEMS)
+		if (*counter >= sysctl_hot_reloc_max_items)
 			break;
 
 		if (kthread_should_stop()) {
@@ -361,7 +370,7 @@ again:
 		while (1) {
 			lock_extent(tree, page_start, page_end);
 			ordered = btrfs_lookup_ordered_extent(inode,
-							page_start);
+							      page_start);
 			unlock_extent(tree, page_start, page_end);
 			if (!ordered)
 				break;
@@ -642,7 +651,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
 
 	run++;
 	ratio = hot_update_threshold(hot_reloc, !(run % 15));
-	thresh = hot_reloc->thresh;
+	thresh = sysctl_hot_reloc_threshold;
 
 	INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
 
@@ -652,7 +661,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
 	if (count_to_hot == 0)
 		return;
 
-	count_to_cold = HOT_RELOC_MAX_ITEMS;
+	count_to_cold = sysctl_hot_reloc_max_items;
 
 	/* Don't move cold data to HDD unless there's space pressure */
 	if (ratio < HIGH_WATER_LEVEL)
@@ -734,7 +743,7 @@ static int hot_relocate_kthread(void *arg)
 	unsigned long delay;
 
 	do {
-		delay = HZ * HOT_RELOC_INTERVAL;
+		delay = HZ * sysctl_hot_reloc_interval;
 		if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
 			hot_do_relocate(hot_reloc);
 			mutex_unlock(&hot_reloc->hot_reloc_mutex);
@@ -766,7 +775,6 @@ int hot_relocate_init(struct btrfs_fs_info *fs_info)
 
 	fs_info->hot_reloc = hot_reloc;
 	hot_reloc->fs_info = fs_info;
-	hot_reloc->thresh = HOT_RELOC_THRESHOLD;
 	for (i = 0; i < MAX_RELOC_TYPES; i++)
 		INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
 	mutex_init(&hot_reloc->hot_reloc_mutex);
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
index 077d9b3..ca30944 100644
--- a/fs/btrfs/hot_relocate.h
+++ b/fs/btrfs/hot_relocate.h
@@ -24,9 +24,6 @@ enum {
 	MAX_RELOC_TYPES
 };
 
-#define HOT_RELOC_INTERVAL  120
-#define HOT_RELOC_THRESHOLD 150
-#define HOT_RELOC_MAX_ITEMS 250
 
 #define HEAT_MAX_VALUE    (MAP_SIZE - 1)
 #define HIGH_WATER_LEVEL  75 /* when to raise the threshold */
@@ -38,7 +35,6 @@ enum {
 struct hot_reloc {
 	struct btrfs_fs_info *fs_info;
 	struct list_head hot_relocq[MAX_RELOC_TYPES];
-	int thresh;
 	struct task_struct *hot_reloc_kthread;
 	struct mutex hot_reloc_mutex;
 };
diff --git a/include/linux/btrfs.h b/include/linux/btrfs.h
index 22d7991..7179819 100644
--- a/include/linux/btrfs.h
+++ b/include/linux/btrfs.h
@@ -3,4 +3,8 @@
 
 #include <uapi/linux/btrfs.h>
 
+extern int sysctl_hot_reloc_threshold;
+extern int sysctl_hot_reloc_interval;
+extern int sysctl_hot_reloc_max_items;
+
 #endif /* _LINUX_BTRFS_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 11e4a3d..c1db57e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -62,6 +62,7 @@
 #include <linux/capability.h>
 #include <linux/binfmts.h>
 #include <linux/sched/sysctl.h>
+#include <linux/btrfs.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1617,6 +1618,27 @@ static struct ctl_table fs_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "hot-reloc-threshold",
+		.data		= &sysctl_hot_reloc_threshold,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "hot-reloc-interval",
+		.data		= &sysctl_hot_reloc_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "hot-reloc-max-items",
+		.data		= &sysctl_hot_reloc_max_items,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
-- 
1.7.11.7



* [RFC 5/5] btrfs: add hot relocation support
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
                   ` (3 preceding siblings ...)
  2013-05-06  8:53 ` [RFC 4/5] procfs: add three proc interfaces zwu.kernel
@ 2013-05-06  8:53 ` zwu.kernel
  2013-05-06 20:36 ` [RFC 0/5] BTRFS " Kai Krakow
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: zwu.kernel @ 2013-05-06  8:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, idryomov, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add a new mount option '-o hot_move' to enable hot
relocation support. When hot relocation is enabled,
hot tracking is enabled automatically.
  Its usage looks like:
    mount -o hot_move
    mount -o nouser,hot_move
    mount -o nouser,hot_move,loop
    mount -o hot_move,nouser

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/super.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4cbd0de..b342f6f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -311,8 +311,13 @@ static void btrfs_put_super(struct super_block *sb)
 	 * process...  Whom would you report that to?
 	 */
 
+	/* Hot data relocation */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE))
+		hot_relocate_exit(btrfs_sb(sb));
+
 	/* Hot data tracking */
-	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE)
+		|| btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
 		hot_track_exit(sb);
 }
 
@@ -327,7 +332,7 @@ enum {
 	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
 	Opt_check_integrity, Opt_check_integrity_including_extent_data,
 	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
-	Opt_err,
+	Opt_hot_move, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -368,6 +373,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_hot_track, "hot_track"},
+	{Opt_hot_move, "hot_move"},
 	{Opt_err, NULL},
 };
 
@@ -636,6 +642,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_hot_track:
 			btrfs_set_opt(info->mount_opt, HOT_TRACK);
 			break;
+		case Opt_hot_move:
+			btrfs_set_opt(info->mount_opt, HOT_MOVE);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -863,17 +872,26 @@ static int btrfs_fill_super(struct super_block *sb,
 		goto fail_close;
 	}
 
-	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+	if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)
+		|| btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
 		err = hot_track_init(sb);
 		if (err)
 			goto fail_hot;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)) {
+		err = hot_relocate_init(fs_info);
+		if (err)
+			goto fail_reloc;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_reloc:
+	hot_track_exit(sb);
 fail_hot:
 	dput(sb->s_root);
 	sb->s_root = NULL;
@@ -974,6 +992,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",fatal_errors=panic");
 	if (btrfs_test_opt(root, HOT_TRACK))
 		seq_puts(seq, ",hot_track");
+	if (btrfs_test_opt(root, HOT_MOVE))
+		seq_puts(seq, ",hot_move");
 	return 0;
 }
 
-- 
1.7.11.7



* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
                   ` (4 preceding siblings ...)
  2013-05-06  8:53 ` [RFC 5/5] btrfs: add hot relocation support zwu.kernel
@ 2013-05-06 20:36 ` Kai Krakow
  2013-05-07  5:17   ` Tomasz Torcz
  2013-05-07 21:35 ` Gabriel de Perthuis
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Kai Krakow @ 2013-05-06 20:36 UTC (permalink / raw)
  To: linux-btrfs

zwu.kernel@gmail.com <zwu.kernel@gmail.com> schrieb:

>   The patchset is trying to introduce hot relocation support
> for BTRFS. In hybrid storage environment, when the data in
> HDD disk get hot, it can be relocated to SSD disk by BTRFS
> hot relocation support automatically; also, if SSD disk ratio
> exceed its upper threshold, the data which get cold can be
> looked up and relocated to HDD disk to make more space in SSD
> disk at first, and then the data which get hot will be relocated
> to SSD disk automatically.

How will it compare to bcache? I'm currently thinking about buying an SSD,
but bcache requires some effort to migrate the storage onto it. And after
all those hassles I am not even sure it would work easily with a
dracut-generated initramfs.

Bcache seems to be quite clever with its approach. This one looks completely
different and more targeted at relocating data which is used often, instead
of trying to reduce head movement. I'm quite happy with the throughput of my
3x HDD btrfs pool (according to bootchart, up to 600 MB/s during boot). A
single SSD would be slower, since head movement seems not to be the issue
during boot. Will this patch relocate such data? Or does it try to relocate
only data which requires random head movement?

Thanks,
Kai



* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-06 20:36 ` [RFC 0/5] BTRFS " Kai Krakow
@ 2013-05-07  5:17   ` Tomasz Torcz
  2013-05-07 21:17     ` Kai Krakow
  0 siblings, 1 reply; 27+ messages in thread
From: Tomasz Torcz @ 2013-05-07  5:17 UTC (permalink / raw)
  To: linux-btrfs

On Mon, May 06, 2013 at 10:36:03PM +0200, Kai Krakow wrote:
> zwu.kernel@gmail.com <zwu.kernel@gmail.com> schrieb:
> 
> >   The patchset is trying to introduce hot relocation support
> > for BTRFS. In hybrid storage environment, when the data in
> > HDD disk get hot, it can be relocated to SSD disk by BTRFS
> > hot relocation support automatically; also, if SSD disk ratio
> > exceed its upper threshold, the data which get cold can be
> > looked up and relocated to HDD disk to make more space in SSD
> > disk at first, and then the data which get hot will be relocated
> > to SSD disk automatically.
> 
> How will it compare to bcache? I'm currently thinking about buying an SSD 
> but bcache requires some efforts in migrating the storage to use. And after 
> all those hassles I am even not sure if it would work easily with a dracut 
> generated initramfs.

  On a side note: dm-cache, which is already in-kernel, does not need
the backing storage to be reformatted.

-- 
Tomasz Torcz                Only gods can safely risk perfection,
xmpp: zdzichubg@chrome.pl     it's a dangerous thing for a man.  -- Alia



* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-07  5:17   ` Tomasz Torcz
@ 2013-05-07 21:17     ` Kai Krakow
  0 siblings, 0 replies; 27+ messages in thread
From: Kai Krakow @ 2013-05-07 21:17 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Torcz <tomek@pipebreaker.pl> schrieb:

>> How will it compare to bcache? I'm currently thinking about buying an SSD
>> but bcache requires some efforts in migrating the storage to use. And
>> after all those hassles I am even not sure if it would work easily with a
>> dracut generated initramfs.
> 
>   On the side note: dm-cache, which is already in-kernel, do not need to
> reformat backing storage.

Oh thanks, good pointer. I haven't looked so much into dm-cache yet because 
it seemed to have come "out of nowhere" while I was tracking news around 
bcache development for a while now.

Regards,
Kai



* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
                   ` (5 preceding siblings ...)
  2013-05-06 20:36 ` [RFC 0/5] BTRFS " Kai Krakow
@ 2013-05-07 21:35 ` Gabriel de Perthuis
  2013-05-07 21:58   ` Kai Krakow
  2013-05-08 23:13 ` Zhi Yong Wu
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Gabriel de Perthuis @ 2013-05-07 21:35 UTC (permalink / raw)
  To: Tomasz Torcz, linux-btrfs

>> How will it compare to bcache? I'm currently thinking about buying an SSD 
>> but bcache requires some efforts in migrating the storage to use. And after 
>> all those hassles I am even not sure if it would work easily with a dracut 
>> generated initramfs.

>   On the side note: dm-cache, which is already in-kernel, do not need to
> reformat backing storage.

On the other hand dm-cache is somewhat complex to assemble, and letting
the system automount the unsynchronised backing device is a recipe for
data loss.

It will need lvm integration to become really convenient to use.

Anyway, here's a shameless plug for a tool that converts to bcache
in-place:  https://github.com/g2p/blocks#bcache-conversion



* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-07 21:35 ` Gabriel de Perthuis
@ 2013-05-07 21:58   ` Kai Krakow
  2013-05-07 22:27     ` Gabriel de Perthuis
  0 siblings, 1 reply; 27+ messages in thread
From: Kai Krakow @ 2013-05-07 21:58 UTC (permalink / raw)
  To: linux-btrfs

Gabriel de Perthuis <g2p.code@gmail.com> schrieb:

>>> How will it compare to bcache? I'm currently thinking about buying an
>>> SSD but bcache requires some efforts in migrating the storage to use.
>>> And after all those hassles I am even not sure if it would work easily
>>> with a dracut generated initramfs.
> 
>>   On the side note: dm-cache, which is already in-kernel, do not need to
>> reformat backing storage.
> 
> On the other hand dm-cache is somewhat complex to assemble, and letting
> the system automount the unsynchronised backing device is a recipe for
> data loss.

Yes, that was my first impression, too, after reading of how it works. How 
safe is bcache on that matter?

> Anyway, here's a shameless plug for a tool that converts to bcache
> in-place:  https://github.com/g2p/blocks#bcache-conversion

Did I say: I love your shameless plugs? ;-)

I've read the docs for this tool with interest. Still I do not feel very 
comfortable with converting my storage for some unknown outcome. Sure, I can 
take backups (and by any means: I will). But it takes time: backup, try, 
restore, try again, maybe restore... I don't want to find out that it was 
all useless because it's just not ready to boot a multi-device btrfs through 
dracut. So you see, the point is: Will that work? I didn't see any docs 
answering my questions.

Of course, if it would work I'd happily contribute documentation to your 
project.

Regards,
Kai



* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-07 21:58   ` Kai Krakow
@ 2013-05-07 22:27     ` Gabriel de Perthuis
  0 siblings, 0 replies; 27+ messages in thread
From: Gabriel de Perthuis @ 2013-05-07 22:27 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 07 May 2013 23:58:08 +0200, Kai Krakow wrote:
> Gabriel de Perthuis <g2p.code@gmail.com> schrieb:
>>>   On the side note: dm-cache, which is already in-kernel, do not need to
>>> reformat backing storage.
>> 
>> On the other hand dm-cache is somewhat complex to assemble, and letting
>> the system automount the unsynchronised backing device is a recipe for
>> data loss.
> 
> Yes, that was my first impression, too, after reading of how it works. How 
> safe is bcache on that matter?

The bcache superblock is there just to prevent the naked backing device
from becoming available.  So it's safe in that respect.  LVM has
something similar with hidden volumes.

>> Anyway, here's a shameless plug for a tool that converts to bcache
>> in-place:  https://github.com/g2p/blocks#bcache-conversion
> 
> Did I say: I love your shameless plugs? ;-)
> 
> I've read the docs for this tool with interest. Still I do not feel very 
> comfortable with converting my storage for some unknown outcome. Sure, I can 
> take backups (and by any means: I will). But it takes time: backup, try, 
> restore, try again, maybe restore... I don't want to find out that it was 
> all useless because it's just not ready to boot a multi-device btrfs through 
> dracut. So you see, the point is: Will that work? I didn't see any docs 
> answering my questions.

Try it with a throwaway filesystem inside a VM.  The bcache list will
appreciate the feedback on Dracut, even if you don't make the switch
for real.

> Of course, if it would work I'd happily contribute documentation to your 
> project.

That would be very welcome.




* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
                   ` (6 preceding siblings ...)
  2013-05-07 21:35 ` Gabriel de Perthuis
@ 2013-05-08 23:13 ` Zhi Yong Wu
  2013-05-09  6:30   ` Stefan Behrens
                     ` (2 more replies)
  2013-05-09  7:17 ` Gabriel de Perthuis
  2013-05-14 15:24 ` Zhi Yong Wu
  9 siblings, 3 replies; 27+ messages in thread
From: Zhi Yong Wu @ 2013-05-08 23:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

Hi, all

   I saw that bcache will be merged into the upstream kernel soon, so I
want to know whether BTRFS hot relocation support is still meaningful;
if not, I will not continue to work on it. Can anyone let me know?
Thanks.


On Mon, May 6, 2013 at 4:53 PM,  <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
>   The patchset as RFC is sent out mainly to see if it goes in the
> correct development direction.
>
>   The patchset is trying to introduce hot relocation support
> for BTRFS. In hybrid storage environment, when the data in
> HDD disk get hot, it can be relocated to SSD disk by BTRFS
> hot relocation support automatically; also, if SSD disk ratio
> exceed its upper threshold, the data which get cold can be
> looked up and relocated to HDD disk to make more space in SSD
> disk at first, and then the data which get hot will be relocated
> to SSD disk automatically.
>
>   BTRFS hot relocation mainly reserve block space from SSD disk
> at first, load the hot data to page cache from HDD, allocate
> block space from SSD disk, and finally write the data to SSD disk.
>
>   If you'd like to play with it, pls pull the patchset from
> my git on github:
>   https://github.com/wuzhy/kernel.git hot_reloc
>
> For how to use, please refer too the example below:
>
> root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
> ^^^ Above command will hack /dev/vdc to be one SSD disk
> root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
> root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
>
> WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
> WARNING! - see http://btrfs.wiki.kernel.org before using
>
> [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
> [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
> [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
> [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
> [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
> adding device /dev/vdc id 2
> [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
> fs created label (null) on /dev/vdb
> nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
> Btrfs v0.20-rc1-254-gb0136aa-dirty
> root@debian-i386:~# mount -o hot_move /dev/vdb /data2
> [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
> [ 144.870444] btrfs: disk space caching is enabled
> [ 144.904214] VFS: Turning on hot data tracking
> root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
> root@debian-i386:~# df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/vda1 16G 13G 2.2G 86% /
> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
> udev 10M 176K 9.9M 2% /dev
> tmpfs 4.8G 0 4.8G 0% /dev/shm
> /dev/vdb 15G 2.0G 13G 14% /data2
> root@debian-i386:~# btrfs fi df /data2
> Data: total=3.01GB, used=2.00GB
> System: total=4.00MB, used=4.00KB
> Metadata: total=8.00MB, used=2.19MB
> Data_SSD: total=8.00MB, used=0.00
> root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
> ^^^ Above command will start HOT RLEOCATE, because The data temperature is currently 109
> root@debian-i386:~# df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/vda1 16G 13G 2.2G 86% /
> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
> udev 10M 176K 9.9M 2% /dev
> tmpfs 4.8G 0 4.8G 0% /dev/shm
> /dev/vdb 15G 2.1G 13G 14% /data2
> root@debian-i386:~# btrfs fi df /data2
> Data: total=3.01GB, used=6.25MB
> System: total=4.00MB, used=4.00KB
> Metadata: total=8.00MB, used=2.26MB
> Data_SSD: total=2.01GB, used=2.00GB
> root@debian-i386:~#
>
> Zhi Yong Wu (5):
>   vfs: add one list_head field
>   btrfs: add one new block group
>   btrfs: add one hot relocation kthread
>   procfs: add three proc interfaces
>   btrfs: add hot relocation support
>
>  fs/btrfs/Makefile            |   3 +-
>  fs/btrfs/ctree.h             |  26 +-
>  fs/btrfs/extent-tree.c       | 107 +++++-
>  fs/btrfs/extent_io.c         |  31 +-
>  fs/btrfs/extent_io.h         |   4 +
>  fs/btrfs/file.c              |  36 +-
>  fs/btrfs/hot_relocate.c      | 802 +++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/hot_relocate.h      |  48 +++
>  fs/btrfs/inode-map.c         |  13 +-
>  fs/btrfs/inode.c             |  92 ++++-
>  fs/btrfs/ioctl.c             |  23 +-
>  fs/btrfs/relocation.c        |  14 +-
>  fs/btrfs/super.c             |  30 +-
>  fs/btrfs/volumes.c           |  28 +-
>  fs/hot_tracking.c            |   1 +
>  include/linux/btrfs.h        |   4 +
>  include/linux/hot_tracking.h |   1 +
>  kernel/sysctl.c              |  22 ++
>  18 files changed, 1234 insertions(+), 51 deletions(-)
>  create mode 100644 fs/btrfs/hot_relocate.c
>  create mode 100644 fs/btrfs/hot_relocate.h
>
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards,

Zhi Yong Wu


* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-08 23:13 ` Zhi Yong Wu
@ 2013-05-09  6:30   ` Stefan Behrens
  2013-05-09  6:42     ` Zhi Yong Wu
  2013-05-09  7:28     ` Zheng Liu
  2013-05-09  6:56   ` Roger Binns
  2013-05-19 10:41   ` Martin Steigerwald
  2 siblings, 2 replies; 27+ messages in thread
From: Stefan Behrens @ 2013-05-09  6:30 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

On 05/09/2013 01:13, Zhi Yong Wu wrote:
> HI, all
>
>     I saw that bcache will be merged into kernel upstream soon, so i
> want to know if btrfs hot relocation support is still meanful, if no,
> i will not continue to work on it. can anyone let me know this?
> thanks.

Which one is better?

Please do some measurements. Select typical file system use cases, and 
publish and compare the measurement results of the two approaches.


> On Mon, May 6, 2013 at 4:53 PM,  <zwu.kernel@gmail.com> wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>    The patchset as RFC is sent out mainly to see if it goes in the
>> correct development direction.
>>
>>    The patchset is trying to introduce hot relocation support
>> for BTRFS. In hybrid storage environment, when the data in
>> HDD disk get hot, it can be relocated to SSD disk by BTRFS
>> hot relocation support automatically; also, if SSD disk ratio
>> exceed its upper threshold, the data which get cold can be
>> looked up and relocated to HDD disk to make more space in SSD
>> disk at first, and then the data which get hot will be relocated
>> to SSD disk automatically.
>>
>>    BTRFS hot relocation mainly reserve block space from SSD disk
>> at first, load the hot data to page cache from HDD, allocate
>> block space from SSD disk, and finally write the data to SSD disk.
>>
>>    If you'd like to play with it, pls pull the patchset from
>> my git on github:
>>    https://github.com/wuzhy/kernel.git hot_reloc
>>
>> For how to use, please refer too the example below:
>>
>> root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
>> ^^^ Above command will hack /dev/vdc to be one SSD disk
>> root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
>> root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
>>
>> WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
>> WARNING! - see http://btrfs.wiki.kernel.org before using
>>
>> [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
>> [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>> [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>> [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>> [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>> adding device /dev/vdc id 2
>> [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
>> fs created label (null) on /dev/vdb
>> nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
>> Btrfs v0.20-rc1-254-gb0136aa-dirty
>> root@debian-i386:~# mount -o hot_move /dev/vdb /data2
>> [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
>> [ 144.870444] btrfs: disk space caching is enabled
>> [ 144.904214] VFS: Turning on hot data tracking
>> root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
>> 2048+0 records in
>> 2048+0 records out
>> 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
>> root@debian-i386:~# df -h
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 16G 13G 2.2G 86% /
>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>> udev 10M 176K 9.9M 2% /dev
>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>> /dev/vdb 15G 2.0G 13G 14% /data2
>> root@debian-i386:~# btrfs fi df /data2
>> Data: total=3.01GB, used=2.00GB
>> System: total=4.00MB, used=4.00KB
>> Metadata: total=8.00MB, used=2.19MB
>> Data_SSD: total=8.00MB, used=0.00
>> root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
>> ^^^ Above command will start HOT RLEOCATE, because The data temperature is currently 109
>> root@debian-i386:~# df -h
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vda1 16G 13G 2.2G 86% /
>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>> udev 10M 176K 9.9M 2% /dev
>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>> /dev/vdb 15G 2.1G 13G 14% /data2
>> root@debian-i386:~# btrfs fi df /data2
>> Data: total=3.01GB, used=6.25MB
>> System: total=4.00MB, used=4.00KB
>> Metadata: total=8.00MB, used=2.26MB
>> Data_SSD: total=2.01GB, used=2.00GB
>> root@debian-i386:~#
>>
>> Zhi Yong Wu (5):
>>    vfs: add one list_head field
>>    btrfs: add one new block group
>>    btrfs: add one hot relocation kthread
>>    procfs: add three proc interfaces
>>    btrfs: add hot relocation support
>>
>>   fs/btrfs/Makefile            |   3 +-
>>   fs/btrfs/ctree.h             |  26 +-
>>   fs/btrfs/extent-tree.c       | 107 +++++-
>>   fs/btrfs/extent_io.c         |  31 +-
>>   fs/btrfs/extent_io.h         |   4 +
>>   fs/btrfs/file.c              |  36 +-
>>   fs/btrfs/hot_relocate.c      | 802 +++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/hot_relocate.h      |  48 +++
>>   fs/btrfs/inode-map.c         |  13 +-
>>   fs/btrfs/inode.c             |  92 ++++-
>>   fs/btrfs/ioctl.c             |  23 +-
>>   fs/btrfs/relocation.c        |  14 +-
>>   fs/btrfs/super.c             |  30 +-
>>   fs/btrfs/volumes.c           |  28 +-
>>   fs/hot_tracking.c            |   1 +
>>   include/linux/btrfs.h        |   4 +
>>   include/linux/hot_tracking.h |   1 +
>>   kernel/sysctl.c              |  22 ++
>>   18 files changed, 1234 insertions(+), 51 deletions(-)
>>   create mode 100644 fs/btrfs/hot_relocate.c
>>   create mode 100644 fs/btrfs/hot_relocate.h


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-09  6:30   ` Stefan Behrens
@ 2013-05-09  6:42     ` Zhi Yong Wu
  2013-05-09  7:41       ` Stefan Behrens
  2013-05-09  7:28     ` Zheng Liu
  1 sibling, 1 reply; 27+ messages in thread
From: Zhi Yong Wu @ 2013-05-09  6:42 UTC (permalink / raw)
  To: Stefan Behrens
  Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

The btrfs maintainer's opinion is very important, I guess.

On Thu, May 9, 2013 at 2:30 PM, Stefan Behrens
<sbehrens@giantdisaster.de> wrote:
> On 05/09/2013 01:13, Zhi Yong Wu wrote:
>>
>> HI, all
>>
>>     I saw that bcache will be merged into kernel upstream soon, so i
>> want to know if btrfs hot relocation support is still meanful, if no,
>> i will not continue to work on it. can anyone let me know this?
>> thanks.
>
>
> Which one is better?
>
> Please do some measurements. Select typical file system use cases, and
> publish and compare the measurement results of the two approaches.
>
>
>
>> On Mon, May 6, 2013 at 4:53 PM,  <zwu.kernel@gmail.com> wrote:
>>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>
>>> [...]
>
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-08 23:13 ` Zhi Yong Wu
  2013-05-09  6:30   ` Stefan Behrens
@ 2013-05-09  6:56   ` Roger Binns
  2013-05-19 10:41   ` Martin Steigerwald
  2 siblings, 0 replies; 27+ messages in thread
From: Roger Binns @ 2013-05-09  6:56 UTC (permalink / raw)
  To: linux-btrfs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/05/13 16:13, Zhi Yong Wu wrote:
> i want to know if btrfs hot relocation support is still meanful

It is to me.  The problem with bcache is that it is a cache, i.e. if you
have a 256GB SSD and a 500GB HDD then you'll have total storage of 500GB.
Hot relocation was described as providing 750GB in the same scenario.

I'd also expect btrfs-level support to be more friendly, such as being able
to mount -o degraded if one of the devices is missing.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlGLSIkACgkQmOOfHg372QSHQACeLwbhnVsH7+/6ZSIaGAcMUyBe
gPwAoNAKBAFB65XZQbLyxyCJPHODR+9z
=6OXt
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
                   ` (7 preceding siblings ...)
  2013-05-08 23:13 ` Zhi Yong Wu
@ 2013-05-09  7:17 ` Gabriel de Perthuis
  2013-05-14 15:24 ` Zhi Yong Wu
  9 siblings, 0 replies; 27+ messages in thread
From: Gabriel de Perthuis @ 2013-05-09  7:17 UTC (permalink / raw)
  To: Zhi Yong Wu, linux-btrfs

On Thu, 09 May 2013 07:13:56 +0800, Zhi Yong Wu wrote:
> HI, all
> 
>    I saw that bcache will be merged into kernel upstream soon, so i
> want to know if btrfs hot relocation support is still meanful, if no,
> i will not continue to work on it. can anyone let me know this?
> thanks.

bcache performance would be poor if btrfs raid1 is used: the hits
will be spread across the mirrors at random, so bcache will have to
cache twice as much data and convergence could be slow.

Hot tracking also has much more precise information about which files are
hot, whereas bcache can't know anything about files that are already in
the page cache and risks evicting some of them.

It all comes down to benchmarks, but these two situations could be easy
wins for hot-reloc.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-09  6:30   ` Stefan Behrens
  2013-05-09  6:42     ` Zhi Yong Wu
@ 2013-05-09  7:28     ` Zheng Liu
  1 sibling, 0 replies; 27+ messages in thread
From: Zheng Liu @ 2013-05-09  7:28 UTC (permalink / raw)
  To: Stefan Behrens
  Cc: Zhi Yong Wu, linux-btrfs, sekharan, chris.mason, Ilya Dryomov,
	Zhi Yong Wu

On Thu, May 09, 2013 at 08:30:12AM +0200, Stefan Behrens wrote:
> On 05/09/2013 01:13, Zhi Yong Wu wrote:
> >HI, all
> >
> >    I saw that bcache will be merged into kernel upstream soon, so i
> >want to know if btrfs hot relocation support is still meanful, if no,
> >i will not continue to work on it. can anyone let me know this?
> >thanks.
> 
> Which one is better?
> 
> Please do some measurements. Select typical file system use cases,
> and publish and compare the measurement results of the two
> approaches.

Hi Stefan,

AFAIU, the key issue is whether the hot relocation feature should be
implemented in the file system or in the block device layer. The file
system knows which data is hot, and applications could use
fadvise/ioctl/... interfaces to hint the file system to keep some data
on the fast device.  But IIUC all dm-cache/bcache can do is say: "hey,
this data should be hot just because it was touched twice."  In some
cases, being touched twice does not mean the data should be kept on the
fast device.

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-09  6:42     ` Zhi Yong Wu
@ 2013-05-09  7:41       ` Stefan Behrens
  2013-05-09  7:49         ` Zhi Yong Wu
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Behrens @ 2013-05-09  7:41 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

On 05/09/2013 08:42, Zhi Yong Wu wrote:
> btrfs maintainer's opinion is very important, i guess.

My opinion is not important and I shall shut up?


> On Thu, May 9, 2013 at 2:30 PM, Stefan Behrens
> <sbehrens@giantdisaster.de> wrote:
>> On 05/09/2013 01:13, Zhi Yong Wu wrote:
>>>
>>> HI, all
>>>
>>>      I saw that bcache will be merged into kernel upstream soon, so i
>>> want to know if btrfs hot relocation support is still meanful, if no,
>>> i will not continue to work on it. can anyone let me know this?
>>> thanks.
>>
>>
>> Which one is better?
>>
>> Please do some measurements. Select typical file system use cases, and
>> publish and compare the measurement results of the two approaches.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-09  7:41       ` Stefan Behrens
@ 2013-05-09  7:49         ` Zhi Yong Wu
  0 siblings, 0 replies; 27+ messages in thread
From: Zhi Yong Wu @ 2013-05-09  7:49 UTC (permalink / raw)
  To: Stefan Behrens
  Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

No, no, you misunderstood what I meant. :)  You know, it would perhaps
take a lot of time to do benchmarks. We only hope to get some
suggestions on whether it is worth continuing to work on this. Your and
other guys' opinions are very welcome.

On Thu, May 9, 2013 at 3:41 PM, Stefan Behrens
<sbehrens@giantdisaster.de> wrote:
> On 05/09/2013 08:42, Zhi Yong Wu wrote:
>>
>> btrfs maintainer's opinion is very important, i guess.
>
>
> My opinion is not important and I shall shut up?
>
>
>
>> On Thu, May 9, 2013 at 2:30 PM, Stefan Behrens
>> <sbehrens@giantdisaster.de> wrote:
>>>
>>> On 05/09/2013 01:13, Zhi Yong Wu wrote:
>>>>
>>>>
>>>> HI, all
>>>>
>>>>      I saw that bcache will be merged into kernel upstream soon, so i
>>>> want to know if btrfs hot relocation support is still meanful, if no,
>>>> i will not continue to work on it. can anyone let me know this?
>>>> thanks.
>>>
>>>
>>>
>>> Which one is better?
>>>
>>> Please do some measurements. Select typical file system use cases, and
>>> publish and compare the measurement results of the two approaches.
>
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
                   ` (8 preceding siblings ...)
  2013-05-09  7:17 ` Gabriel de Perthuis
@ 2013-05-14 15:24 ` Zhi Yong Wu
  2013-05-16  7:12   ` Kai Krakow
  9 siblings, 1 reply; 27+ messages in thread
From: Zhi Yong Wu @ 2013-05-14 15:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

Hi,

   Does the design approach go in the correct direction? Do you
have any comments or a better design idea for BTRFS hot relocation
support? Any comments are appreciated, thanks.


On Mon, May 6, 2013 at 4:53 PM,  <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> [...]



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-14 15:24 ` Zhi Yong Wu
@ 2013-05-16  7:12   ` Kai Krakow
  2013-05-17  7:23     ` Zhi Yong Wu
  0 siblings, 1 reply; 27+ messages in thread
From: Kai Krakow @ 2013-05-16  7:12 UTC (permalink / raw)
  To: linux-btrfs

Hi!

I think such a solution as part of the filesystem could do much better than 
something outside of it (like bcache). But I'm not sure: what makes data 
hot? I think the most benefit lies in detecting random read access and 
marking only that data as hot; writes should also go to the SSD first and 
then be spooled to the hard disks in the background. Bcache already does a 
lot of this.

Since this is within the filesystem, users could even mark files as always 
"hot" with some attribute or ioctl. A boot-readahead and preload 
implementation could use this to automatically mark the files used during 
booting as hot, or to preload files when I start an application.

On the other side, hot relocation should reduce writes to the SSD as much 
as possible; for example, do not defragment files during autodefrag (it 
makes no sense there), and write data in bursts of erase-block size, etc.

Also important: what if the SSD dies due to wear? Will it gracefully fall 
back to the hard disk? And what does "relocation" mean? Files (hot data) 
should only be copied to the SSD as a cache, not moved there. It should be 
possible for btrfs to just drop a failing SSD from the filesystem without 
data loss; otherwise one would have to use two SSDs in RAID-1 mode to get 
safe cache storage.

Altogether I think a spinning-media btrfs RAID can outperform a single 
SSD, so hot relocation should probably be used to reduce head movement, 
because that is where an SSD really excels. Everything that involves heavy 
head movement should go to the SSD first, then be written back to the hard 
disk. And I think there's a lot of potential to optimize, because a COW 
filesystem like btrfs naturally causes a lot of head movement.

What do you think?

BTW: I have not tried either one yet because I'm still deciding which way 
to go. Your patches are the more welcome to me because I would not need to 
migrate my storage to bcache-provided block devices. OTOH the bcache 
implementation looks a lot more mature (with regard to performance and 
safety) at this point because it provides many of the above-mentioned 
features, most importantly graceful handling of failing SSDs.

Regarding btrfs RAID outperforming an SSD: during boot, my spinning-media 
three-device btrfs RAID reads boot files at up to 600 MB/s (from an 
LZ-compressed fs), and boot takes about 7 seconds until the display manager 
starts (which takes another 30 seconds, but that's another story), even 
though the system is pretty crowded with services I actually wouldn't need 
if I optimized for boot performance. I think systemd's read-ahead 
implementation has a lot of influence on this fast booting: it defragments 
and relocates boot files on btrfs during boot so the hard disks can read 
all this stuff sequentially. I think it also compresses boot files if 
compression is enabled, because booting is IO-bound, not CPU-bound. 
Benchmarks showed that my btrfs RAID could technically read up to 450 MB/s, 
so I think the 600 MB/s counts decompressed data. A single SSD could not do 
that. For the same reason I created a small script to defragment and 
compress the files used by the preload daemon. Without benchmarking it, 
this felt like another small performance boost. So I'm eager to see what 
could come next with some sort of SSD cache, because the only problem left 
seems to be heavy head movement, which slows down the system.

Zhi Yong Wu <zwu.kernel@gmail.com> wrote:

> HI,
> 
>    What do you think if its design approach goes correctly? Do you
> have any comments or better design idea for BTRFS hot relocation
> support? any comments are appreciated, thanks.
> 
> 
> On Mon, May 6, 2013 at 4:53 PM,  <zwu.kernel@gmail.com> wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>> [...]



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-16  7:12   ` Kai Krakow
@ 2013-05-17  7:23     ` Zhi Yong Wu
  0 siblings, 0 replies; 27+ messages in thread
From: Zhi Yong Wu @ 2013-05-17  7:23 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

Hi,

You may want to check out the VFS hot tracking patchset:
https://lwn.net/Articles/550495/

On Thu, May 16, 2013 at 3:12 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
> Hi!
>
> I think such a solution as part of the filesystem could do much better than
> something outside of it (like bcache). But I'm not sure: What makes data
> hot? I think the most benefit is detecting random read access and mark only
> those data as hot, also writes should go to the SSD first and then should be
> spooled to the harddisks in background. Bcache does a lot regarding this.
>
> Since this is within the filesystem, users could even mark files as being
> always "hot" with some attribute or ioctl. This could be used by a boot-
> readahead and preload implementation to automatically make files hot used
> during booting or for preloading when I start an application.
>
> On the other side hot relocation should be able to reduce writes to the SSD
> as good as possible, for example: Do not defragment files during autodefrag,
> it makes no sense. Also write data in bursts of erase block size etc.
>
> And also important: What if the SSD dies due to wearing? Will it gracefully
> fall back to harddisk? What does "relocation" mean? Files (hot data) should
> only be cached in copy to SSD, and not moved there. It should be possible
> for btrfs to just drop a failing SSD from the filesystem without data loss
> because otherwise one should use two SSDs in raid-1 mode to get a safe cache
> storage.
>
> Altogether I think that a spinning media btrfs raid can outperform a single
> SSD so hot relocation should probably be used to reduce head movements
> because this is where SSD really excels. So everything that involves heavy
> head movement should go to SSD first, then written back to harddisk. And I
> think there's a lot of potential to optimize because a COW filesystem like
> btrfs naturally has a lot of head movement.
>
> What do you think?
>
> BTW: I have not tried the one or the other yet because I'm still deciding
> which way to go. Your patches are more welcome because I do not need to
> migrate my storage to bcache-provided block devices. OTOH the bcache
> implementation looks a lot more mature (with regard to performance and
> safety) at this point because it provides many of the above mentioned
> features - most importantly gracefully handling failing SSDs.
>
> Regarding btrfs raid outperforms SSD: During boot my spinning media 3 device
> btrfs raid reads boot files with up to 600 MB/s (from LZ compressed fs),
> boot takes about 7 seconds until the display manager starts (which takes
> another 30 seconds but that's another story), and the system is pretty
> crowded with services I actually wouldn't need if I optimized for boot
> performance. But I think systemd's read-ahead implementation has a lot
> influence on this fast booting: It defragments and relocates boot files on
> btrfs during boot so the harddisks can sequentially read all this stuff. I
> think it also compresses boot files if compression is enabled because
> booting is IO bound, not CPU bound. Benchmarks showed that my btrfs raid
> could technically read up to 450 MB/s, so I think the 600 MB/s counts for
> decompressed data. A single SSD could not do that. For that same reason I
> created a small script to defragment and compress files used by the preload
> daemon. Without benchmarking it, this felt like another small performance
> boost. So I'm eager what could be next with some sort of SSD cache because
> the only problem left seems to be heavy head movement which slows down the
> system.
>
> Zhi Yong Wu <zwu.kernel@gmail.com> schrieb:
>
>> Hi,
>>
>>    Do you think its design approach is going in the right direction? Do
>> you have any comments or a better design idea for BTRFS hot relocation
>> support? Any comments are appreciated, thanks.
>>
>>
>> On Mon, May 6, 2013 at 4:53 PM,  <zwu.kernel@gmail.com> wrote:
>>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>
>>>   This patchset is sent out as an RFC mainly to check whether it is
>>> going in the correct development direction.
>>>
>>>   The patchset introduces hot relocation support for BTRFS. In a
>>> hybrid storage environment, data on HDD that becomes hot can be
>>> relocated to SSD automatically by BTRFS hot relocation support;
>>> conversely, if SSD usage exceeds its upper threshold, data that has
>>> gone cold is looked up and relocated back to HDD first to free up
>>> SSD space, and then the data that has become hot is relocated to
>>> SSD automatically.
>>>
>>>   BTRFS hot relocation mainly works by first reserving block space
>>> on the SSD, loading the hot data from HDD into the page cache,
>>> allocating block space on the SSD, and finally writing the data to
>>> the SSD.
>>>
>>>   If you'd like to play with it, please pull the patchset from
>>> my git on github:
>>>   https://github.com/wuzhy/kernel.git hot_reloc
>>>
>>> For how to use it, please refer to the example below:
>>>
>>> root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
>>> ^^^ The command above makes the kernel treat /dev/vdc as an SSD
>>> root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
>>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
>>> root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
>>> root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f
>>>
>>> WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
>>> WARNING! - see http://btrfs.wiki.kernel.org before using
>>>
>>> [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
>>> [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>>> [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>>> [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
>>> [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
>>> adding device /dev/vdc id 2
>>> [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
>>> fs created label (null) on /dev/vdb
>>> nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
>>> Btrfs v0.20-rc1-254-gb0136aa-dirty
>>> root@debian-i386:~# mount -o hot_move /dev/vdb /data2
>>> [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
>>> [ 144.870444] btrfs: disk space caching is enabled
>>> [ 144.904214] VFS: Turning on hot data tracking
>>> root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
>>> 2048+0 records in
>>> 2048+0 records out
>>> 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
>>> root@debian-i386:~# df -h
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/vda1 16G 13G 2.2G 86% /
>>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>>> udev 10M 176K 9.9M 2% /dev
>>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>>> /dev/vdb 15G 2.0G 13G 14% /data2
>>> root@debian-i386:~# btrfs fi df /data2
>>> Data: total=3.01GB, used=2.00GB
>>> System: total=4.00MB, used=4.00KB
>>> Metadata: total=8.00MB, used=2.19MB
>>> Data_SSD: total=8.00MB, used=0.00
>>> root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
>>> ^^^ The command above starts HOT RELOCATE, because the data
>>> temperature is currently 109
>>> root@debian-i386:~# df -h
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/vda1 16G 13G 2.2G 86% /
>>> tmpfs 4.8G 0 4.8G 0% /lib/init/rw
>>> udev 10M 176K 9.9M 2% /dev
>>> tmpfs 4.8G 0 4.8G 0% /dev/shm
>>> /dev/vdb 15G 2.1G 13G 14% /data2
>>> root@debian-i386:~# btrfs fi df /data2
>>> Data: total=3.01GB, used=6.25MB
>>> System: total=4.00MB, used=4.00KB
>>> Metadata: total=8.00MB, used=2.26MB
>>> Data_SSD: total=2.01GB, used=2.00GB
>>> root@debian-i386:~#
>>>
>>> Zhi Yong Wu (5):
>>>   vfs: add one list_head field
>>>   btrfs: add one new block group
>>>   btrfs: add one hot relocation kthread
>>>   procfs: add three proc interfaces
>>>   btrfs: add hot relocation support
>>>
>>>  fs/btrfs/Makefile            |   3 +-
>>>  fs/btrfs/ctree.h             |  26 +-
>>>  fs/btrfs/extent-tree.c       | 107 +++++-
>>>  fs/btrfs/extent_io.c         |  31 +-
>>>  fs/btrfs/extent_io.h         |   4 +
>>>  fs/btrfs/file.c              |  36 +-
>>>  fs/btrfs/hot_relocate.c      | 802
>>>  +++++++++++++++++++++++++++++++++++++++++++
>>>  fs/btrfs/hot_relocate.h      |  48 +++
>>>  fs/btrfs/inode-map.c         |  13 +-
>>>  fs/btrfs/inode.c             |  92 ++++-
>>>  fs/btrfs/ioctl.c             |  23 +-
>>>  fs/btrfs/relocation.c        |  14 +-
>>>  fs/btrfs/super.c             |  30 +-
>>>  fs/btrfs/volumes.c           |  28 +-
>>>  fs/hot_tracking.c            |   1 +
>>>  include/linux/btrfs.h        |   4 +
>>>  include/linux/hot_tracking.h |   1 +
>>>  kernel/sysctl.c              |  22 ++
>>>  18 files changed, 1234 insertions(+), 51 deletions(-)
>>>  create mode 100644 fs/btrfs/hot_relocate.c
>>>  create mode 100644 fs/btrfs/hot_relocate.h
>>>
>>> --
>>> 1.7.11.7
>>>
>>
>>
>>
>
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-08 23:13 ` Zhi Yong Wu
  2013-05-09  6:30   ` Stefan Behrens
  2013-05-09  6:56   ` Roger Binns
@ 2013-05-19 10:41   ` Martin Steigerwald
  2013-05-19 13:43     ` Zhi Yong Wu
  2013-05-19 13:46     ` Zhi Yong Wu
  2 siblings, 2 replies; 27+ messages in thread
From: Martin Steigerwald @ 2013-05-19 10:41 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

On Thursday, 9 May 2013 at 07:13:56, Zhi Yong Wu wrote:
> Hi, all

Hi!

>    I saw that bcache will be merged into the upstream kernel soon, so I
> want to know whether btrfs hot relocation support is still meaningful;
> if not, I will not continue to work on it. Can anyone let me know?
> Thanks.

I really look forward to VFS hot data tracking with BTRFS (and other 
filesystem) support.

ZFS and BTRFS have shown that RAID support within the filesystem can make a lot 
of sense. I think hot relocation is another area which can be done more 
accurately within the filesystem. I think there are several features that are 
only possible by going this route:

1) Mark files as hot via ioctl. MySQL can mark InnoDB journal files for example.

2) Proper BTRFS RAID support (as noted elsewhere in this thread)

3) Easier setup. Given BTRFS's flexibility, I would expect that an SSD used as 
a hot data cache can be added and removed on the fly while the filesystem is 
mounted. It only seems to be supported at mkfs time as I read the patch docs, 
but from my basic technical understanding of BTRFS it could be extended to 
work on the fly with a mounted FS as well.


Block-based caching can make sense for other use cases, like a raw device with 
an Oracle DB on it or maybe a swap device, or for filesystems not (yet) 
supporting VFS hot data tracking.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-19 10:41   ` Martin Steigerwald
@ 2013-05-19 13:43     ` Zhi Yong Wu
  2013-05-19 14:42       ` Martin Steigerwald
  2013-05-19 13:46     ` Zhi Yong Wu
  1 sibling, 1 reply; 27+ messages in thread
From: Zhi Yong Wu @ 2013-05-19 13:43 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

On Sun, May 19, 2013 at 6:41 PM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> On Thursday, 9 May 2013 at 07:13:56, Zhi Yong Wu wrote:
>> Hi, all
>
> Hi!
>
>>    I saw that bcache will be merged into the upstream kernel soon, so I
>> want to know whether btrfs hot relocation support is still meaningful;
>> if not, I will not continue to work on it. Can anyone let me know?
>> Thanks.
>
> I really look forward to VFS hot data tracking with BTRFS (and other
> filesystem) support.
I also really want to see this happen.
>
> ZFS and BTRFS have shown that RAID support within the filesystem can make a lot
> of sense. I think hot relocation is another area which can be done more
> accurate within the filesystem. I think there are several features only
> possible by going this route:
>
> 1) Mark files as hot via ioctl. MySQL can mark InnoDB journal files for example.
Getting the hot status of a specified file via ioctl is already
implemented, and it would not be hard to add the corresponding set
function. Before that is done, though, I would rather see the core of
VFS hot tracking merged upstream first; landing too many new functions
at once makes that harder.
Can we put it on the to-do list for now? What do you think?
>
> 2) Proper BTRFS RAID support (as noted elsewhere in this thread)
>
> 3) Easier setup. With BTRFS flexibility I would expect that a SSD as hot data
> cache can be added and removed on the fly during filesystem is mounted. Only
> seems supported at mkfs-time as I read the patch docs, but from my basic
> technical understanding of BTRFS it can be extented to be done on the fly with
> a mounted FS as well.
Yes, it is supported only at mkfs time. It shouldn't be hard to enable
adding or removing a nonrotating disk as hot cache on the fly. Before
that is done, though, I would like to make sure that the design of
BTRFS hot relocation in this RFC patchset is going in the right
direction and is appropriate; if it is going in the wrong direction, we
will be doing a lot of meaningless work. For example, what do you think
of the current idea of introducing a new block group type for
nonrotating disks?
While working on this, I tried to change as little of the current btrfs
code as possible. If we change too much of it, we will introduce
regressions more easily, and the patchset will also be harder to get
accepted by btrfs upstream.

What do you think?

>
>
> Block based caching can make sense for other use cases like raw device with
> Oracle DB on it or maybe a swap device. Or for filesystems not (yet) supporting
> VFS hot data tracking.
>
> Thanks,
> --
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



--
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-19 10:41   ` Martin Steigerwald
  2013-05-19 13:43     ` Zhi Yong Wu
@ 2013-05-19 13:46     ` Zhi Yong Wu
  1 sibling, 0 replies; 27+ messages in thread
From: Zhi Yong Wu @ 2013-05-19 13:46 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

On Sun, May 19, 2013 at 6:41 PM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> On Thursday, 9 May 2013 at 07:13:56, Zhi Yong Wu wrote:
>> Hi, all
>
> Hi!
>
>>    I saw that bcache will be merged into the upstream kernel soon, so I
>> want to know whether btrfs hot relocation support is still meaningful;
>> if not, I will not continue to work on it. Can anyone let me know?
>> Thanks.
>
> I really look forward to VFS hot data tracking with BTRFS (and other
> filesystem) support.
>
> ZFS and BTRFS have shown that RAID support within the filesystem can make a lot
> of sense. I think hot relocation is another area which can be done more
> accurate within the filesystem. I think there are several features only
> possible by going this route:
>
> 1) Mark files as hot via ioctl. MySQL can mark InnoDB journal files for example.
>
> 2) Proper BTRFS RAID support (as noted elsewhere in this thread)
Once the current design draft for BTRFS hot relocation in this RFC gets
basic agreement from the btrfs community, maintainer, or main
developers, I will dive into this further.
For now it is on my to-do list, thanks.
>
> 3) Easier setup. With BTRFS flexibility I would expect that a SSD as hot data
> cache can be added and removed on the fly during filesystem is mounted. Only
> seems supported at mkfs-time as I read the patch docs, but from my basic
> technical understanding of BTRFS it can be extented to be done on the fly with
> a mounted FS as well.
>
>
> Block based caching can make sense for other use cases like raw device with
> Oracle DB on it or maybe a swap device. Or for filesystems not (yet) supporting
> VFS hot data tracking.
>
> Thanks,
> --
> Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
> GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



--
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 0/5] BTRFS hot relocation support
  2013-05-19 13:43     ` Zhi Yong Wu
@ 2013-05-19 14:42       ` Martin Steigerwald
  0 siblings, 0 replies; 27+ messages in thread
From: Martin Steigerwald @ 2013-05-19 14:42 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: linux-btrfs, sekharan, chris.mason, Ilya Dryomov, Zhi Yong Wu

On Sunday, 19 May 2013 at 21:43:14, Zhi Yong Wu wrote:
> On Sun, May 19, 2013 at 6:41 PM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> > On Thursday, 9 May 2013 at 07:13:56, Zhi Yong Wu wrote:
[…]
> > ZFS and BTRFS have shown that RAID support within the filesystem can make
> > a lot of sense. I think hot relocation is another area which can be done
> > more accurate within the filesystem. I think there are several features
> > only possible by going this route:
> > 
> > 1) Mark files as hot via ioctl. MySQL can mark InnoDB journal files for
> > example.
> Getting the hot status of a specified file via ioctl is already
> implemented, and it would not be hard to add the corresponding set
> function. Before that is done, though, I would rather see the core of
> VFS hot tracking merged upstream first; landing too many new functions
> at once makes that harder.
> Can we put it on the to-do list for now? What do you think?

Yes, of course. I didn't want to imply that you have to implement it. I didn't 
contribute to BTRFS code yet, so I do not see myself in a position to request 
anything from you – and even if I did, what you code is still your decision. 
Many thanks for your work on this!

I just wanted to note that IMHO it's possible with the VFS approach, while it's 
at least difficult to do with the block-layer based approaches.

> > 2) Proper BTRFS RAID support (as noted elsewhere in this thread)
> > 
> > 3) Easier setup. With BTRFS flexibility I would expect that a SSD as hot
> > data cache can be added and removed on the fly during filesystem is
> > mounted. Only seems supported at mkfs-time as I read the patch docs, but
> > from my basic technical understanding of BTRFS it can be extented to be
> > done on the fly with a mounted FS as well.
> 
> Yes, it is supported only at mkfs time. It shouldn't be hard to enable
> adding or removing a nonrotating disk as hot cache on the fly. Before
> that is done, though, I would like to make sure that the design of
> BTRFS hot relocation in this RFC patchset is going in the right
> direction and is appropriate; if it is going in the wrong direction,
> we will be doing a lot of meaningless work. For example, what do you
> think of the current idea of introducing a new block group type for
> nonrotating disks?
> While working on this, I tried to change as little of the current
> btrfs code as possible. If we change too much of it, we will introduce
> regressions more easily, and the patchset will also be harder to get
> accepted by btrfs upstream.
> 
> What do you think?

Yes, see above. Keep it simple and extend later, I'd say :)

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-05-19 14:42 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-06  8:53 [RFC 0/5] BTRFS hot relocation support zwu.kernel
2013-05-06  8:53 ` [RFC 1/5] vfs: add one list_head field zwu.kernel
2013-05-06  8:53 ` [RFC 2/5] btrfs: add one new block group zwu.kernel
2013-05-06  8:53 ` [RFC 3/5] btrfs: add one hot relocation kthread zwu.kernel
2013-05-06  8:53 ` [RFC 4/5] procfs: add three proc interfaces zwu.kernel
2013-05-06  8:53 ` [RFC 5/5] btrfs: add hot relocation support zwu.kernel
2013-05-06 20:36 ` [RFC 0/5] BTRFS " Kai Krakow
2013-05-07  5:17   ` Tomasz Torcz
2013-05-07 21:17     ` Kai Krakow
2013-05-07 21:35 ` Gabriel de Perthuis
2013-05-07 21:58   ` Kai Krakow
2013-05-07 22:27     ` Gabriel de Perthuis
2013-05-08 23:13 ` Zhi Yong Wu
2013-05-09  6:30   ` Stefan Behrens
2013-05-09  6:42     ` Zhi Yong Wu
2013-05-09  7:41       ` Stefan Behrens
2013-05-09  7:49         ` Zhi Yong Wu
2013-05-09  7:28     ` Zheng Liu
2013-05-09  6:56   ` Roger Binns
2013-05-19 10:41   ` Martin Steigerwald
2013-05-19 13:43     ` Zhi Yong Wu
2013-05-19 14:42       ` Martin Steigerwald
2013-05-19 13:46     ` Zhi Yong Wu
2013-05-09  7:17 ` Gabriel de Perthuis
2013-05-14 15:24 ` Zhi Yong Wu
2013-05-16  7:12   ` Kai Krakow
2013-05-17  7:23     ` Zhi Yong Wu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.