* [PATCH 0/11] Update version of write stream ID patchset
@ 2016-03-04 16:10 Jens Axboe
  2016-03-04 16:10 ` [PATCH 01/11] idr: make ida_simple_remove() return an error Jens Axboe
                   ` (12 more replies)
  0 siblings, 13 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger

It's been a while since I last posted the write stream ID patchset, but
here is an updated version.

The original patchset was centered around the current NVMe streams
proposal, but there were a number of issues with that. It's now in a
much better state, and hopefully will make it into 1.3 of the spec
soon.

To quickly re-summarize the intent behind write stream IDs: it's to
be able to provide a hint to the underlying storage device on which
writes could feasibly be grouped together. If the device is able to
group writes of similar lifetimes on media, then we can greatly reduce
the amount of data that needs to be copied around at garbage collection
time. This gives us a better write amplification factor, which leads
to better device lifetimes and better (and more predictable)
performance at steady state.

There's been a number of changes to this patchset since it was last
posted. In summary:

1) The bio parts have been bumped to carry 16 bits of stream data, up
   from 8 and 12 in the original series.

2) Since the interface grew some more options, I've moved away from
   fadvise and instead added a new system call. I don't feel strongly
   about what interface we use here, another option would be to have a
   (big) set of fcntl() commands instead.

3) The kernel now manages the ID space, since we have moved to a host
   assigned model. This is done on a backing_dev_info basis, and the
   btrfs patch has been updated to show how this can be used for nested
   devices on btrfs/md/dm/etc. This could be moved to the request queue
   as well; again, I don't feel too strongly about this specific part.

Those are the big changes.

The patches are against Linus' current -git tip.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/11] idr: make ida_simple_remove() return an error
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 02/11] block: add support for carrying a stream ID in a bio Jens Axboe
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

The idr interface is pretty horrible, in that it doesn't return an
error if we attempt to free an invalid ID; instead it calls
WARN(). That's not great if we're indirectly exposing this as a
user-visible interface.

So add __ida_remove() that returns an error instead of warning,
and change ida_simple_remove() to use that interface. Alternatively
we could make ida_remove() return an error, but then I'd have to
verify all call sites...

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/idr.h |  3 ++-
 lib/idr.c           | 30 +++++++++++++++++++++---------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 083d61e92706..5d9b35233200 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -175,13 +175,14 @@ struct ida {
 
 int ida_pre_get(struct ida *ida, gfp_t gfp_mask);
 int ida_get_new_above(struct ida *ida, int starting_id, int *p_id);
+int __ida_remove(struct ida *ida, int id);
 void ida_remove(struct ida *ida, int id);
 void ida_destroy(struct ida *ida);
 void ida_init(struct ida *ida);
 
 int ida_simple_get(struct ida *ida, unsigned int start, unsigned int end,
 		   gfp_t gfp_mask);
-void ida_simple_remove(struct ida *ida, unsigned int id);
+int ida_simple_remove(struct ida *ida, unsigned int id);
 
 /**
  * ida_get_new - allocate new ID
diff --git a/lib/idr.c b/lib/idr.c
index 6098336df267..106aca69a5e9 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1007,7 +1007,7 @@ EXPORT_SYMBOL(ida_get_new_above);
  * @ida:	ida handle
  * @id:		ID to free
  */
-void ida_remove(struct ida *ida, int id)
+int __ida_remove(struct ida *ida, int id)
 {
 	struct idr_layer *p = ida->idr.top;
 	int shift = (ida->idr.layers - 1) * IDR_BITS;
@@ -1017,7 +1017,7 @@ void ida_remove(struct ida *ida, int id)
 	struct ida_bitmap *bitmap;
 
 	if (idr_id > idr_max(ida->idr.layers))
-		goto err;
+		return -EINVAL;
 
 	/* clear full bits while looking up the leaf idr_layer */
 	while ((shift > 0) && p) {
@@ -1028,14 +1028,14 @@ void ida_remove(struct ida *ida, int id)
 	}
 
 	if (p == NULL)
-		goto err;
+		return -EINVAL;
 
 	n = idr_id & IDR_MASK;
 	__clear_bit(n, p->bitmap);
 
 	bitmap = (void *)p->ary[n];
 	if (!bitmap || !test_bit(offset, bitmap->bitmap))
-		goto err;
+		return -EINVAL;
 
 	/* update bitmap and remove it if empty */
 	__clear_bit(offset, bitmap->bitmap);
@@ -1045,10 +1045,19 @@ void ida_remove(struct ida *ida, int id)
 		free_bitmap(ida, bitmap);
 	}
 
-	return;
+	return 0;
+}
+EXPORT_SYMBOL(__ida_remove);
 
- err:
-	WARN(1, "ida_remove called for id=%d which is not allocated.\n", id);
+/**
+ * ida_remove - remove the given ID
+ * @ida:	ida handle
+ * @id:		ID to free
+ */
+void ida_remove(struct ida *ida, int id)
+{
+	if (__ida_remove(ida, id))
+		WARN(1, "ida_remove called for id=%d which is not allocated.\n", id);
 }
 EXPORT_SYMBOL(ida_remove);
 
@@ -1120,14 +1129,17 @@ EXPORT_SYMBOL(ida_simple_get);
  * @ida: the (initialized) ida.
  * @id: the id returned by ida_simple_get.
  */
-void ida_simple_remove(struct ida *ida, unsigned int id)
+int ida_simple_remove(struct ida *ida, unsigned int id)
 {
 	unsigned long flags;
+	int ret;
 
 	BUG_ON((int)id < 0);
 	spin_lock_irqsave(&simple_ida_lock, flags);
-	ida_remove(ida, id);
+	ret = __ida_remove(ida, id);
 	spin_unlock_irqrestore(&simple_ida_lock, flags);
+
+	return ret;
 }
 EXPORT_SYMBOL(ida_simple_remove);
 
-- 
2.4.1.168.g1ea28e1



* [PATCH 02/11] block: add support for carrying a stream ID in a bio
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
  2016-03-04 16:10 ` [PATCH 01/11] idr: make ida_simple_remove() return an error Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 03/11] Add support for per-file/inode stream ID Jens Axboe
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

The top bits of bio->bi_flags are reserved for keeping the
allocation pool; set aside the next 16 bits for carrying
a stream ID. That leaves us with support for 64k - 1 streams,
since 0 is reserved as a "stream not set" value.

Add helpers for setting/getting stream ID of a bio.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/bio.c               |  5 ++++
 block/blk-core.c          |  4 ++++
 include/linux/blk_types.h | 60 ++++++++++++++++++++++++++++++++++-------------
 3 files changed, 53 insertions(+), 16 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index cf7591551b17..d42e69c23271 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -584,6 +584,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+	bio_set_streamid(bio, bio_get_streamid(bio_src));
 }
 EXPORT_SYMBOL(__bio_clone_fast);
 
@@ -689,6 +690,7 @@ integrity_clone:
 		}
 	}
 
+	bio_set_streamid(bio, bio_get_streamid(bio_src));
 	return bio;
 }
 EXPORT_SYMBOL(bio_clone_bioset);
@@ -2037,6 +2039,9 @@ static void __init biovec_init_slabs(void)
 
 static int __init init_bio(void)
 {
+	BUILD_BUG_ON(BIO_FLAG_LAST > ((8 * sizeof(unsigned int)) -
+			BIO_POOL_BITS - BIO_STREAM_BITS));
+
 	bio_slab_max = 2;
 	bio_slab_nr = 0;
 	bio_slabs = kzalloc(bio_slab_max * sizeof(struct bio_slab), GFP_KERNEL);
diff --git a/block/blk-core.c b/block/blk-core.c
index b83d29755b5a..e4b12693c567 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2071,6 +2071,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			if (bio_streamid_valid(bio))
+				blk_add_trace_msg(q, "StreamID=%u",
+							bio_get_streamid(bio));
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86a38ea1823f..fc63a82a9944 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -112,21 +112,23 @@ struct bio {
 /*
  * bio flags
  */
-#define BIO_SEG_VALID	1	/* bi_phys_segments valid */
-#define BIO_CLONED	2	/* doesn't own data */
-#define BIO_BOUNCED	3	/* bio is a bounce bio */
-#define BIO_USER_MAPPED 4	/* contains user pages */
-#define BIO_NULL_MAPPED 5	/* contains invalid user pages */
-#define BIO_QUIET	6	/* Make BIO Quiet */
-#define BIO_CHAIN	7	/* chained bio, ->bi_remaining in effect */
-#define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
-
-/*
- * Flags starting here get preserved by bio_reset() - this includes
- * BIO_POOL_IDX()
- */
-#define BIO_RESET_BITS	13
-#define BIO_OWNS_VEC	13	/* bio_free() should free bvec */
+enum {
+	BIO_SEG_VALID	= 0,	/* bi_phys_segments valid */
+	BIO_CLONED,		/* doesn't own data */
+	BIO_BOUNCED,		/* bio is a bounce bio */
+	BIO_USER_MAPPED,	/* contains user pages */
+	BIO_NULL_MAPPED,	/* contains invalid user pages */
+	BIO_QUIET,		/* Make BIO Quiet */
+	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
+	BIO_REFFED,		/* bio has elevated ->bi_cnt */
+	BIO_OWNS_VEC,		/* bio_free() should free bvec */
+	BIO_FLAG_LAST,		/* end of bits */
+	/*
+	 * Flags starting here get preserved by bio_reset() - this includes
+	 * BIO_POOL_IDX()
+	 */
+	BIO_RESET_BITS = BIO_FLAG_LAST,
+};
 
 /*
  * top 4 bits of bio flags indicate the pool this bio came from
@@ -134,9 +136,35 @@ struct bio {
 #define BIO_POOL_BITS		(4)
 #define BIO_POOL_NONE		((1UL << BIO_POOL_BITS) - 1)
 #define BIO_POOL_OFFSET		(32 - BIO_POOL_BITS)
-#define BIO_POOL_MASK		(1UL << BIO_POOL_OFFSET)
 #define BIO_POOL_IDX(bio)	((bio)->bi_flags >> BIO_POOL_OFFSET)
 
+/*
+ * after the pool bits, next 16 bits are for the stream id
+ */
+#define BIO_STREAM_BITS		(16)
+#define BIO_STREAM_OFFSET	(BIO_POOL_OFFSET - BIO_STREAM_BITS)
+#define BIO_STREAM_MASK		((1 << BIO_STREAM_BITS) - 1)
+
+static inline unsigned long streamid_to_flags(unsigned int id)
+{
+	return (unsigned long) (id & BIO_STREAM_MASK) << BIO_STREAM_OFFSET;
+}
+
+static inline void bio_set_streamid(struct bio *bio, unsigned int id)
+{
+	bio->bi_flags |= streamid_to_flags(id);
+}
+
+static inline unsigned int bio_get_streamid(struct bio *bio)
+{
+	return (bio->bi_flags >> BIO_STREAM_OFFSET) & BIO_STREAM_MASK;
+}
+
+static inline bool bio_streamid_valid(struct bio *bio)
+{
+	return bio_get_streamid(bio) != 0;
+}
+
 #endif /* CONFIG_BLOCK */
 
 /*
-- 
2.4.1.168.g1ea28e1



* [PATCH 03/11] Add support for per-file/inode stream ID
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
  2016-03-04 16:10 ` [PATCH 01/11] idr: make ida_simple_remove() return an error Jens Axboe
  2016-03-04 16:10 ` [PATCH 02/11] block: add support for carrying a stream ID in a bio Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
       [not found]   ` <CAJVOszBXU-qQENcOGG8pWeARwoWL2G3gNJ0H2uNPjXkiVa8S+Q@mail.gmail.com>
  2016-03-04 16:10 ` [PATCH 04/11] Add system call for setting inode/file write " Jens Axboe
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

Writing to flash devices can be much more efficient if we can
inform the device what kind of data can be grouped together. If
the device is able to group data together with similar lifetimes,
then it can be more efficient in garbage collection. This, in turn,
leads to lower write amplification, which is a win for both device
wear and performance.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/inode.c         |  1 +
 fs/open.c          |  1 +
 include/linux/fs.h | 22 ++++++++++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 69b8b526c194..3e3652a04509 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -147,6 +147,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_blocks = 0;
 	inode->i_bytes = 0;
 	inode->i_generation = 0;
+	inode->i_streamid = 0;
 	inode->i_pipe = NULL;
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
diff --git a/fs/open.c b/fs/open.c
index 55bdc75e2172..38999d86fae0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -749,6 +749,7 @@ static int do_dentry_open(struct file *f,
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
+	f->f_streamid = 0;
 
 	return 0;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae681002100a..10599d231b53 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -660,6 +660,7 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+	unsigned int		i_streamid;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock_context	*i_flctx;
 	struct address_space	i_data;
@@ -681,6 +682,14 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 };
 
+static inline unsigned int inode_streamid(struct inode *inode)
+{
+	if (inode)
+		return inode->i_streamid;
+
+	return 0;
+}
+
 static inline int inode_unhashed(struct inode *inode)
 {
 	return hlist_unhashed(&inode->i_hash);
@@ -877,6 +886,7 @@ struct file {
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
+	unsigned int		f_streamid;
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
@@ -908,6 +918,18 @@ struct file_handle {
 	unsigned char f_handle[0];
 };
 
+/*
+ * If the file doesn't have a stream ID set, return the inode stream ID
+ * in case that has been set.
+ */
+static inline unsigned int file_streamid(struct file *f)
+{
+	if (f->f_streamid)
+		return f->f_streamid;
+
+	return inode_streamid(f->f_inode);
+}
+
 static inline struct file *get_file(struct file *f)
 {
 	atomic_long_inc(&f->f_count);
-- 
2.4.1.168.g1ea28e1



* [PATCH 04/11] Add system call for setting inode/file write stream ID
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (2 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 03/11] Add support for per-file/inode stream ID Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 05/11] wire up system call for x86/x86-64 Jens Axboe
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

Single system call:

ssize_t streamid(int fd, int cmd, unsigned int flags, int streamid);

where 'fd' is a file descriptor to the file that we want to set the
write stream ID on. 'cmd' is one of the following:

STREAMID_OPEN	Open/allocate a new stream ID
STREAMID_CLOSE	Close/free a previously allocated stream ID
STREAMID_GET	Return the currently assigned stream ID

'flags' is a mask of one or more of the following:

STREAMID_F_INODE	Set stream ID on the inode
STREAMID_F_FILE		Set stream ID on the file

'streamid' is either 0, which means that streamid() will return the
first available stream ID, or it's set to some integer value between 1
and STREAMID_MAX (both inclusive) to ask for a specific stream ID value.

streamid() returns the allocated stream ID on success, or -1 and sets
errno appropriately. Possible error values:

-EINVAL		cmd/flags isn't valid
-ESPIPE		'fd' refers to a pipe
-EBADF		'fd' isn't valid
-EBUSY		'streamid' is already allocated/assigned

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/read_write.c           | 63 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/streamid.h  | 13 ++++++++++
 include/uapi/linux/Kbuild |  1 +
 3 files changed, 77 insertions(+)
 create mode 100644 include/linux/streamid.h

diff --git a/fs/read_write.c b/fs/read_write.c
index dadf24e5c95b..6dcae3eb7b0f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -18,6 +18,7 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/fs.h>
+#include <linux/streamid.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1668,3 +1669,65 @@ out:
 	return ret;
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range);
+
+SYSCALL_DEFINE4(streamid, int, fd, int, cmd,
+		unsigned int, flags, int, streamid)
+{
+	struct inode *inode;
+	ssize_t ret = 0;
+	struct fd f;
+
+	if (cmd != STREAMID_OPEN && cmd != STREAMID_CLOSE &&
+	    cmd != STREAMID_GET)
+		return -EINVAL;
+	if (flags & ~(STREAMID_F_INODE | STREAMID_F_FILE))
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	inode = file_inode(f.file);
+	if (S_ISFIFO(inode->i_mode)) {
+		ret = -ESPIPE;
+		goto done;
+	}
+
+	if (cmd == STREAMID_OPEN) {
+		if (flags & STREAMID_F_FILE) {
+			if (f.file->f_streamid) {
+				ret = -EBUSY;
+				goto done;
+			}
+			f.file->f_streamid = ret;
+		}
+		if (flags & STREAMID_F_INODE) {
+			spin_lock(&inode->i_lock);
+			if (inode_streamid(inode))
+				ret = -EBUSY;
+			else
+				inode->i_streamid = ret;
+			spin_unlock(&inode->i_lock);
+		}
+	} else if (cmd == STREAMID_CLOSE) {
+		if (f.file->f_streamid == streamid)
+			f.file->f_streamid = 0;
+		if (inode_streamid(inode) == streamid) {
+			spin_lock(&inode->i_lock);
+			inode->i_streamid = 0;
+			spin_unlock(&inode->i_lock);
+		}
+	} else if (cmd == STREAMID_GET) {
+		ret = 0;
+		if (flags & STREAMID_F_FILE)
+			ret = f.file->f_streamid;
+		if (!ret && (flags & STREAMID_F_INODE))
+			ret = inode_streamid(inode);
+		if (!(flags & (STREAMID_F_FILE | STREAMID_F_INODE)))
+			ret = file_streamid(f.file);
+	}
+
+done:
+	fdput(f);
+	return ret;
+}
diff --git a/include/linux/streamid.h b/include/linux/streamid.h
new file mode 100644
index 000000000000..ea1f8e7eb8a5
--- /dev/null
+++ b/include/linux/streamid.h
@@ -0,0 +1,13 @@
+#ifndef STREAMID_H
+#define STREAMID_H
+
+enum {
+	STREAMID_OPEN	= 1,		/* open new stream */
+	STREAMID_CLOSE	= 2,		/* close stream */
+	STREAMID_GET	= 3,		/* get file/inode stream ID */
+
+	STREAMID_F_INODE	= 1,	/* set streamid on the inode */
+	STREAMID_F_FILE		= 2,	/* set streamid on the file */
+};
+
+#endif
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index ebd10e624598..7d1540da0bb1 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -385,6 +385,7 @@ header-y += soundcard.h
 header-y += sound.h
 header-y += stat.h
 header-y += stddef.h
+header-y += streamid.h
 header-y += string.h
 header-y += suspend_ioctls.h
 header-y += swab.h
-- 
2.4.1.168.g1ea28e1



* [PATCH 05/11] wire up system call for x86/x86-64
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (3 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 04/11] Add system call for setting inode/file write " Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 06/11] Add support for bdi tracking of stream ID Jens Axboe
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index cb713df81180..1b3c43aa3a6d 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -384,3 +384,4 @@
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
 377	i386	copy_file_range		sys_copy_file_range
+378	i386	streamid		sys_streamid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index dc1040a50bdc..74edd268119c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -333,6 +333,7 @@
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
 326	common	copy_file_range		sys_copy_file_range
+327	64	streamid		sys_streamid
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.4.1.168.g1ea28e1



* [PATCH 06/11] Add support for bdi tracking of stream ID
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (4 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 05/11] wire up system call for x86/x86-64 Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 07/11] direct-io: add support for write stream IDs Jens Axboe
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

We can now allocate and free host-assigned stream IDs through the
backing device. Add a hook for virtual backing devices that map
to a set of real backing devices (btrfs, raid, etc).

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/read_write.c                  | 35 ++++++++++++++++++--------
 include/linux/backing-dev-defs.h |  7 ++++++
 include/linux/backing-dev.h      |  3 +++
 include/linux/streamid.h         |  4 +++
 mm/backing-dev.c                 | 53 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 6dcae3eb7b0f..7218d69b1763 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1674,7 +1674,8 @@ SYSCALL_DEFINE4(streamid, int, fd, int, cmd,
 		unsigned int, flags, int, streamid)
 {
 	struct inode *inode;
-	ssize_t ret = 0;
+	int alloc_id;
+	ssize_t ret;
 	struct fd f;
 
 	if (cmd != STREAMID_OPEN && cmd != STREAMID_CLOSE &&
@@ -1682,6 +1683,8 @@ SYSCALL_DEFINE4(streamid, int, fd, int, cmd,
 		return -EINVAL;
 	if (flags & ~(STREAMID_F_INODE | STREAMID_F_FILE))
 		return -EINVAL;
+	if (streamid > STREAMID_MAX)
+		return -EINVAL;
 
 	f = fdget(fd);
 	if (!f.file)
@@ -1693,11 +1696,26 @@ SYSCALL_DEFINE4(streamid, int, fd, int, cmd,
 		goto done;
 	}
 
+	if (cmd == STREAMID_GET) {
+		ret = 0;
+		if (flags & STREAMID_F_FILE)
+			ret = f.file->f_streamid;
+		if (!ret && (flags & STREAMID_F_INODE))
+			ret = inode_streamid(inode);
+		if (!(flags & (STREAMID_F_FILE | STREAMID_F_INODE)))
+			ret = file_streamid(f.file);
+		goto done;
+	}
+
+	alloc_id = ret = bdi_streamid(f.file->f_mapping->host, cmd, streamid);
+	if (ret < 0)
+		goto done;
+
 	if (cmd == STREAMID_OPEN) {
 		if (flags & STREAMID_F_FILE) {
 			if (f.file->f_streamid) {
 				ret = -EBUSY;
-				goto done;
+				goto error_close;
 			}
 			f.file->f_streamid = ret;
 		}
@@ -1708,6 +1726,8 @@ SYSCALL_DEFINE4(streamid, int, fd, int, cmd,
 			else
 				inode->i_streamid = ret;
 			spin_unlock(&inode->i_lock);
+			if (ret < 0)
+				goto error_close;
 		}
 	} else if (cmd == STREAMID_CLOSE) {
 		if (f.file->f_streamid == streamid)
@@ -1717,17 +1737,12 @@ SYSCALL_DEFINE4(streamid, int, fd, int, cmd,
 			inode->i_streamid = 0;
 			spin_unlock(&inode->i_lock);
 		}
-	} else if (cmd == STREAMID_GET) {
-		ret = 0;
-		if (flags & STREAMID_F_FILE)
-			ret = f.file->f_streamid;
-		if (!ret && (flags & STREAMID_F_INODE))
-			ret = inode_streamid(inode);
-		if (!(flags & (STREAMID_F_FILE | STREAMID_F_INODE)))
-			ret = file_streamid(f.file);
 	}
 
 done:
 	fdput(f);
 	return ret;
+error_close:
+	bdi_streamid(f.file->f_mapping->host, STREAMID_CLOSE, alloc_id);
+	goto done;
 }
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1b4d69f68c33..8c1914b15cd0 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -10,6 +10,7 @@
 #include <linux/flex_proportions.h>
 #include <linux/timer.h>
 #include <linux/workqueue.h>
+#include <linux/idr.h>
 
 struct page;
 struct device;
@@ -30,6 +31,7 @@ enum wb_congested_state {
 };
 
 typedef int (congested_fn)(void *, int);
+typedef int (streamid_fn)(void *, unsigned int);
 
 enum wb_stat_item {
 	WB_RECLAIMABLE,
@@ -170,6 +172,11 @@ struct backing_dev_info {
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
 #endif
+
+	struct ida stream_ids;
+	streamid_fn *streamid_open;
+	streamid_fn *streamid_close;
+	void *streamid_data;
 };
 
 enum {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c82794f20110..b20e33b71071 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -514,4 +514,7 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << WB_async_congested));
 }
 
+int bdi_streamid_open(struct backing_dev_info *bdi, unsigned int id);
+int bdi_streamid_close(struct backing_dev_info *bdi, unsigned int id);
+
 #endif	/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/streamid.h b/include/linux/streamid.h
index ea1f8e7eb8a5..c7ba4770a792 100644
--- a/include/linux/streamid.h
+++ b/include/linux/streamid.h
@@ -6,8 +6,12 @@ enum {
 	STREAMID_CLOSE	= 2,		/* close stream */
 	STREAMID_GET	= 3,		/* get file/inode stream ID */
 
+	STREAMID_MAX	= 65535,
+
 	STREAMID_F_INODE	= 1,	/* set streamid on the inode */
 	STREAMID_F_FILE		= 2,	/* set streamid on the file */
 };
 
+ssize_t bdi_streamid(struct inode *inode, int cmd, int streamid);
+
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c554d173a65f..2ed97181ad49 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -10,6 +10,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include <linux/streamid.h>
 #include <trace/events/writeback.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
@@ -809,6 +810,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 
 	bdi_debug_register(bdi, dev_name(dev));
 	set_bit(WB_registered, &bdi->wb.state);
+	ida_init(&bdi->stream_ids);
 
 	spin_lock_bh(&bdi_lock);
 	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
@@ -843,6 +845,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 	bdi_remove_from_list(bdi);
 	wb_shutdown(&bdi->wb);
 	cgwb_bdi_destroy(bdi);
+	ida_destroy(&bdi->stream_ids);
 
 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
@@ -1014,6 +1017,56 @@ out:
 }
 EXPORT_SYMBOL(wait_iff_congested);
 
+/*
+ * Get a new stream ID for backing device 'bdi'. If 'id' is zero, then get
+ * any new ID. If 'id' is non-zero, attempt to get that specific ID. If
+ * the bdi has a streamid_fn assigned, use that.
+ */
+int bdi_streamid_open(struct backing_dev_info *bdi, unsigned int id)
+{
+	if (bdi->streamid_open)
+		return bdi->streamid_open(bdi->streamid_data, id);
+
+	if (id)
+		return ida_simple_get(&bdi->stream_ids, id, id + 1, GFP_NOIO);
+
+	return ida_simple_get(&bdi->stream_ids, 1, STREAMID_MAX, GFP_NOIO);
+}
+EXPORT_SYMBOL(bdi_streamid_open);
+
+/*
+ * Close stream ID 'id' on bdi 'bdi'.
+ */
+int bdi_streamid_close(struct backing_dev_info *bdi, unsigned int id)
+{
+	if (bdi->streamid_close)
+		return bdi->streamid_close(bdi->streamid_data, id);
+
+	/*
+	 * If we don't have a specific open, free the ID we allocated
+	 */
+	if (!bdi->streamid_open)
+		return ida_simple_remove(&bdi->stream_ids, id);
+
+	return 0;
+}
+EXPORT_SYMBOL(bdi_streamid_close);
+
+ssize_t bdi_streamid(struct inode *inode, int cmd, int streamid)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	if (bdi == &noop_backing_dev_info)
+		return -ENXIO;
+
+	if (cmd == STREAMID_OPEN)
+		return bdi_streamid_open(bdi, streamid);
+	else if (cmd == STREAMID_CLOSE)
+		return bdi_streamid_close(bdi, streamid);
+
+	return -EINVAL;
+}
+
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 			void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-- 
2.4.1.168.g1ea28e1



* [PATCH 07/11] direct-io: add support for write stream IDs
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (5 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 06/11] Add support for bdi tracking of stream ID Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 08/11] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

Get the streamid from the file, if any, and set it on the bio.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/direct-io.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index d6a9012d42ad..25d6fc191eca 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -76,6 +76,7 @@ struct dio_submit {
 	int reap_counter;		/* rate limit reaping */
 	sector_t final_block_in_request;/* doesn't change */
 	int boundary;			/* prev block is at a boundary */
+	int streamid;			/* Write stream ID */
 	get_block_t *get_block;		/* block mapping function */
 	dio_submit_t *submit_io;	/* IO submition function */
 
@@ -376,6 +377,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
+
+	bio_set_streamid(bio, sdio->streamid);
 }
 
 /*
@@ -1224,6 +1227,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	sdio.blkbits = blkbits;
 	sdio.blkfactor = i_blkbits - blkbits;
 	sdio.block_in_file = offset >> blkbits;
+	sdio.streamid = file_streamid(iocb->ki_filp);
 
 	sdio.get_block = get_block;
 	dio->end_io = end_io;
-- 
2.4.1.168.g1ea28e1



* [PATCH 08/11] Add stream ID support for buffered mpage/__block_write_full_page()
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (6 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 07/11] direct-io: add support for write stream IDs Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 09/11] btrfs: add support for write stream IDs Jens Axboe
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

Pass on the inode stream ID to the bio allocation.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/buffer.c | 8 ++++++--
 fs/mpage.c  | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e1632abb4ca9..b7b5cbae7482 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1788,7 +1788,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh_wbc(write_op, bh, 0, wbc);
+			submit_bh_wbc(write_op, bh,
+				      streamid_to_flags(inode_streamid(inode)),
+				      wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -1842,7 +1844,9 @@ recover:
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh_wbc(write_op, bh, 0, wbc);
+			submit_bh_wbc(write_op, bh,
+				      streamid_to_flags(inode_streamid(inode)),
+				      wbc);
 			nr_underway++;
 		}
 		bh = next;
diff --git a/fs/mpage.c b/fs/mpage.c
index 1480d3a18037..8b093ab150df 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -281,6 +281,7 @@ alloc_new:
 				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
 		if (bio == NULL)
 			goto confused;
+		bio_set_streamid(bio, inode_streamid(inode));
 	}
 
 	length = first_hole << blkbits;
-- 
2.4.1.168.g1ea28e1



* [PATCH 09/11] btrfs: add support for write stream IDs
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (7 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 08/11] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 20:44   ` Chris Mason
  2016-03-04 16:10 ` [PATCH 10/11] xfs: add support for buffered writeback stream ID Jens Axboe
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: calvinowens, hch, adilger, Jens Axboe

Both buffered and O_DIRECT writes are supported, along with support for
iterating over the underlying backing devices for open/close of streams.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/btrfs/disk-io.c   | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.c |  1 +
 fs/btrfs/inode.c     |  1 +
 3 files changed, 71 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e2ad45..2fd3b4a6bdcd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1748,6 +1748,72 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 	return ret;
 }
 
+static int btrfs_streamid_close(struct btrfs_fs_info *info, int id)
+{
+	struct btrfs_device *device;
+
+	mutex_lock(&info->fs_devices->device_list_mutex);
+	list_for_each_entry_rcu(device, &info->fs_devices->devices, dev_list) {
+		struct backing_dev_info *bdi;
+
+		if (!device->bdev)
+			continue;
+
+		bdi = blk_get_backing_dev_info(device->bdev);
+		bdi_streamid_close(bdi, id);
+	}
+	mutex_unlock(&info->fs_devices->device_list_mutex);
+
+	return 0;
+}
+
+static int btrfs_streamid_open(struct btrfs_fs_info *info, int id)
+{
+	struct btrfs_device *device;
+	int ret = -EINVAL;
+
+	mutex_lock(&info->fs_devices->device_list_mutex);
+	list_for_each_entry_rcu(device, &info->fs_devices->devices, dev_list) {
+		struct backing_dev_info *bdi;
+
+		if (!device->bdev)
+			continue;
+
+		bdi = blk_get_backing_dev_info(device->bdev);
+		ret = bdi_streamid_open(bdi, id);
+		if (ret < 0)
+			break;
+	}
+	mutex_unlock(&info->fs_devices->device_list_mutex);
+
+	return ret;
+}
+
+static int btrfs_streamid_open_fn(void *data, unsigned int id)
+{
+	struct btrfs_fs_info *info = (struct btrfs_fs_info *) data;
+	int ret = 0;
+
+	/*
+	 * > 0 is success, return it. If we fail, fall through to
+	 * freeing the ID, if we did set it on a device.
+	 */
+	ret = btrfs_streamid_open(info, id);
+	if (ret > 0)
+		return ret;
+
+	btrfs_streamid_close(info, id);
+	return ret;
+}
+
+static int btrfs_streamid_close_fn(void *data, unsigned int id)
+{
+	struct btrfs_fs_info *info = (struct btrfs_fs_info *) data;
+
+	btrfs_streamid_close(info, id);
+	return 0;
+}
+
 static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
 {
 	int err;
@@ -1760,6 +1826,9 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
 	bdi->congested_fn	= btrfs_congested_fn;
 	bdi->congested_data	= info;
 	bdi->capabilities |= BDI_CAP_CGROUP_WRITEBACK;
+	bdi->streamid_open	= btrfs_streamid_open_fn;
+	bdi->streamid_close	= btrfs_streamid_close_fn;
+	bdi->streamid_data	= info;
 	return 0;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 392592dc7010..0f1507af8266 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2808,6 +2808,7 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
 	bio_add_page(bio, page, page_size, offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
+	bio_set_streamid(bio, inode_streamid(page->mapping->host));
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
 		wbc_account_io(wbc, page, page_size);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d96f5cf38a2d..77661a1f9e5e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8338,6 +8338,7 @@ static void btrfs_submit_direct(int rw, struct bio *dio_bio,
 	atomic_set(&dip->pending_bios, 0);
 	btrfs_bio = btrfs_io_bio(io_bio);
 	btrfs_bio->logical = file_offset;
+	bio_set_streamid(io_bio, bio_get_streamid(dio_bio));
 
 	if (write) {
 		io_bio->bi_end_io = btrfs_endio_direct_write;
-- 
2.4.1.168.g1ea28e1



* [PATCH 10/11] xfs: add support for buffered writeback stream ID
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (8 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 09/11] btrfs: add support for write stream IDs Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 16:10 ` [PATCH 11/11] ext4: add support for write stream IDs Jens Axboe
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block
  Cc: calvinowens, hch, adilger, Jens Axboe, Dave Chinner

Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/xfs/xfs_aops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a9ebabfe7587..9a35e61b51c6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -382,6 +382,7 @@ xfs_submit_ioend_bio(
 	atomic_inc(&ioend->io_remaining);
 	bio->bi_private = ioend;
 	bio->bi_end_io = xfs_end_bio;
+	bio_set_streamid(bio, ioend->io_inode->i_streamid);
 	submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
 }
 
-- 
2.4.1.168.g1ea28e1



* [PATCH 11/11] ext4: add support for write stream IDs
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (9 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 10/11] xfs: add support for buffered writeback stream ID Jens Axboe
@ 2016-03-04 16:10 ` Jens Axboe
  2016-03-04 19:42 ` [PATCH 0/11] Update version of write stream ID patchset Jeff Moyer
  2016-03-06  6:13 ` Andreas Dilger
  12 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 16:10 UTC (permalink / raw)
  To: linux-fsdevel, linux-block
  Cc: calvinowens, hch, adilger, Jens Axboe, Theodore Ts'o

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/ext4/page-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 090b3498638e..1973a5db56ba 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -403,6 +403,7 @@ submit_and_retry:
 		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;
+		bio_set_streamid(io->io_bio, inode_streamid(inode));
 	}
 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
-- 
2.4.1.168.g1ea28e1



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (10 preceding siblings ...)
  2016-03-04 16:10 ` [PATCH 11/11] ext4: add support for write stream IDs Jens Axboe
@ 2016-03-04 19:42 ` Jeff Moyer
  2016-03-04 20:34   ` Jens Axboe
  2016-03-06  6:13 ` Andreas Dilger
  12 siblings, 1 reply; 36+ messages in thread
From: Jeff Moyer @ 2016-03-04 19:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger,
	Martin K. Petersen

Jens Axboe <axboe@fb.com> writes:

> It's been a while since I last posted the write stream ID patchset, but
> here is an updated version.
>
> The original patchset was centered around the current NVMe streams
> proposal, but there were a number of issues with that. It's now in a
> much better state, and hopefully will make it into 1.3 of the spec
> soon.

But the spec is still not public.  The only documentation I can find on
this stuff is from t10, dated May of last year.

> To quickly re-summarize the intent behind write stream IDs, it's to
> be able to provide a hint to the underlying storage device on what
> writes could feasibly be grouped together. If the device is able to
> group writes of similar life times on media, then we can greatly reduce
> the amount of data that needs to be copied around at garbage collection
> time. This gives us a better write amplification factor, which leads
> to better device life times and better (and more predictable)
> performance at steady state.
>
> There's been a number of changes to this patchset since it was last
> posted. In summary:
>
> 1) The bio parts have been bumped to carry 16 bits of stream data, up
>    from 8 and 12 in the original series.
>
> 2) Since the interface grew some more options, I've moved away from
>    fadvise and instead added a new system call. I don't feel strongly
>    about what interface we use here, another option would be to have a
>    (big) set of fcntl() commands instead.
>
> 3) The kernel now manages the ID space, since we have moved to a host
>    assigned model. This is done on a backing_dev_info basis, and the
>    btrfs patch has been updated to show how this can be used for nested
>    devices on btrfs/md/dm/etc. This could be moved to the request queue
>    as well, again I don't feel too strongly about this specific part.
>
> Those are the big changes.

My main question is why expose this to userspace at all?  If we're
keeping track of write streams per file, then why not implement that in
the kernel, transparent to the application?  That would benefit all
applications instead of requiring application developers to opt in.

I'm sure your argument will have something to do with how stream id's
are allocated/freed (expensive/slow, limited resource, whatever), but
that really just gets back to Martin's original questions about what we
should expect from the hardware and what the programming model should
look like (questions that are, afaik, still open).

I'm not against write streams, I think it's a neat idea.  I just think
it will die on the vine if you require application developers to opt
in.  Not all storage is SSDs, and I don't like that SSDs now have to be
treated differently by the programmer.

Cheers,
Jeff


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 19:42 ` [PATCH 0/11] Update version of write stream ID patchset Jeff Moyer
@ 2016-03-04 20:34   ` Jens Axboe
  2016-03-04 21:01     ` Jeff Moyer
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 20:34 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger,
	Martin K. Petersen

On 03/04/2016 12:42 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@fb.com> writes:
>
>> It's been a while since I last posted the write stream ID patchset, but
>> here is an updated version.
>>
>> The original patchset was centered around the current NVMe streams
>> proposal, but there were a number of issues with that. It's now in a
>> much better state, and hopefully will make it into 1.3 of the spec
>> soon.
>
> But the spec is still not public.  The only documentation I can find on
> this stuff is from t10, dated May of last year.

That is correct, but the important changes are basically in the cover 
letter that I wrote for the patchset :-)

>> To quickly re-summarize the intent behind write stream IDs, it's to
>> be able to provide a hint to the underlying storage device on what
>> writes could feasibly be grouped together. If the device is able to
>> group writes of similar life times on media, then we can greatly reduce
>> the amount of data that needs to be copied around at garbage collection
>> time. This gives us a better write amplification factor, which leads
>> to better device life times and better (and more predictable)
>> performance at steady state.
>>
>> There's been a number of changes to this patchset since it was last
>> posted. In summary:
>>
>> 1) The bio parts have been bumped to carry 16 bits of stream data, up
>>     from 8 and 12 in the original series.
>>
>> 2) Since the interface grew some more options, I've moved away from
>>     fadvise and instead added a new system call. I don't feel strongly
>>     about what interface we use here, another option would be to have a
>>     (big) set of fcntl() commands instead.
>>
>> 3) The kernel now manages the ID space, since we have moved to a host
>>     assigned model. This is done on a backing_dev_info basis, and the
>>     btrfs patch has been updated to show how this can be used for nested
>>     devices on btrfs/md/dm/etc. This could be moved to the request queue
>>     as well, again I don't feel too strongly about this specific part.
>>
>> Those are the big changes.
>
> My main question is why expose this to userspace at all?  If we're
> keeping track of write streams per file, then why not implement that in
> the kernel, transparent to the application?  That would benefit all
> applications instead of requiring application developers to opt in.

Because lots of different files could share the same write ID. It's not 
like we're going to have millions of streams available, you have to 
group them more wisely. Unless the policy is one-stream-per-file always, 
then we can't put that sort of thing in the kernel. The kernel has no 
way of knowing.

> I'm sure your argument will have something to do with how stream id's
> are allocated/freed (expensive/slow, limited resource, whatever), but
> that really just gets back to Martin's original questions about what we
> should expect from the hardware and what the programming model should
> look like (questions that are, afaik, still open).

That's orthogonal, really. The open/close might be expensive, or it 
might not be; it has no real bearing on how you assign specific writes 
to specific stream IDs.

> I'm not against write streams, I think it's a neat idea.  I just think
> it will die on the vine if you require application developers to opt
> in.  Not all storage is SSDs, and I don't like that SSDs now have to be
> treated differently by the programmer.

But that's why it's kept really simple. There are people that want to 
make this more involved, and tie QoS criteria to streams. My argument 
there has been what you are saying: it will never be used or get 
adopted. For streams in general, the wins are big enough that 
applications will care. And it's not difficult to use at all...

It's not just SSDs, either. Could be used for tiered storage in general. 
That would mostly require going a bit further and assigning performance 
characteristics to specific stream IDs, but there's nothing preventing 
that from being done down the road. For now, this is just a basic 
interface with a kernel managed stream ID space attached.

-- 
Jens Axboe



* Re: [PATCH 03/11] Add support for per-file/inode stream ID
       [not found]   ` <CAJVOszBXU-qQENcOGG8pWeARwoWL2G3gNJ0H2uNPjXkiVa8S+Q@mail.gmail.com>
@ 2016-03-04 20:35     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 20:35 UTC (permalink / raw)
  To: Shaun Tancheff; +Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger

On 03/04/2016 11:53 AM, Shaun Tancheff wrote:
> What do you think of falling back to i_ino for applications which don't
> make use of setting stream_id yet?
>
> Something like this (or using a crc16 instead of the xor hack)?
>
> static inline unsigned int inode_streamid(struct inode *inode)
> {
> 	if (inode) {
> 		if (inode->i_streamid)
> 			return inode->i_streamid;
> 		return ((inode->i_ino >> 16) ^ inode->i_ino) & 0xFFFF;
> 	}
> 	return 0;
> }

If the device supports a high enough number of concurrently open 
streams, that might work fine. But if the mask ends up being 0x7, then I 
think it'll just introduce randomness into the whole thing.

But it's something that could be experimented with, certainly.

-- 
Jens Axboe



* Re: [PATCH 09/11] btrfs: add support for write stream IDs
  2016-03-04 16:10 ` [PATCH 09/11] btrfs: add support for write stream IDs Jens Axboe
@ 2016-03-04 20:44   ` Chris Mason
  2016-03-04 20:45     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Chris Mason @ 2016-03-04 20:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger

On Fri, Mar 04, 2016 at 09:10:51AM -0700, Jens Axboe wrote:
> Both buffered and O_DIRECT writes are supported, along with support for
> iterating over the underlying backing devices for open/close of streams.

Looks good to me Jens, we'll just want to double check the dio streamid
is getting passed all the way down through the btrfs maze of dio
mirrors.

Signed-off-by: Chris Mason <clm@fb.com>


* Re: [PATCH 09/11] btrfs: add support for write stream IDs
  2016-03-04 20:44   ` Chris Mason
@ 2016-03-04 20:45     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 20:45 UTC (permalink / raw)
  To: Chris Mason, linux-fsdevel, linux-block, calvinowens, hch, adilger

On 03/04/2016 01:44 PM, Chris Mason wrote:
> On Fri, Mar 04, 2016 at 09:10:51AM -0700, Jens Axboe wrote:
>> Both buffered and O_DIRECT writes are supported, along with support for
>> iterating over the underlying backing devices for open/close of streams.
>
> Looks good to me Jens, we'll just want to double check the dio streamid
> is getting passed all the way down through the btrfs maze of dio
> mirrors.

Thanks, yes once we have the blktrace bits cleaned up and done, we can 
run that full stack test and verify that it works as intended.


-- 
Jens Axboe



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 20:34   ` Jens Axboe
@ 2016-03-04 21:01     ` Jeff Moyer
  2016-03-04 21:06       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff Moyer @ 2016-03-04 21:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger,
	Martin K. Petersen

Jens Axboe <axboe@fb.com> writes:

> On 03/04/2016 12:42 PM, Jeff Moyer wrote:
>> My main question is why expose this to userspace at all?  If we're
>> keeping track of write streams per file, then why not implement that in
>> the kernel, transparent to the application?  That would benefit all
>> applications instead of requiring application developers to opt in.
>
> Because lots of different files could share the same write ID.

Do you mean that the application will want to have multiple files that
share the same write ID, or that there will be collisions due to the
small ID space (I think the former)?  There's no way to avoid the
latter, of course.  For the former, I agree that encoding a per-file
policy in the kernel could be limiting.

> It's not like we're going to have millions of streams available, you
> have to group them more wisely. Unless the policy is
> one-stream-per-file always, then we can't put that sort of thing in
> the kernel. The kernel has no way of knowing.

I know that hard-coding a one stream per file (or directory) scheme
wouldn't be the most ideal situation, but I wonder if it covers 90% of
the use cases without requiring application involvement.  Some numbers
here would be very useful in supporting one scheme versus another.

>> I'm sure your argument will have something to do with how stream id's
>> are allocated/freed (expensive/slow, limited resource, whatever), but
>> that really just gets back to Martin's original questions about what we
>> should expect from the hardware and what the programming model should
>> look like (questions that are, afaik, still open).
>
> That's orthogonal, really. The open/close might be expensive, or it
> might not be; it has no real bearing on how you assign specific writes
> to specific stream IDs.

It may be important if you plan to open and close the streams often.
Again, it's not clear to me what the hardware supports or requires in
this area, so I'm not sure if it's a relevant question.  -ENOSPEC.  :)

>> I'm not against write streams, I think it's a neat idea.  I just think
>> it will die on the vine if you require application developers to opt
>> in.  Not all storage is SSDs, and I don't like that SSDs now have to be
>> treated differently by the programmer.
>
> But that's why it's kept really simple. There are people that want to
> make this more involved, and tie QoS criteria to streams. My argument
> there has been what you are saying: it will never be used or get
> adopted. For streams in general, the wins are big enough that
> applications will care. And it's not difficult to use at all...

I'm not convinced applications will care.  :)  How many developers will
even know that they should use this interface?

> It's not just SSDs, either. Could be used for tiered storage in
> general. That would mostly require going a bit further and assigning
> performance characteristics to specific stream IDs, but there's
> nothing preventing that from being done down the road. For now, this
> is just a basic interface with a kernel managed stream ID space
> attached.

OK.  I'm still of the opinion that we should try to make this
transparent.  I could be swayed by workload descriptions and numbers
comparing approaches, though.

Cheers,
Jeff


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 21:01     ` Jeff Moyer
@ 2016-03-04 21:06       ` Jens Axboe
  2016-03-04 22:03         ` Jeff Moyer
  2016-03-05 20:48         ` Martin K. Petersen
  0 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 21:06 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger,
	Martin K. Petersen

On 03/04/2016 02:01 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@fb.com> writes:
>
>> On 03/04/2016 12:42 PM, Jeff Moyer wrote:
>>> My main question is why expose this to userspace at all?  If we're
>>> keeping track of write streams per file, then why not implement that in
>>> the kernel, transparent to the application?  That would benefit all
>>> applications instead of requiring application developers to opt in.
>>
>> Because lots of different files could share the same write ID.
>
> Do you mean that the application will want to have multiple files that
> share the same write ID, or that there will be collisions due to the
> small ID space (I think the former)?  There's no way to avoid the
> latter, of course.  For the former, I agree that encoding a per-file
> policy in the kernel could be limiting.

The former - and it won't just be limiting, it'll be mostly useless...

>> It's not like we're going to have millions of streams available, you
>> have to group them more wisely. Unless the policy is
>> one-stream-per-file always, then we can't put that sort of thing in
>> the kernel. The kernel has no way of knowing.
>
> I know that hard-coding a one stream per file (or directory) scheme
> wouldn't be the most ideal situation, but I wonder if it covers 90% of
> the use cases without requiring application involvement.  Some numbers
> here would be very useful in supporting one scheme versus another.

It'd be even harder to convince applications to change their file 
layout, and it'd be a more involved change than simply opting in via 
a system call to set a stream ID when you open a file. It's really not a 
hard change at all.

>>> I'm sure your argument will have something to do with how stream id's
>>> are allocated/freed (expensive/slow, limited resource, whatever), but
>>> that really just gets back to Martin's original questions about what we
>>> should expect from the hardware and what the programming model should
>>> look like (questions that are, afaik, still open).
>>
>> That's orthogonal, really. The open/close might be expensive, or it
>> might not be; it has no real bearing on how you assign specific writes
>> to specific stream IDs.
>
> It may be important if you plan to open and close the streams often.

Of course. I'm saying that whether it's expensive or not, it has no 
impact on that part of the discussion.

> Again, it's not clear to me what the hardware supports or requires in
> this area, so I'm not sure if it's a relevant question.  -ENOSPEC.  :)
>
>>> I'm not against write streams, I think it's a neat idea.  I just think
>>> it will die on the vine if you require application developers to opt
>>> in.  Not all storage is SSDs, and I don't like that SSDs now have to be
>>> treated differently by the programmer.
>>
>> But that's why it's kept really simple. There are people that want to
>> make this more involved, and tie QoS criteria to streams. My argument
>> there has been what you are saying: it will never be used or get
>> adopted. For streams in general, the wins are big enough that
>> applications will care. And it's not difficult to use at all...
>
> I'm not convinced applications will care.  :)  How many developers will
> even know that they should use this interface?

You have to market it a bit, obviously. But if you consume flash 
devices, then you WILL be interested in cutting your write amplification 
down a lot. Especially if you have tons of them.

So applications might not care, but the people that use them certainly will. 
And most people care about performance, too.

What I'm saying is that I disagree wildly. There's enough of a carrot 
here for people to care.

>> It's not just SSDs, either. Could be used for tiered storage in
>> general. That would mostly require going a bit further and assigning
>> performance characteristics to specific stream IDs, but there's
>> nothing preventing that from being done down the road. For now, this
>> is just a basic interface with a kernel managed stream ID space
>> attached.
>
> OK.  I'm still of the opinion that we should try to make this
> transparent.  I could be swayed by workload descriptions and numbers
> comparing approaches, though.

You can't just wave that flag and not have a solution. Any solution in 
that space would imply having policy in the kernel. A "just use a stream 
per file" is never going to work.

-- 
Jens Axboe



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 21:06       ` Jens Axboe
@ 2016-03-04 22:03         ` Jeff Moyer
  2016-03-04 22:13           ` Jens Axboe
  2016-03-05 20:48         ` Martin K. Petersen
  1 sibling, 1 reply; 36+ messages in thread
From: Jeff Moyer @ 2016-03-04 22:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger,
	Martin K. Petersen

Jens Axboe <axboe@fb.com> writes:

> On 03/04/2016 02:01 PM, Jeff Moyer wrote:
>> OK.  I'm still of the opinion that we should try to make this
>> transparent.  I could be swayed by workload descriptions and numbers
>> comparing approaches, though.
>
> You can't just wave that flag and not have a solution. Any solution
> in that space would imply having policy in the kernel. A "just use a
> stream per file" is never going to work.

Jens, I'm obviously missing a lot of the background information, here.
I want to stress that I'm not against your patches. I'm just trying to
understand if there's a sensible way to use the write stream support in
the kernel so that applications don't /have/ to be converted.  It sounds
like that's hard, and without any specs or hardware, I'm not going to be
able to even try to come up with solutions to that problem.  I think it
would make for interesting research, though.  I recall a paper from one
of the USENIX conferences that dealt with automatically identifying
write streams on a network storage server, but alas, I can't find the
reference right now.

Anyway, thanks for taking the time to reply.

Cheers,
Jeff


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 22:03         ` Jeff Moyer
@ 2016-03-04 22:13           ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2016-03-04 22:13 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-fsdevel, linux-block, calvinowens, hch, adilger,
	Martin K. Petersen

On 03/04/2016 03:03 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@fb.com> writes:
>
>> On 03/04/2016 02:01 PM, Jeff Moyer wrote:
>>> OK.  I'm still of the opinion that we should try to make this
>>> transparent.  I could be swayed by workload descriptions and numbers
>>> comparing approaches, though.
>>
>> You can't just wave that flag and not have a solution. Any solution
>> in that space would imply having policy in the kernel. A "just use a
>> stream per file" is never going to work.
>
> Jens, I'm obviously missing a lot of the background information, here.
> I want to stress that I'm not against your patches. I'm just trying to
> understand if there's a sensible way to use the write stream support in
> the kernel so that applications don't /have/ to be converted.  It sounds
> like that's hard, and without any specs or hardware, I'm not going to be
> able to even try to come up with solutions to that problem.

It's not hard to update an application to do this. As an example, one 
thing I tried was converting RocksDB to use streams. A naive approach 
was used, where we simply mapped each compaction level to a specific 
stream, and got about a 30% reduction in WA just through that. The guys 
from Samsung have done that with RocksDB as well, just a bit more 
involved, and got better results. The application change was really no 
more involved than calling fadvise() on the fd after opening it. That is 
it. I don't know why you think that is hard.

As to doing this automagically, you'll need knowledge that you do not 
have. The kernel or file system has no idea if data written to file X 
and file Y have similar life times. You could start tracking that, of 
course, but that would make you very unhappy. If I'm an application 
storing files, I have a much better idea of what is related time wise.

And you don't really need a spec to understand how this works, the spec 
will just tell you the mechanics of how we pass this information to the 
device, how we find out what the device can support, etc. The basic gist 
of it is that we can write data with similar life times to the right 
place on media. For a flash disk, that would be the same EB.

> I think it
> would make for interesting research, though.  I recall a paper from one
> of the USENIX conferences that dealt with automatically identifying
> write streams on a network storage server, but alas, I can't find the
> reference right now.

Samsung released a paper on RocksDB and streams, iirc.


-- 
Jens Axboe



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 21:06       ` Jens Axboe
  2016-03-04 22:03         ` Jeff Moyer
@ 2016-03-05 20:48         ` Martin K. Petersen
  2016-03-08 21:56           ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2016-03-05 20:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Moyer, linux-fsdevel, linux-block, calvinowens, hch,
	adilger, Martin K. Petersen

>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:

Jens,

>> OK.  I'm still of the opinion that we should try to make this
>> transparent.  I could be swayed by workload descriptions and numbers
>> comparing approaches, though.

Jens> You can't just waive that flag and not have a solution. Any
Jens> solution in that space would imply having policy in the kernel. A
Jens> "just use a stream per file" is never going to work.

I totally understand the desire to have explicit, long-lived
"from-file-open to file-close" streams for things like database journals
and whatnot.

However, I think that you are dismissing the benefits of being able to
group I/Os to disjoint LBA ranges within a brief period of time as
belonging to a single file. It's something that we know works well on
other types of storage. And it's also a much better heuristic for data
placement on SSDs than just picking the next available bucket. It does
require some pipelining on the drive but they will need some front end
logic to handle the proposed stream ID separation in any case.

Also, in our experiments we essentially got the explicit stream ID for
free by virtue of the journal being written often enough that it was
rarely if ever evicted as an active stream by the device. With no
changes whatsoever to any application.

My gripe with the current stuff is the same as before: The protocol is
squarely aimed at papering over issues with current flash technology. It
kinda-sorta works for other types of devices but it is very limiting. I
appreciate that it is a great fit for the "handful of apps sharing a
COTS NVMe drive on a cloud server" use case. But I think it is horrible
for NVMe over Fabrics and pretty much everything else. That wouldn't be
a big deal if the traditional storage models were going away. But I
don't think they are...

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
                   ` (11 preceding siblings ...)
  2016-03-04 19:42 ` [PATCH 0/11] Update version of write stream ID patchset Jeff Moyer
@ 2016-03-06  6:13 ` Andreas Dilger
  2016-03-06 13:03   ` Martin K. Petersen
  12 siblings, 1 reply; 36+ messages in thread
From: Andreas Dilger @ 2016-03-06  6:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, calvinowens, hch


On Mar 4, 2016, at 9:10 AM, Jens Axboe <axboe@fb.com> wrote:
> 
> It's been a while since I last posted the write stream ID patchset, but
> here is an updated version.
> 
> The original patchset was centered around the current NVMe streams
> proposal, but there were a number of issues with that. It's now in a
> much better state, and hopefully will make it into 1.3 of the spec
> soon.
> 
> To quickly re-summarize the intent behind write stream IDs, it's to
> be able to provide a hint to the underlying storage device on what
> writes could feasibly be grouped together. If the device is able to
> group writes of similar life times on media, then we can greatly reduce
> the amount of data that needs to be copied around at garbage collection
> time. This gives us a better write amplification factor, which leads
> to better device life times and better (and more predictable)
> performance at steady state.

What are your thoughts on reserving a small number of the stream ID values
for filesystem metadata (e.g. the first 31 since 0 == unused)?  One of the
requests by several people at FAST was to be able to identify filesystem
metadata at the block layer for a variety of reasons (e.g. blktrace).
I believe Ted said that Google is doing something similar for IO analysis.

For example, something like the following:

enum {
       SID_UNSET           = 0,
       SID_SUPERBLOCK      = 1,
       SID_ALLOCATIONGROUP = 2,
       SID_BLOCK_BITMAP    = 3,
       SID_INODE_BITMAP    = 4,
       SID_INODE           = 5,
       SID_INTERNAL_TREE   = 6,
       SID_DIRECTORY       = 7,
       SID_JOURNAL         = 8,
       SID_EXTENT          = 9,
       SID_XATTR           = 10,
       SID_DATA_FILE_4KB   = 11,
       SID_DATA_FILE_16KB  = 12,
       SID_DATA_FILE_64KB  = 13,
       SID_DATA_FILE_256KB = 14,
       SID_DATA_FILE_1MB   = 15,
       SID_DATA_FILE_4MB   = 16,
       SID_DATA_FILE_16MB  = 17,
       SID_DATA_FILE_64MB  = 18,
       SID_DATA_FILE_256MB = 19,
       SID_DATA_FILE_1GB   = 20,
       SID_DATA_FILE_LARGE = 21,
       SID_DATA_DIRECT     = 22,
       SID_LAST
};

though it would need to be expanded somewhat to include generic metadata
types from other filesystems.

Cheers, Andreas

> There's been a number of changes to this patchset since it was last
> posted. In summary:
> 
> 1) The bio parts have been bumped to carry 16 bits of stream data, up
>   from 8 and 12 in the original series.
> 
> 2) Since the interface grew some more options, I've moved away from
>   fadvise and instead added a new system call. I don't feel strongly
>   about what interface we use here, another option would be to have a
>   (big) set of fcntl() commands instead.
> 
> 3) The kernel now manages the ID space, since we have moved to a host
>   assigned model. This is done on a backing_dev_info basis, and the
>   btrfs patch has been updated to show how this can be used for nested
>   devices on btrfs/md/dm/etc. This could be moved to the request queue
>   as well, again I don't feel too strongly aboout this specific part.
> 
> Those are the big changes.
> 
> The patches are against Linus' current -git tip.
> 
> --
> Jens Axboe
> 
> 




* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-06  6:13 ` Andreas Dilger
@ 2016-03-06 13:03   ` Martin K. Petersen
  2016-03-06 16:08     ` Boaz Harrosh
  2016-03-06 22:42     ` Andreas Dilger
  0 siblings, 2 replies; 36+ messages in thread
From: Martin K. Petersen @ 2016-03-06 13:03 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jens Axboe, linux-fsdevel, linux-block, calvinowens, hch

>>>>> "Andreas" == Andreas Dilger <adilger@dilger.ca> writes:

Andreas,

Andreas> What are your thoughts on reserving a small number of the
Andreas> stream ID values for filesystem metadata (e.g. the first 31
Andreas> since 0 == unused)?

The stream ID address space might be 16-bit but the devices currently
being discussed can only handle a few concurrently open streams (4, 8,
16).

Discussions are ongoing about whether the devices should be able to
implicitly close a stream based on LRU or something similar when the hw
max is reached. But as it stands you have to explicitly close an
existing stream with a separate command when the max is reached.
Otherwise a write with a previously unused stream ID will fail.

I.e. there is a significant cost to switching between stream IDs and
thus I am afraid the current stuff isn't well suited to having a fine
grained tagging approach like the one you are proposing.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-06 13:03   ` Martin K. Petersen
@ 2016-03-06 16:08     ` Boaz Harrosh
  2016-03-06 20:51       ` Shaun Tancheff
  2016-03-07 15:34       ` Martin K. Petersen
  2016-03-06 22:42     ` Andreas Dilger
  1 sibling, 2 replies; 36+ messages in thread
From: Boaz Harrosh @ 2016-03-06 16:08 UTC (permalink / raw)
  To: Martin K. Petersen, Andreas Dilger
  Cc: Jens Axboe, linux-fsdevel, linux-block, calvinowens, hch

On 03/06/2016 03:03 PM, Martin K. Petersen wrote:
>>>>>> "Andreas" == Andreas Dilger <adilger@dilger.ca> writes:
> 
> Andreas,
> 
> Andreas> What are your thoughts on reserving a small number of the
> Andreas> stream ID values for filesystem metadata (e.g. the first 31
> Andreas> since 0 == unused)?
> 
> The stream ID address space might be 16-bit but the devices currently
> being discussed can only handle a few concurrently open streams (4, 8,
> 16).
> 

So I can't for the life of me understand what is the use of all this.

If this is, as Jens said at the beginning, about data grouping, then what
does it have to do with opening and closing anything? Really?

Does it make any sense to you? I mean with multi-queue with HW channels
for every CPU, parallel IO, and polling because even interrupts are too
slow, you want me to cram everything and serialize it on 4 open/close
streams?

Either I'm completely missing something, or please please explain
what is going on. How do 4 streams make any sense in today's NVMe
HW? How does open/close anything make any sense?

On the surface it looks like someone is smoking something really
bad.

> Discussions are ongoing about whether the devices should be able to
> implicitly close a stream based on LRU or something similar when the hw
> max is reached. But as it stands you have to explicitly close an
> existing stream with a separate command when the max is reached.
> Otherwise a write with a previously unused stream ID will fail.
> 

?? (see smoking above ;-) )

> I.e. there is a significant cost to switching between stream IDs and
> thus I am afraid the current stuff isn't well suited to having a fine
> grained tagging approach like the one you are proposing.
> 

So again who cares for anything that hurts performance? Why would
I want to use anything that has "significant cost" at all?

Thanks
Boaz



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-06 16:08     ` Boaz Harrosh
@ 2016-03-06 20:51       ` Shaun Tancheff
  2016-03-07 15:41         ` Martin K. Petersen
  2016-03-07 15:34       ` Martin K. Petersen
  1 sibling, 1 reply; 36+ messages in thread
From: Shaun Tancheff @ 2016-03-06 20:51 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Martin K. Petersen, Andreas Dilger, Jens Axboe, linux-fsdevel,
	linux-block, calvinowens, hch

On Sun, Mar 6, 2016 at 10:08 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/06/2016 03:03 PM, Martin K. Petersen wrote:
>>>>>>> "Andreas" == Andreas Dilger <adilger@dilger.ca> writes:
>>
>> Andreas,
>>
>> Andreas> What are your thoughts on reserving a small number of the
>> Andreas> stream ID values for filesystem metadata (e.g. the first 31
>> Andreas> since 0 == unused)?
>>
>> The stream ID address space might be 16-bit but the devices currently
>> being discussed can only handle a few concurrently open streams (4, 8,
>> 16).
>>
>
> So I can't for the life of me understand what is the use of all this.
>
> If this is, as Jens said at the beginning, about data grouping, then what
> does it have to do with opening and closing anything? Really?
>
> Does it make any sense to you? I mean with multi-queue with HW channels
> for every CPU, parallel IO, and polling because even interrupts are too
> slow, you want me to cram everything and serialize it on 4 open/close
> streams?
>
> Either I'm completely missing something, or please please explain
> what is going on. How do 4 streams make any sense in today's NVMe
> HW? How does open/close anything make any sense?
>
> On the surface it looks like someone is smoking something really
> bad.
>
>> Discussions are ongoing about whether the devices should be able to
>> implicitly close a stream based on LRU or something similar when the hw
>> max is reached. But as it stands you have to explicitly close an
>> existing stream with a separate command when the max is reached.
>> Otherwise a write with a previously unused stream ID will fail.
>>
>
> ?? (see smoking above ;-) )

As I see it, we can slice it up three ways: the limited stream resource
management is pushed up to user space, kernel space, or device
space. I think there are reasonable gains and choices at any one of
these logical partitions.

In the user space the application has (or should have) a good idea
of what data is related. However each application is essentially
independent from other parts of user space trying to make use of
the same limited resource.

This takes us to the next logical step of stream id management in
kernel space which can better manage the limited resource and
group application stream requests together. The kernel space *could*
handle all the LRU semantics of opening and closing streams or
simply just rotating around and reusing the oldest stream id.
This presupposes that the device itself is attached to a single
machine and the kernel is in full control of the device.

Finally we could push back and demand that the device handle the
LRU semantics and never fail a write or open to a stream.

I think we should be doing all of the above.

In kernel space we know the file system's volatile metadata, and it is
probably a good thing to keep a stream for that. Optionally it may
be useful to have a second stream for the journal.

The rest of the hinting is best seeded from the application, with
some sort of fallback if the application fails to set an affinity.
How the applications' hints are muxed together once the device's
stream IDs are exhausted probably doesn't matter a great deal so
long as it is consistent. Whether 4 or 256 underlying streams are
available, it is still in effect a spectrum from volatility to stability
at different levels of granularity. Larger devices will presumably have
more streams.

The only thing we really don't want to see (IMO) is the device
co-mingling streams that really shouldn't be together because
the firmware does a bad job with its LRU and mixes volatile data
with stable data.


>> I.e. there is a significant cost to switching between stream IDs and
>> thus I am afraid the current stuff isn't well suited to having a fine
>> grained tagging approach like the one you are proposing.
>>
>
> So again who cares for anything that hurts performance? Why would
> I want to use anything that has "significant cost" at all?
>
> Thanks
> Boaz
>

-- 
Shaun Tancheff


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-06 13:03   ` Martin K. Petersen
  2016-03-06 16:08     ` Boaz Harrosh
@ 2016-03-06 22:42     ` Andreas Dilger
  2016-03-07 15:52       ` Martin K. Petersen
  1 sibling, 1 reply; 36+ messages in thread
From: Andreas Dilger @ 2016-03-06 22:42 UTC (permalink / raw)
  To: Martin Petersen
  Cc: Jens Axboe, linux-fsdevel, linux-block, calvinowens,
	Theodore Ts'o, Christoph Hellwig


On Mar 6, 2016, at 6:03 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote:
> 
>> "Andreas" == Andreas Dilger <adilger@dilger.ca> writes:
> 
> Andreas,
> 
> Andreas> What are your thoughts on reserving a small number of the
> Andreas> stream ID values for filesystem metadata (e.g. the first 31
> Andreas> since 0 == unused)?
> 
> The stream ID address space might be 16-bit but the devices currently
> being discussed can only handle a few concurrently open streams (4, 8,
> 16).
> 
> Discussions are ongoing about whether the devices should be able to
> implicitly close a stream based on LRU or something similar when the hw
> max is reached. But as it stands you have to explicitly close an
> existing stream with a separate command when the max is reached.
> Otherwise a write with a previously unused stream ID will fail.
> 
> I.e. there is a significant cost to switching between stream IDs and
> thus I am afraid the current stuff isn't well suited to having a fine
> grained tagging approach like the one you are proposing.

It makes sense to isolate userspace from the number of streams that are
available on a device; otherwise they will have to dig into the details
of different kinds of hardware (SCSI mode pages, SATA, complexity with
RAID, etc.) that are best kept within the kernel.

I think everyone agrees that there isn't going to be a 1:1 mapping from
the Stream ID given by userspace to the actual device, but there are many
different uses for data/metadata labels at the block layer beyond just
the SSD write aggregation.  If the device can't handle these streams, we
can always merge them at the point they are sent to the device, but you
can't invent the data you want if you don't have it in the first place.

It doesn't cost anything to reserve the first 32 values for filesystem
metadata, and they can be aggregated more or less depending on the hardware
capabilities.  Even if current devices only support 4 or 8 streams, I'm
sure that this will improve in the future, so it doesn't make sense to
limit ourselves based on the very first devices on the market.

Also, this opens up interesting possibilities for blktrace, DM layers like
dm-thinp, bcache, etc. that are currently lacking any kind of data on how
they should allocate blocks.  Ted described the contortions he does to map
from block offsets in blktrace to filesystem metadata using debugfs output
and scripts; not everyone is as knowledgeable about filesystem internals
as he is, but they still want to be able to diagnose filesystem IO latency issues.

Cheers, Andreas








* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-06 16:08     ` Boaz Harrosh
  2016-03-06 20:51       ` Shaun Tancheff
@ 2016-03-07 15:34       ` Martin K. Petersen
  1 sibling, 0 replies; 36+ messages in thread
From: Martin K. Petersen @ 2016-03-07 15:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Martin K. Petersen, Andreas Dilger, Jens Axboe, linux-fsdevel,
	linux-block, calvinowens, hch

>>>>> "Boaz" == Boaz Harrosh <boaz@plexistor.com> writes:

Boaz> Does it make any sense to you? I mean with multi-queue with HW
Boaz> channels for every CPU, parallel IO, and polling because even
Boaz> interrupts are too slow, you want me to cram everything and
Boaz> serialize it on 4 open/close streams?

Boaz> Either I'm completely missing something, or please please explain
Boaz> what is going on. How do 4 streams make any sense in today's NVMe
Boaz> HW? How does open/close anything make any sense?

For number of streams you should think hardware flash channels.

Boaz> On the surface it looks like someone is smoking something really
Boaz> bad.

No, the hardware folks have good reasons why they want things to work a
certain way.

I am mostly objecting to the fact that the model of the current streams
proposal is heavily biased towards a fairly narrow use case and closely
tied to how current generations of flash controllers are implemented.

I would like to see a bit more abstraction put into the model so it is
less "controller with N flash-channels and anemic ARM core"
centric. That obviously means more complexity in the controller design,
hence the pushback from the vendors. However, I think it'll be worth it
in the long run.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-06 20:51       ` Shaun Tancheff
@ 2016-03-07 15:41         ` Martin K. Petersen
  0 siblings, 0 replies; 36+ messages in thread
From: Martin K. Petersen @ 2016-03-07 15:41 UTC (permalink / raw)
  To: Shaun Tancheff
  Cc: Boaz Harrosh, Martin K. Petersen, Andreas Dilger, Jens Axboe,
	linux-fsdevel, linux-block, calvinowens, hch

>>>>> "Shaun" == Shaun Tancheff <shaun.tancheff@seagate.com> writes:

Shaun> This presupposes that the device itself is attached to a single
Shaun> machine and the kernel is in full control of the device.

Currently the NVMe TWG is leaning towards having each "I-T nexus" have
dedicated stream resources.

So if your controller supports 16 streams, in a 4-way multipath setup
each host would have a cap of 4 concurrent streams.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-06 22:42     ` Andreas Dilger
@ 2016-03-07 15:52       ` Martin K. Petersen
  0 siblings, 0 replies; 36+ messages in thread
From: Martin K. Petersen @ 2016-03-07 15:52 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Martin Petersen, Jens Axboe, linux-fsdevel, linux-block,
	calvinowens, Theodore Ts'o, Christoph Hellwig

>>>>> "Andreas" == Andreas Dilger <adilger@dilger.ca> writes:

Andreas> It doesn't cost anything to reserve the first 32 values for
Andreas> filesystem metadata, and they can be aggregated more or less
Andreas> depending on the hardware capabilities.  Even if current
Andreas> devices only support 4 or 8 streams, I'm sure that this will
Andreas> improve in the future, so it doesn't make sense to limit
Andreas> ourselves based on the very first devices on the market.

I am not against reserving parts of the namespace. As long as the
anticipation is that those 32 values in practice will be consolidated
into probably 2 (journal and metadata) in a static fashion.

Because, as I said, there are performance and serialization issues
wrt. closing and opening streams. So I don't think doing it in the
hotpath is going to fly.

Andreas> Also, this opens up interesting possibilities for blktrace, DM
Andreas> layers like dm-thinp, bcache, etc. that are currently lacking
Andreas> any kind of data on how they should allocate blocks.  Ted
Andreas> described the contortions he does to map from block offsets in
Andreas> blktrace to filesystem metadata using debugfs output and
Andreas> scripts, and not everyone is as knowledgeable about filesystem
Andreas> internals as he is, but still wants to be able to diagnose
Andreas> filesystem IO latency issues.

I am a big proponent of I/O tagging. We use it extensively in various
products and it works beautifully. That's why I would like to see the
storage standards being able to support it.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-05 20:48         ` Martin K. Petersen
@ 2016-03-08 21:56           ` Jens Axboe
  2016-03-17 23:43             ` Dan Williams
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2016-03-08 21:56 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Jeff Moyer, linux-fsdevel, linux-block, calvinowens, hch, adilger

On 03/05/2016 01:48 PM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>
> Jens,
>
>>> OK.  I'm still of the opinion that we should try to make this
>>> transparent.  I could be swayed by workload descriptions and numbers
>>> comparing approaches, though.
>
> Jens> You can't just waive that flag and not have a solution. Any
> Jens> solution in that space would imply having policy in the kernel. A
> Jens> "just use a stream per file" is never going to work.
>
> I totally understand the desire to have explicit, long-lived
> "from-file-open to file-close" streams for things like database journals
> and whatnot.

That is an appealing use case.

> However, I think that you are dismissing the benefits of being able to
> group I/Os to disjoint LBA ranges within a brief period of time as
> belonging to a single file. It's something that we know works well on
> other types of storage. And it's also a much better heuristic for data
> placement on SSDs than just picking the next available bucket. It does
> require some pipelining on the drive but they will need some front end
> logic to handle the proposed stream ID separation in any case.

I'm not a huge fan of heuristics based exclusively around the temporal 
and spatial locality. Using that as a hint for a case where no stream ID 
(or write tag) is given would be an improvement, though. And perhaps 
parts of the space should be reserved to just that.

But I don't think that should exclude doing this in a much more managed 
fashion; personally, I find that a lot saner than adding this sort of 
state tracking in the kernel.

> Also, in our experiments we essentially got the explicit stream ID for
> free by virtue of the journal being written often enough that it was
> rarely if ever evicted as an active stream by the device. With no
> changes whatsoever to any application.

Journal would be an easy one to guess, for sure.

> My gripe with the current stuff is the same as before: The protocol is
> squarely aimed at papering over issues with current flash technology. It
> kinda-sorta works for other types of devices but it is very limiting. I
> appreciate that it is a great fit for the "handful of apps sharing a
> COTS NVMe drive on a cloud server" use case. But I think it is horrible
> for NVMe over Fabrics and pretty much everything else. That wouldn't be
> a big deal if the traditional storage models were going away. But I
> don't think they are...

I don't think erase blocks are going to go away in the near future. 
We're going to have better media as well, that's a given, but cheaper 
TLC flash is just going to make the current problem much worse. The 
patchset is really about tagging the writes with a stream ID, nothing 
else. That could potentially be any type of hinting, it's not exclusive 
to being used with NVMe write directives at all.


-- 
Jens Axboe



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-08 21:56           ` Jens Axboe
@ 2016-03-17 23:43             ` Dan Williams
  2016-03-18  0:18               ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Williams @ 2016-03-17 23:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Jeff Moyer, linux-fsdevel, linux-block,
	calvinowens, Christoph Hellwig, Andreas Dilger

On Tue, Mar 8, 2016 at 1:56 PM, Jens Axboe <axboe@fb.com> wrote:
> On 03/05/2016 01:48 PM, Martin K. Petersen wrote:
>>>>>>>
>>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>>
>>
>> Jens,
>>
>>>> OK.  I'm still of the opinion that we should try to make this
>>>> transparent.  I could be swayed by workload descriptions and numbers
>>>> comparing approaches, though.
>>
>>
>> Jens> You can't just waive that flag and not have a solution. Any
>> Jens> solution in that space would imply having policy in the kernel. A
>> Jens> "just use a stream per file" is never going to work.
>>
>> I totally understand the desire to have explicit, long-lived
>> "from-file-open to file-close" streams for things like database journals
>> and whatnot.
>
>
> That is an appealing use case.
>
>> However, I think that you are dismissing the benefits of being able to
>> group I/Os to disjoint LBA ranges within a brief period of time as
>> belonging to a single file. It's something that we know works well on
>> other types of storage. And it's also a much better heuristic for data
>> placement on SSDs than just picking the next available bucket. It does
>> require some pipelining on the drive but they will need some front end
>> logic to handle the proposed stream ID separation in any case.
>
>
> I'm not a huge fan of heuristics based exclusively around the temporal and
>> spatial locality. Using that as a hint for a case where no stream ID (or
> write tag) is given would be an improvement, though. And perhaps parts of
> the space should be reserved to just that.
>
> But I don't think that should exclude doing this in a much more managed
> fashion, personally I find that a lot saner than adding this sort of state
> tracking in the kernel.
>
>> Also, in our experiments we essentially got the explicit stream ID for
>> free by virtue of the journal being written often enough that it was
>> rarely if ever evicted as an active stream by the device. With no
>> changes whatsoever to any application.
>
>
> Journal would be an easy one to guess, for sure.
>
>> My gripe with the current stuff is the same as before: The protocol is
>> squarely aimed at papering over issues with current flash technology. It
>> kinda-sorta works for other types of devices but it is very limiting. I
>> appreciate that it is a great fit for the "handful of apps sharing a
>> COTS NVMe drive on a cloud server" use case. But I think it is horrible
>> for NVMe over Fabrics and pretty much everything else. That wouldn't be
>> a big deal if the traditional storage models were going away. But I
>> don't think they are...
>
>
> I don't think erase blocks are going to go away in the near future. We're
> going to have better media as well, that's a given, but cheaper TLC flash is
> just going to make the current problem much worse. The patchset is really
> about tagging the writes with a stream ID, nothing else. That could
> potentially be any type of hinting, it's not exclusive to being used with
> NVMe write directives at all.
>

Maybe I'm misunderstanding, but why does stream-id imply anything more
than just "opaque tag set at the top of the stack that makes it down
to a driver".  Sure NVMe can interpret these as NVMe streams, but any
other driver can have its own transport specific translation of what
the hint means.  I think the minute the opaque number requires
specific driver behavior we'll fall into a rat hole of how to
translate intent across usages.

In other words, I think it will always be the case that the hint has
application + transport/driver meaning, but otherwise the kernel is
just a conduit.


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-17 23:43             ` Dan Williams
@ 2016-03-18  0:18               ` Jens Axboe
  2016-03-18  2:39                 ` Martin K. Petersen
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2016-03-18  0:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Martin K. Petersen, Jeff Moyer, linux-fsdevel, linux-block,
	calvinowens, Christoph Hellwig, Andreas Dilger

On 03/17/2016 04:43 PM, Dan Williams wrote:
> On Tue, Mar 8, 2016 at 1:56 PM, Jens Axboe <axboe@fb.com> wrote:
>> On 03/05/2016 01:48 PM, Martin K. Petersen wrote:
>>>>>>>>
>>>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>>>
>>>
>>> Jens,
>>>
>>>>> OK.  I'm still of the opinion that we should try to make this
>>>>> transparent.  I could be swayed by workload descriptions and numbers
>>>>> comparing approaches, though.
>>>
>>>
>>> Jens> You can't just waive that flag and not have a solution. Any
>>> Jens> solution in that space would imply having policy in the kernel. A
>>> Jens> "just use a stream per file" is never going to work.
>>>
>>> I totally understand the desire to have explicit, long-lived
>>> "from-file-open to file-close" streams for things like database journals
>>> and whatnot.
>>
>>
>> That is an appealing use case.
>>
>>> However, I think that you are dismissing the benefits of being able to
>>> group I/Os to disjoint LBA ranges within a brief period of time as
>>> belonging to a single file. It's something that we know works well on
>>> other types of storage. And it's also a much better heuristic for data
>>> placement on SSDs than just picking the next available bucket. It does
>>> require some pipelining on the drive but they will need some front end
>>> logic to handle the proposed stream ID separation in any case.
>>
>>
>> I'm not a huge fan of heuristics based exclusively on temporal and
>> spatial locality. Using that as a hint for a case where no stream ID (or
>> write tag) is given would be an improvement, though. And perhaps parts of
>> the space should be reserved to just that.
>>
>> But I don't think that should exclude doing this in a much more managed
>> fashion; personally, I find that a lot saner than adding this sort of state
>> tracking in the kernel.
>>
>>> Also, in our experiments we essentially got the explicit stream ID for
>>> free by virtue of the journal being written often enough that it was
>>> rarely if ever evicted as an active stream by the device. With no
>>> changes whatsoever to any application.
>>
>>
>> Journal would be an easy one to guess, for sure.
>>
>>> My gripe with the current stuff is the same as before: The protocol is
>>> squarely aimed at papering over issues with current flash technology. It
>>> kinda-sorta works for other types of devices but it is very limiting. I
>>> appreciate that it is a great fit for the "handful of apps sharing a
>>> COTS NVMe drive on a cloud server" use case. But I think it is horrible
>>> for NVMe over Fabrics and pretty much everything else. That wouldn't be
>>> a big deal if the traditional storage models were going away. But I
>>> don't think they are...
>>
>>
>> I don't think erase blocks are going to go away in the near future. We're
>> going to have better media as well, that's a given, but cheaper TLC flash is
>> just going to make the current problem much worse. The patchset is really
>> about tagging the writes with a stream ID, nothing else. That could
>> potentially be any type of hinting, it's not exclusive to being used with
>> NVMe write directives at all.
>>
>
> Maybe I'm misunderstanding, but why does stream-id imply anything more
> than just "opaque tag set at the top of the stack that makes it down
> to a driver".  Sure NVMe can interpret these as NVMe streams, but any
> other driver can have its own transport specific translation of what
> the hint means.  I think the minute the opaque number requires
> specific driver behavior we'll fall into a rat hole of how to
> translate intent across usages.
>
> In other words, I think it will always be the case that the hint has
> application + transport/driver meaning, but otherwise the kernel is
> just a conduit.

You are not missing anything, that's exactly how it is intended, and 
that's how the interface is designed as well. If you want to tie extra 
meaning to this for specific drivers or transports, that's fine, and 
there's nothing that prevents that from happening. As it stands, and as 
it is proposed, it's just a write tag/stream ID that we can set on a 
file or inode, and have passed to the driver. Nothing more, nothing less.

-- 
Jens Axboe



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-18  0:18               ` Jens Axboe
@ 2016-03-18  2:39                 ` Martin K. Petersen
  2016-03-18 17:37                   ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Martin K. Petersen @ 2016-03-18  2:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dan Williams, Martin K. Petersen, Jeff Moyer, linux-fsdevel,
	linux-block, calvinowens, Christoph Hellwig, Andreas Dilger

>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:

Jens> You are not missing anything, that's exactly how it is intended,
Jens> and that's how the interface is designed as well. If you want to
Jens> tie extra meaning to this for specific drivers or transports,
Jens> that's fine, and there's nothing that prevents that from
Jens> happening. As it stands, and as it is proposed, it's just a write
Jens> tag/stream ID that we can set on a file or inode, and have passed
Jens> to the driver. Nothing more, nothing less.

Yep. I'm fine with having the option of letting applications request an
ID from the kernel and being able to tag I/Os. But it is also imperative
that we are able to tag all remaining I/Os automatically. Hashing inode
and device/partition is fine; it does not have to be particularly
clever.

There are several types of non-flash storage in the pipeline that depend
on being able to identify blocks in disjoint LBA ranges as belonging to
the same "file". So as long as we don't go down the 10-channel flash
controller rat hole I'm perfectly happy...

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-18  2:39                 ` Martin K. Petersen
@ 2016-03-18 17:37                   ` Jens Axboe
  2016-03-18 17:56                     ` Dan Williams
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2016-03-18 17:37 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Dan Williams, Jeff Moyer, linux-fsdevel, linux-block,
	calvinowens, Christoph Hellwig, Andreas Dilger

On 03/17/2016 07:39 PM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>
> Jens> You are not missing anything, that's exactly how it is intended,
> Jens> and that's how the interface is designed as well. If you want to
> Jens> tie extra meaning to this for specific drivers or transports,
> Jens> that's fine, and there's nothing that prevents that from
> Jens> happening. As it stands, and as it is proposed, it's just a write
> Jens> tag/stream ID that we can set on a file or inode, and have passed
> Jens> to the driver. Nothing more, nothing less.
>
> Yep. I'm fine with having the option of letting applications request an
> ID from the kernel and being able to tag I/Os. But it is also imperative
> that we are able to tag all remaining I/Os automatically. Hashing inode
> and device/partition is fine, it does not have to be particularly
> clever.
>
> There are several types of non-flash storage in the pipeline that depend
> on being able to identify blocks in disjoint LBA ranges as belonging to
> the same "file". So as long as we don't go down the 10-channel flash
> controller rat hole I'm perfectly happy...

So where do we go from here? It seems to me we all agree on the core 
parts; what do we do about the user interface? I can split the patchset 
into two so we can get the core bits merged, and start further 
discussion on the interface part, if we need to.

-- 
Jens Axboe



* Re: [PATCH 0/11] Update version of write stream ID patchset
  2016-03-18 17:37                   ` Jens Axboe
@ 2016-03-18 17:56                     ` Dan Williams
  0 siblings, 0 replies; 36+ messages in thread
From: Dan Williams @ 2016-03-18 17:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Martin K. Petersen, Jeff Moyer, linux-fsdevel, linux-block,
	calvinowens, Christoph Hellwig, Andreas Dilger

On Fri, Mar 18, 2016 at 10:37 AM, Jens Axboe <axboe@fb.com> wrote:
> On 03/17/2016 07:39 PM, Martin K. Petersen wrote:
>>>>>>>
>>>>>>> "Jens" == Jens Axboe <axboe@fb.com> writes:
>>
>>
>> Jens> You are not missing anything, that's exactly how it is intended,
>> Jens> and that's how the interface is designed as well. If you want to
>> Jens> tie extra meaning to this for specific drivers or transports,
>> Jens> that's fine, and there's nothing that prevents that from
>> Jens> happening. As it stands, and as it is proposed, it's just a write
>> Jens> tag/stream ID that we can set on a file or inode, and have passed
>> Jens> to the driver. Nothing more, nothing less.
>>
>> Yep. I'm fine with having the option of letting applications request an
>> ID from the kernel and being able to tag I/Os. But it is also imperative
>> that we are able to tag all remaining I/Os automatically. Hashing inode
>> and device/partition is fine, it does not have to be particularly
>> clever.
>>
>> There are several types of non-flash storage in the pipeline that depend
>> on being able to identify blocks in disjoint LBA ranges as belonging to
>> the same "file". So as long as we don't go down the 10-channel flash
>> controller rat hole I'm perfectly happy...
>
>
> So where do we go from here? Seems to me like the core parts we all agree
> on, what do we do about the user interface? I can split the patchset into 2
> so we can get the core bits merged, and start further discussion on the
> interface part, if we need to.

I think the ability to set a stream-id on an inode is interesting, but
the use cases we went through for enabling host-hinted SSHDs [1] found
that a good amount of benefit could be had by specifying per-process
stream-ids.

It seems in the inode case we're worried about communicating permanent
placement hints to the controller, while in the per-process hint we
can communicate some amount of temporal and access pattern based
differentiation.

[1]: https://lkml.org/lkml/2014/10/29/805


end of thread, other threads:[~2016-03-18 17:56 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-04 16:10 [PATCH 0/11] Update version of write stream ID patchset Jens Axboe
2016-03-04 16:10 ` [PATCH 01/11] idr: make ida_simple_remove() return an error Jens Axboe
2016-03-04 16:10 ` [PATCH 02/11] block: add support for carrying a stream ID in a bio Jens Axboe
2016-03-04 16:10 ` [PATCH 03/11] Add support for per-file/inode stream ID Jens Axboe
     [not found]   ` <CAJVOszBXU-qQENcOGG8pWeARwoWL2G3gNJ0H2uNPjXkiVa8S+Q@mail.gmail.com>
2016-03-04 20:35     ` Jens Axboe
2016-03-04 16:10 ` [PATCH 04/11] Add system call for setting inode/file write " Jens Axboe
2016-03-04 16:10 ` [PATCH 05/11] wire up system call for x86/x86-64 Jens Axboe
2016-03-04 16:10 ` [PATCH 06/11] Add support for bdi tracking of stream ID Jens Axboe
2016-03-04 16:10 ` [PATCH 07/11] direct-io: add support for write stream IDs Jens Axboe
2016-03-04 16:10 ` [PATCH 08/11] Add stream ID support for buffered mpage/__block_write_full_page() Jens Axboe
2016-03-04 16:10 ` [PATCH 09/11] btrfs: add support for write stream IDs Jens Axboe
2016-03-04 20:44   ` Chris Mason
2016-03-04 20:45     ` Jens Axboe
2016-03-04 16:10 ` [PATCH 10/11] xfs: add support for buffered writeback stream ID Jens Axboe
2016-03-04 16:10 ` [PATCH 11/11] ext4: add support for write stream IDs Jens Axboe
2016-03-04 19:42 ` [PATCH 0/11] Update version of write stream ID patchset Jeff Moyer
2016-03-04 20:34   ` Jens Axboe
2016-03-04 21:01     ` Jeff Moyer
2016-03-04 21:06       ` Jens Axboe
2016-03-04 22:03         ` Jeff Moyer
2016-03-04 22:13           ` Jens Axboe
2016-03-05 20:48         ` Martin K. Petersen
2016-03-08 21:56           ` Jens Axboe
2016-03-17 23:43             ` Dan Williams
2016-03-18  0:18               ` Jens Axboe
2016-03-18  2:39                 ` Martin K. Petersen
2016-03-18 17:37                   ` Jens Axboe
2016-03-18 17:56                     ` Dan Williams
2016-03-06  6:13 ` Andreas Dilger
2016-03-06 13:03   ` Martin K. Petersen
2016-03-06 16:08     ` Boaz Harrosh
2016-03-06 20:51       ` Shaun Tancheff
2016-03-07 15:41         ` Martin K. Petersen
2016-03-07 15:34       ` Martin K. Petersen
2016-03-06 22:42     ` Andreas Dilger
2016-03-07 15:52       ` Martin K. Petersen
