* [PATCHSET v5] Add support for write life time hints
@ 2017-06-15 16:41 Jens Axboe
  2017-06-15 16:41 ` [PATCH 01/12] block: add support for carrying stream information in a bio Jens Axboe
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:41 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen

A new iteration of this patchset, previously known as write streams.
As before, this patchset aims at enabling applications to split up
writes into separate streams, based on the perceived life time
of the data written. This is useful for a variety of reasons:

- For NVMe, this feature is ratified and released with the NVMe 1.3
  spec. Devices implementing Directives can expose multiple streams.
  Separating data written into streams based on life time can
  drastically reduce the write amplification. This helps device
  endurance, and increases performance. Testing just performed
  internally at Facebook with these patches showed up to a 25% reduction
  in NAND writes in a RocksDB setup.

- Software caching solutions can make more intelligent decisions
  on how and where to place data.

Contrary to previous patches, we're not exposing numeric stream values anymore.
I've previously advocated for just doing a set of hints that makes sense
instead. See the coverage from the LSFMM summit this year:

https://lwn.net/Articles/717755/

This patchset attempts to do that. We define 4 flags for the pwritev2
system call:

RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
			a high overwrite rate, i.e. a short life time.

RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT

RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM

RWF_WRITE_LIFE_EXTREME	Longer life time than LONG

The idea is that these are relative values, so an application can
use them as it sees fit. The underlying device can then place
data appropriately, or be free to ignore the hint. It's just a hint.
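
As a rough illustration of the flag encoding described above, here is a
minimal userspace sketch. The enum and RWF_WRITE_LIFE_* values are copied
from the uapi hunks later in this series and are not part of any released
header; the two helper functions are hypothetical names used only for
this example.

```c
/*
 * Values copied from the uapi hunks of this series. They are defined
 * locally because no released kernel header is assumed to carry them.
 */
enum write_hint {
	WRITE_HINT_NONE = 0,
	WRITE_HINT_SHORT,
	WRITE_HINT_MEDIUM,
	WRITE_HINT_LONG,
	WRITE_HINT_EXTREME,
};

#define RWF_WRITE_LIFE_SHIFT	4
#define RWF_WRITE_LIFE_MASK	0x00000070
#define RWF_WRITE_LIFE_SHORT	(WRITE_HINT_SHORT << RWF_WRITE_LIFE_SHIFT)
#define RWF_WRITE_LIFE_MEDIUM	(WRITE_HINT_MEDIUM << RWF_WRITE_LIFE_SHIFT)
#define RWF_WRITE_LIFE_LONG	(WRITE_HINT_LONG << RWF_WRITE_LIFE_SHIFT)
#define RWF_WRITE_LIFE_EXTREME	(WRITE_HINT_EXTREME << RWF_WRITE_LIFE_SHIFT)

/* Hypothetical helper: build the pwritev2() flag bits for a hint. */
static inline int rwf_flags_for_hint(enum write_hint hint)
{
	return ((int)hint << RWF_WRITE_LIFE_SHIFT) & RWF_WRITE_LIFE_MASK;
}

/* Hypothetical helper: recover the hint from a pwritev2() flags word. */
static inline enum write_hint rwf_flags_to_hint(int flags)
{
	return (enum write_hint)((flags & RWF_WRITE_LIFE_MASK) >>
				 RWF_WRITE_LIFE_SHIFT);
}
```

An application would OR the result of rwf_flags_for_hint() into the flags
argument of pwritev2(). Going by the flag check in patch 04, a kernel that
does not know these bits rejects the call with EOPNOTSUPP.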

Similarly, to query and set these values on the side, there's now
an fcntl based interface. This exposes the WRITE_HINT_* values to
userspace, and defines F_{GET,SET}_WRITE_LIFE commands to get and
set them as well.

A branch based on current master can be pulled
from here:

git://git.kernel.dk/linux-block write-stream.5

Changes since v4:

- Add enum write_hint and the WRITE_HINT_* values. This is what we
  use internally (until transformed to req/bio flags), and what is
  exposed to user space with the fcntl() interface. Maps directly
  to the RWF_WRITE_LIFE_* values.

- Add fcntl() interface for getting/setting hint values.

- Get rid of inode ->i_write_hint, encode the 3 bits of hint info
  in the inode flags instead.

- Allow a write with no hint to clear the old hint. Previously we
  only changed the hint if a new valid hint was given, not if no
  hint was passed in.

- Shrink flag space grabbed from 4 to 3 bits for RWF_* and the inode
  flags.

Changes since v3:

- Change any naming of stream ID to write hint.
- Various little API changes, suggested by Christoph
- Cleanup the NVMe bits, dump the debug info.
- Change NVMe to lazily allocate the streams.
- Various NVMe error handling improvements and command checking.

Changes since v2:

- Get rid of bio->bi_stream and replace with four request/bio flags.
  These map directly to the RWF_WRITE_* flags that the user passes in.
- Cleanup the NVMe stream setting.
- Drivers now responsible for updating the queue stream write counter,
  as they determine what stream to map a given flag to.

Changes since v1:

- Guard queue stream stats to ensure we don't mess up memory, if
  bio_stream() ever were to return a larger value than we support.
- NVMe: ensure we set the stream modulo the name space defined count.
- Cleanup the RWF_ and IOCB_ flags. Set aside 4 bits, and just store
  the stream value in there. This makes the passing of stream ID from
  RWF_ space to IOCB_ (and IOCB_ to bio) more efficient, and cleans it
  up in general.
- Kill the block internal definitions of the stream type, we don't need
  them anymore. See above.

-- 
Jens Axboe


* [PATCH 01/12] block: add support for carrying stream information in a bio
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
@ 2017-06-15 16:41 ` Jens Axboe
  2017-06-16 16:39   ` Martin K. Petersen
  2017-06-15 16:42 ` [PATCH 02/12] blk-mq: expose stream write stats through debugfs Jens Axboe
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:41 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

No functional changes in this patch, we just add four flags
that will be used to denote a stream type, and ensure that we
don't merge across different stream types.
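
The merge rule can be modelled outside the kernel as a simple mask
comparison. This is only a sketch: the bit positions below are made up
for the example (in the kernel they come from enum req_flag_bits), and
write_life_merge_ok() is a hypothetical name, not a function added by
this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, chosen for illustration only. */
#define REQ_WRITE_SHORT		(1ULL << 0)
#define REQ_WRITE_MEDIUM	(1ULL << 1)
#define REQ_WRITE_LONG		(1ULL << 2)
#define REQ_WRITE_EXTREME	(1ULL << 3)

#define REQ_WRITE_LIFE_MASK	(REQ_WRITE_SHORT | REQ_WRITE_MEDIUM | \
				 REQ_WRITE_LONG | REQ_WRITE_EXTREME)

/*
 * Two I/Os may be merged only if their life time bits are identical.
 * This also rejects merging a hinted write with an unhinted one.
 * Bits outside the mask do not take part in the comparison.
 */
static inline bool write_life_merge_ok(uint64_t a_flags, uint64_t b_flags)
{
	return (a_flags & REQ_WRITE_LIFE_MASK) ==
	       (b_flags & REQ_WRITE_LIFE_MASK);
}
```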

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c         | 16 ++++++++++++++++
 include/linux/blk_types.h | 11 +++++++++++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..7d299df3b12b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,14 @@ static struct request *attempt_merge(struct request_queue *q,
 		return NULL;
 
 	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+	    (next->cmd_flags & REQ_WRITE_LIFE_MASK))
+		return NULL;
+
+	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
 	 * will have updated segment counts, update sector
@@ -811,6 +819,14 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !blk_write_same_mergeable(rq->bio, bio))
 		return false;
 
+	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+	    (bio->bi_opf & REQ_WRITE_LIFE_MASK))
+		return false;
+
 	return true;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..57d1eb530799 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -201,6 +201,10 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+	__REQ_WRITE_SHORT,	/* short life time write */
+	__REQ_WRITE_MEDIUM,	/* medium life time write */
+	__REQ_WRITE_LONG,	/* long life time write */
+	__REQ_WRITE_EXTREME,	/* extremely long life time write */
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
@@ -221,6 +225,13 @@ enum req_flag_bits {
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
+#define REQ_WRITE_SHORT		(1ULL << __REQ_WRITE_SHORT)
+#define REQ_WRITE_MEDIUM	(1ULL << __REQ_WRITE_MEDIUM)
+#define REQ_WRITE_LONG		(1ULL << __REQ_WRITE_LONG)
+#define REQ_WRITE_EXTREME	(1ULL << __REQ_WRITE_EXTREME)
+
+#define REQ_WRITE_LIFE_MASK	(REQ_WRITE_SHORT | REQ_WRITE_MEDIUM | \
+					REQ_WRITE_LONG | REQ_WRITE_EXTREME)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 
-- 
2.7.4


* [PATCH 02/12] blk-mq: expose stream write stats through debugfs
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
  2017-06-15 16:41 ` [PATCH 01/12] block: add support for carrying stream information in a bio Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-16 16:38   ` Martin K. Petersen
  2017-06-15 16:42 ` [PATCH 03/12] fs: add support for an inode to carry write hint related data Jens Axboe
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Useful to verify that things are working the way they should.
Reading the file will return the number of KB written to each
stream. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Drivers will write to q->stream_writes[] if they handle a stream.
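
To make the intended behaviour concrete, here is a small userspace model
of the counters. It is a sketch under assumptions: struct queue_model
stands in for struct request_queue, account_stream_write() is a
hypothetical driver-side helper, and the KB unit follows the description
above (the driver patch that does the real accounting is not shown here).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLK_MAX_STREAM	5

/* Stand-in for the stream_writes[] member added to struct request_queue. */
struct queue_model {
	uint64_t stream_writes[BLK_MAX_STREAM];
};

/* Hypothetical driver-side accounting: add a write, in KB, to one stream. */
static inline void account_stream_write(struct queue_model *q,
					unsigned int stream, uint64_t bytes)
{
	if (stream < BLK_MAX_STREAM)	/* never index past the array */
		q->stream_writes[stream] += bytes >> 10;
}

/* Same output format as queue_streams_show(): one line per stream. */
static inline int streams_show(const struct queue_model *q, char *buf,
			       size_t len)
{
	size_t off = 0;
	int i;

	for (i = 0; i < BLK_MAX_STREAM; i++) {
		int n = snprintf(buf + off, len - off, "stream%d: %llu\n", i,
				 (unsigned long long)q->stream_writes[i]);

		if (n < 0 || (size_t)n >= len - off)
			return -1;	/* buffer too small */
		off += (size_t)n;
	}
	return (int)off;
}

/* Same effect as queue_streams_store(): any write resets all counters. */
static inline void streams_reset(struct queue_model *q)
{
	memset(q->stream_writes, 0, sizeof(q->stream_writes));
}
```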

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
 include/linux/blkdev.h |  3 +++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 803aed4d7221..0a37c848961d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
 	}
 }
 
+static int queue_streams_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	int i;
+
+	for (i = 0; i < BLK_MAX_STREAM; i++)
+		seq_printf(m, "stream%d: %llu\n", i, q->stream_writes[i]);
+
+	return 0;
+}
+
+static ssize_t queue_streams_store(void *data, const char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	struct request_queue *q = data;
+	int i;
+
+	for (i = 0; i < BLK_MAX_STREAM; i++)
+		q->stream_writes[i] = 0;
+
+	return count;
+}
+
 static int queue_poll_stat_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
@@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{"poll_stat", 0400, queue_poll_stat_show},
 	{"state", 0600, queue_state_show, queue_state_write},
+	{"streams", 0600, queue_streams_show, queue_streams_store},
 	{},
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab92c4ea138b..88719c6f3edf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -586,6 +586,9 @@ struct request_queue {
 
 	size_t			cmd_size;
 	void			*rq_alloc_data;
+
+#define BLK_MAX_STREAM	5
+	u64			stream_writes[BLK_MAX_STREAM];
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
2.7.4


* [PATCH 03/12] fs: add support for an inode to carry write hint related data
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
  2017-06-15 16:41 ` [PATCH 01/12] block: add support for carrying stream information in a bio Jens Axboe
  2017-06-15 16:42 ` [PATCH 02/12] blk-mq: expose stream write stats through debugfs Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-15 16:42 ` [PATCH 04/12] fs: add support for allowing applications to pass in write life time hints Jens Axboe
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

No functional changes in this patch, just in preparation for
allowing applications to pass in hints about data life times
for writes. Set aside 3 bits for carrying hint information
in the inode flags.

Adds the public hints as well, which are:

WRITE_HINT_NONE		No hints about write life time
WRITE_HINT_SHORT	Data written has a short life time
WRITE_HINT_MEDIUM	Data written has a medium life time
WRITE_HINT_LONG		Data written has a long life time
WRITE_HINT_EXTREME	Data written has an extremely long life time

Helpers are defined to store these values in flags, by passing in
the shift that's appropriate for the given use case.
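
A standalone sketch of how the shift-based helpers pack a hint into a
flags word may help here. The two conversion helpers mirror the ones in
the diff below; set_masked() is a hypothetical stand-in for what
inode_set_flags() does with the mask, written only for this example.

```c
enum write_hint {
	WRITE_HINT_NONE = 0,
	WRITE_HINT_SHORT,
	WRITE_HINT_MEDIUM,
	WRITE_HINT_LONG,
	WRITE_HINT_EXTREME,
};

#define S_WRITE_LIFE_MASK	0x1c000	/* bits 14..16 */
#define S_WRITE_LIFE_SHIFT	14

/* Place a hint value at the bit position given by shift. */
static inline unsigned int write_hint_to_mask(enum write_hint hint,
					      unsigned int shift)
{
	return (unsigned int)hint << shift;
}

/* Extract the 3-bit hint stored at the bit position given by shift. */
static inline enum write_hint mask_to_write_hint(unsigned int mask,
						 unsigned int shift)
{
	return (enum write_hint)((mask >> shift) & 0x7);
}

/*
 * Hypothetical stand-in for inode_set_flags(): replace only the bits
 * covered by mask and leave every other flag bit untouched.
 */
static inline unsigned int set_masked(unsigned int old, unsigned int flags,
				      unsigned int mask)
{
	return (old & ~mask) | (flags & mask);
}
```

The same two conversion helpers serve the inode flags, the iocb flags and
the RWF_* flags; only the shift passed in differs.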

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/inode.c              | 11 +++++++++++
 include/linux/fs.h      | 29 +++++++++++++++++++++++++++++
 include/uapi/linux/fs.h | 13 +++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..cc8a05c4c1be 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2120,3 +2120,14 @@ struct timespec current_time(struct inode *inode)
 	return timespec_trunc(now, inode->i_sb->s_time_gran);
 }
 EXPORT_SYMBOL(current_time);
+
+void inode_set_write_hint(struct inode *inode, enum write_hint hint)
+{
+	unsigned int flags = write_hint_to_mask(hint, S_WRITE_LIFE_SHIFT);
+
+	if (flags != mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
+		inode_lock(inode);
+		inode_set_flags(inode, flags, S_WRITE_LIFE_MASK);
+		inode_unlock(inode);
+	}
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..bef0b350f890 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1828,6 +1828,14 @@ struct super_operations {
 #endif
 
 /*
+ * Expected life time hint of a write for this inode. This uses the
+ * WRITE_HINT_* encoding, we just need to define the shift. We need
+ * 3 bits for this. Next S_* value is 131072, bit 17.
+ */
+#define S_WRITE_LIFE_MASK	0x1c000	/* bits 14..16 */
+#define S_WRITE_LIFE_SHIFT	14	/* 16384, next bit */
+
+/*
  * Note that nosuid etc flags are inode-specific: setting some file-system
  * flags just means all the inodes inherit those flags by default. It might be
  * possible to override it selectively if you really wanted to with some
@@ -1873,6 +1881,26 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
 	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
 }
 
+static inline unsigned int write_hint_to_mask(enum write_hint hint,
+					      unsigned int shift)
+{
+	return hint << shift;
+}
+
+static inline enum write_hint mask_to_write_hint(unsigned int mask,
+						 unsigned int shift)
+{
+	return (mask >> shift) & 0x7;
+}
+
+static inline unsigned int inode_write_hint(struct inode *inode)
+{
+	if (inode)
+		return mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+
+	return 0;
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
@@ -2757,6 +2785,7 @@ extern struct inode *new_inode(struct super_block *sb);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_privs(struct file *);
+extern void inode_set_write_hint(struct inode *inode, enum write_hint hint);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
 static inline void insert_inode_hash(struct inode *inode)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..58fbe0903016 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -356,6 +356,19 @@ struct fscrypt_key {
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * Write life time hint values.
+ */
+enum write_hint {
+	WRITE_HINT_NONE = 0,
+	WRITE_HINT_SHORT,
+	WRITE_HINT_MEDIUM,
+	WRITE_HINT_LONG,
+	WRITE_HINT_EXTREME,
+};
+
+#define WRITE_HINT_MASK		0x7	/* 3 bits */
+
 /* flags for preadv2/pwritev2: */
 #define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */
 #define RWF_DSYNC			0x00000002 /* per-IO O_DSYNC */
-- 
2.7.4


* [PATCH 04/12] fs: add support for allowing applications to pass in write life time hints
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (2 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 03/12] fs: add support for an inode to carry write hint related data Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-15 16:42 ` [PATCH 05/12] fs: add fcntl() interface for setting/getting " Jens Axboe
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Add four flags for the pwritev2(2) system call, allowing an application
to give the kernel a hint about what on-media life times can be
expected from a given write.

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Set aside 3 bits in the iocb flags structure to carry this information
over from the pwritev2 RWF_WRITE_LIFE_* flags.
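
The carry-over from RWF_* space to IOCB_* space amounts to moving the
same 3-bit value from one bit position to another. A minimal model,
assuming the shift and mask values from the diff below (the function
names and the plain int flags word are simplifications for this sketch):

```c
#define RWF_WRITE_LIFE_SHIFT	4
#define RWF_WRITE_LIFE_MASK	0x00000070	/* bits 4..6 */

#define IOCB_WRITE_LIFE_SHIFT	7
#define IOCB_WRITE_LIFE_MASK	(0x7 << IOCB_WRITE_LIFE_SHIFT)	/* bits 7..9 */

/* Hypothetical helper: move the hint from its RWF_* to its IOCB_* position. */
static inline int rwf_to_iocb_flags(int rwf_flags)
{
	int hint = (rwf_flags & RWF_WRITE_LIFE_MASK) >> RWF_WRITE_LIFE_SHIFT;

	return hint << IOCB_WRITE_LIFE_SHIFT;
}

/* Model of iocb_write_hint(), taking the flags word directly. */
static inline int iocb_write_hint(int ki_flags)
{
	return (ki_flags & IOCB_WRITE_LIFE_MASK) >> IOCB_WRITE_LIFE_SHIFT;
}
```

Note that in the diff below the inode hint is updated whenever either the
write carries a hint or the inode already has one, so a write without a
hint clears the previously stored value.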

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/read_write.c         | 12 +++++++++++-
 include/linux/fs.h      | 12 ++++++++++++
 include/uapi/linux/fs.h | 10 ++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..871e97ae4147 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -675,10 +675,11 @@ EXPORT_SYMBOL(iov_shorten);
 static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 		loff_t *ppos, int type, int flags)
 {
+	struct inode *inode = file_inode(filp);
 	struct kiocb kiocb;
 	ssize_t ret;
 
-	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
+	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
 		return -EOPNOTSUPP;
 
 	init_sync_kiocb(&kiocb, filp);
@@ -688,6 +689,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 		kiocb.ki_flags |= IOCB_DSYNC;
 	if (flags & RWF_SYNC)
 		kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+	if ((flags & RWF_WRITE_LIFE_MASK) ||
+	    mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
+		enum write_hint hint;
+
+		hint = mask_to_write_hint(flags, RWF_WRITE_LIFE_SHIFT);
+
+		inode_set_write_hint(inode, hint);
+		kiocb.ki_flags |= write_hint_to_mask(hint, IOCB_WRITE_LIFE_SHIFT);
+	}
 	kiocb.ki_pos = *ppos;
 
 	if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bef0b350f890..24803ed57ec6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -269,6 +269,12 @@ struct writeback_control;
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 
+/*
+ * Steal 3 bits for stream information, this allows 8 valid streams
+ */
+#define IOCB_WRITE_LIFE_SHIFT	7
+#define IOCB_WRITE_LIFE_MASK	(BIT(7) | BIT(8) | BIT(9))
+
 struct kiocb {
 	struct file		*ki_filp;
 	loff_t			ki_pos;
@@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 	};
 }
 
+static inline int iocb_write_hint(const struct kiocb *iocb)
+{
+	return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
+			IOCB_WRITE_LIFE_SHIFT;
+}
+
 /*
  * "descriptor" for what we're up to with a read.
  * This allows us to use the same read code yet
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 58fbe0903016..1feac96b43b9 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -374,4 +374,14 @@ enum write_hint {
 #define RWF_DSYNC			0x00000002 /* per-IO O_DSYNC */
 #define RWF_SYNC			0x00000004 /* per-IO O_SYNC */
 
+/*
+ * Data life time write flags, steal 3 bits for that
+ */
+#define RWF_WRITE_LIFE_SHIFT		4
+#define RWF_WRITE_LIFE_MASK		0x00000070 /* 3 bits of write hints */
+#define RWF_WRITE_LIFE_SHORT		(WRITE_HINT_SHORT << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_MEDIUM		(WRITE_HINT_MEDIUM << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_LONG		(WRITE_HINT_LONG << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_EXTREME		(WRITE_HINT_EXTREME << RWF_WRITE_LIFE_SHIFT)
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.7.4


* [PATCH 05/12] fs: add fcntl() interface for setting/getting write life time hints
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (3 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 04/12] fs: add support for allowing applications to pass in write life time hints Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-16 16:44   ` Martin K. Petersen
  2017-06-15 16:42 ` [PATCH 06/12] block: add helpers for setting/checking write hint validity Jens Axboe
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

We have a pwritev2(2) interface based on passing in flags. Add an
fcntl interface for querying these hints, and for setting them
as well:

F_GET_WRITE_LIFE	Returns one of the valid types of write hints,
			like WRITE_HINT_MEDIUM.

F_SET_WRITE_LIFE	Pass in a WRITE_HINT_* type to set the
			write life time hint for this file/inode.
			Returns 0 on success, -1 otherwise.
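
The semantics of the two commands can be sketched as a small userspace
model. Everything here is illustrative: struct inode_model and the CMD_*
constants are hypothetical stand-ins, and the real command numbers are
the F_{GET,SET}_WRITE_LIFE values from the diff below.

```c
#include <errno.h>

enum write_hint {
	WRITE_HINT_NONE = 0,
	WRITE_HINT_SHORT,
	WRITE_HINT_MEDIUM,
	WRITE_HINT_LONG,
	WRITE_HINT_EXTREME,
};

/* Hypothetical command values, used by this model only. */
enum {
	CMD_GET_WRITE_LIFE,
	CMD_SET_WRITE_LIFE,
};

/* Stand-in for the hint bits kept in the inode flags. */
struct inode_model {
	enum write_hint hint;
};

/*
 * Model of fcntl_write_life(): GET returns the stored hint, SET accepts
 * only the five defined values, anything else yields -EINVAL.
 */
static inline long write_life_cmd(struct inode_model *inode, int cmd,
				  unsigned long arg)
{
	switch (cmd) {
	case CMD_GET_WRITE_LIFE:
		return inode->hint;
	case CMD_SET_WRITE_LIFE:
		if (arg > (unsigned long)WRITE_HINT_EXTREME)
			return -EINVAL;
		inode->hint = (enum write_hint)arg;
		return 0;
	default:
		return -EINVAL;
	}
}
```

From userspace, a negative kernel return shows up as fcntl() returning
-1 with errno set, which matches the "-1 otherwise" wording above.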

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fcntl.c                 | 38 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fcntl.h |  6 ++++++
 2 files changed, 44 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f4e7267d117f..f89fef847f73 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -243,6 +243,40 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 }
 #endif
 
+long fcntl_write_life(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct inode *inode = file_inode(file);
+	long ret;
+
+	switch (cmd) {
+	case F_GET_WRITE_LIFE:
+		ret = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+		break;
+	case F_SET_WRITE_LIFE: {
+		enum write_hint hint = arg;
+
+		switch (hint) {
+		case WRITE_HINT_NONE:
+		case WRITE_HINT_SHORT:
+		case WRITE_HINT_MEDIUM:
+		case WRITE_HINT_LONG:
+		case WRITE_HINT_EXTREME:
+			inode_set_write_hint(inode, hint);
+			ret = 0;
+			break;
+		default:
+			ret = -EINVAL;
+		}
+		break;
+		}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		struct file *filp)
 {
@@ -337,6 +371,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GET_SEALS:
 		err = shmem_fcntl(filp, cmd, arg);
 		break;
+	case F_GET_WRITE_LIFE:
+	case F_SET_WRITE_LIFE:
+		err = fcntl_write_life(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 813afd6eee71..1c5b2a95e9c9 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,12 @@
 /* (1U << 31) is reserved for signed error codes */
 
 /*
+ * Set/Get write life time hints
+ */
+#define F_GET_WRITE_LIFE	(F_LINUX_SPECIFIC_BASE + 11)
+#define F_SET_WRITE_LIFE	(F_LINUX_SPECIFIC_BASE + 12)
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
-- 
2.7.4


* [PATCH 06/12] block: add helpers for setting/checking write hint validity
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (4 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 05/12] fs: add fcntl() interface for setting/getting " Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-16 16:47   ` Martin K. Petersen
  2017-06-15 16:42 ` [PATCH 07/12] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

We map the WRITE_HINT_* life time hints to the internal flags.
Drivers can then, in turn, map those flags to a suitable stream
type.
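
The two-step mapping (hint to internal flag, then flag to stream) can be
sketched as follows. The table lookup mirrors bio_op_write_hint() from
the diff below; the bit positions and the opf_to_stream() driver mapping
are assumptions made up for this example, since each driver is free to
choose its own mapping.

```c
#include <stddef.h>
#include <stdint.h>

enum write_hint {
	WRITE_HINT_NONE = 0,
	WRITE_HINT_SHORT,
	WRITE_HINT_MEDIUM,
	WRITE_HINT_LONG,
	WRITE_HINT_EXTREME,
};

/* Assumed bit positions, chosen for illustration only. */
#define REQ_WRITE_SHORT		(1ULL << 0)
#define REQ_WRITE_MEDIUM	(1ULL << 1)
#define REQ_WRITE_LONG		(1ULL << 2)
#define REQ_WRITE_EXTREME	(1ULL << 3)

static const uint64_t hint_to_req_flag[] = {
	0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
};

/* Table lookup with a bounds check; out-of-range hints map to "no hint". */
static inline uint64_t hint_to_opf(unsigned int hint)
{
	size_t nr = sizeof(hint_to_req_flag) / sizeof(hint_to_req_flag[0]);

	if (hint >= nr)
		return 0;
	return hint_to_req_flag[hint];
}

/* Hypothetical driver mapping: one stream per flag, stream 0 means none. */
static inline unsigned int opf_to_stream(uint64_t opf)
{
	if (opf & REQ_WRITE_SHORT)
		return 1;
	if (opf & REQ_WRITE_MEDIUM)
		return 2;
	if (opf & REQ_WRITE_LONG)
		return 3;
	if (opf & REQ_WRITE_EXTREME)
		return 4;
	return 0;
}
```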

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c               | 16 ++++++++++++++++
 include/linux/bio.h       |  1 +
 include/linux/blk_types.h |  5 +++++
 3 files changed, 22 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..758d83d91bb0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2082,6 +2082,22 @@ void bio_clone_blkcg_association(struct bio *dst, struct bio *src)
 
 #endif /* CONFIG_BLK_CGROUP */
 
+static const unsigned int rwf_write_to_opf_flag[] = {
+	0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
+};
+
+/*
+ * Convert WRITE_LIFE_* hints into req/bio flags
+ */
+unsigned int bio_op_write_hint(enum write_hint hint)
+{
+	if (WARN_ON_ONCE(hint >= ARRAY_SIZE(rwf_write_to_opf_flag)))
+		return 0;
+
+	return rwf_write_to_opf_flag[hint];
+}
+EXPORT_SYMBOL_GPL(bio_op_write_hint);
+
 static void __init biovec_init_slabs(void)
 {
 	int i;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0e99cf..e9360dc5ea07 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -443,6 +443,7 @@ extern struct bio *bio_copy_kern(struct request_queue *, void *, unsigned int,
 				 gfp_t, int);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
+extern unsigned int bio_op_write_hint(enum write_hint hint);
 
 void generic_start_io_acct(int rw, unsigned long sectors,
 			   struct hd_struct *part);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 57d1eb530799..23646eb433e7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -323,4 +323,9 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+static inline bool op_write_hint_valid(unsigned int opf)
+{
+	return (opf & REQ_WRITE_LIFE_MASK) != 0;
+}
+
 #endif /* __LINUX_BLK_TYPES_H */
-- 
2.7.4


* [PATCH 07/12] fs: add O_DIRECT support for sending down bio stream information
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (5 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 06/12] block: add helpers for setting/checking write hint validity Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-15 16:42 ` [PATCH 08/12] fs: add support for buffered writeback to pass down write hints Jens Axboe
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c     | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..de4301168710 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 			should_dirty = true;
 	} else {
 		bio.bi_opf = dio_bio_write_op(iocb);
+		bio.bi_opf |= bio_op_write_hint(iocb_write_hint(iocb));
 		task_io_account_write(ret);
 	}
 
@@ -374,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio_set_pages_dirty(bio);
 		} else {
 			bio->bi_opf = dio_bio_write_op(iocb);
+			bio->bi_opf |= bio_op_write_hint(iocb_write_hint(iocb));
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..98874478ec8a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	bio->bi_opf |= bio_op_write_hint(iocb_write_hint(dio->iocb));
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..7e18e760e421 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+			bio->bi_opf |= bio_op_write_hint(inode_write_hint(inode));
 			task_io_account_write(bio->bi_iter.bi_size);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4


* [PATCH 08/12] fs: add support for buffered writeback to pass down write hints
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (6 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 07/12] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-15 16:42 ` [PATCH 09/12] ext4: add support for passing in write hints for buffered writes Jens Axboe
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/buffer.c | 14 +++++++++-----
 fs/mpage.c  |  1 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..3faf73a71d4b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-			 struct writeback_control *wbc);
+			 unsigned int stream, struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+					inode_write_hint(inode), wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+					inode_write_hint(inode), wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
 }
 
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-			 struct writeback_control *wbc)
+			 unsigned int write_hint, struct writeback_control *wbc)
 {
 	struct bio *bio;
 
@@ -3134,6 +3136,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 		op_flags |= REQ_META;
 	if (buffer_prio(bh))
 		op_flags |= REQ_PRIO;
+
+	op_flags |= bio_op_write_hint(write_hint);
 	bio_set_op_attrs(bio, op, op_flags);
 
 	submit_bio(bio);
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 
 int submit_bh(int op, int op_flags, struct buffer_head *bh)
 {
-	return submit_bh_wbc(op, op_flags, bh, NULL);
+	return submit_bh_wbc(op, op_flags, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..df0635c8a512 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 			goto confused;
 
 		wbc_init_bio(wbc, bio);
+		bio->bi_opf |= bio_op_write_hint(inode_write_hint(inode));
 	}
 
 	/*
-- 
2.7.4


* [PATCH 09/12] ext4: add support for passing in write hints for buffered writes
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (7 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 08/12] fs: add support for buffered writeback to pass down write hints Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-15 16:42 ` [PATCH 10/12] xfs: " Jens Axboe
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ext4/page-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..764bf0ddecd4 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -349,6 +349,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
 	if (bio) {
 		int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
 				  REQ_SYNC : 0;
+		io_op_flags |= bio_op_write_hint(inode_write_hint(io->io_end->inode));
 		bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
 		submit_bio(io->io_bio);
 	}
@@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
 		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;
+		io->io_bio->bi_opf |= bio_op_write_hint(inode_write_hint(inode));
 	}
 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 10/12] xfs: add support for passing in write hints for buffered writes
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (8 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 09/12] ext4: add support for passing in write hints for buffered writes Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-15 16:42 ` [PATCH 11/12] btrfs: " Jens Axboe
  2017-06-15 16:42 ` [PATCH 12/12] nvme: add support for streams and directives Jens Axboe
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 09af0f7cd55e..fe11fe47d235 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,6 +505,7 @@ xfs_submit_ioend(
 		return status;
 	}
 
+	ioend->io_bio->bi_opf |= bio_op_write_hint(inode_write_hint(ioend->io_inode));
 	submit_bio(ioend->io_bio);
 	return 0;
 }
@@ -564,6 +565,7 @@ xfs_chain_bio(
 	bio_chain(ioend->io_bio, new);
 	bio_get(ioend->io_bio);		/* for xfs_destroy_ioend */
 	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+	ioend->io_bio->bi_opf |= bio_op_write_hint(inode_write_hint(ioend->io_inode));
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 11/12] btrfs: add support for passing in write hints for buffered writes
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (9 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 10/12] xfs: " Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  2017-06-15 16:42 ` [PATCH 12/12] nvme: add support for streams and directives Jens Axboe
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..2bc2dfca87c2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2826,6 +2826,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 	bio_add_page(bio, page, page_size, offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
+	op_flags |= bio_op_write_hint(inode_write_hint(page->mapping->host));
 	bio_set_op_attrs(bio, op, op_flags);
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 12/12] nvme: add support for streams and directives
  2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
                   ` (10 preceding siblings ...)
  2017-06-15 16:42 ` [PATCH 11/12] btrfs: " Jens Axboe
@ 2017-06-15 16:42 ` Jens Axboe
  11 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-15 16:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

This adds support for Directives in NVMe, particularly for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data,
so that the device can do so most efficiently. If an application is
managing and writing data with different life times, mixing data with
different retention onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and
life time of the device.

We default to allocating 4 streams per namespace, but it is
configurable with the 'streams_per_ns' module option. If a write stream
is set in a write, flag it as such before sending it to the device. The
streams are allocated lazily - if we get a write request with a life
time hint, we allocate streams in the background and use them once that
is done.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/core.c | 175 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   5 ++
 include/linux/nvme.h     |  48 +++++++++++++
 3 files changed, 228 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..30a6473b68cc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
 module_param(force_apst, bool, 0644);
 MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
 
+static char streams_per_ns = 4;
+module_param(streams_per_ns, byte, 0644);
+MODULE_PARM_DESC(streams_per_ns, "if available, allocate this many streams per NS");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -331,6 +335,151 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
+static int nvme_enable_streams(struct nvme_ctrl *ctrl)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_send;
+	c.directive.nsid = cpu_to_le32(0xffffffff);
+	c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+	c.directive.dtype = NVME_DIR_IDENTIFY;
+	c.directive.tdtype = NVME_DIR_STREAMS;
+	c.directive.endir = NVME_DIR_ENDIR;
+
+	return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_probe_directives(struct nvme_ctrl *ctrl)
+{
+	struct streams_directive_params s;
+	struct nvme_command c;
+	int ret;
+
+	if (!(ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+		return 0;
+
+	ret = nvme_enable_streams(ctrl);
+	if (ret)
+		return ret;
+
+	memset(&c, 0, sizeof(c));
+	memset(&s, 0, sizeof(s));
+
+	c.directive.opcode = nvme_admin_directive_recv;
+	c.directive.nsid = cpu_to_le32(0xffffffff);
+	c.directive.numd = sizeof(s);
+	c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+	c.directive.dtype = NVME_DIR_STREAMS;
+
+	ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, &s, sizeof(s));
+	if (ret)
+		return ret;
+
+	ctrl->nssa = le16_to_cpu(s.nssa);
+	return 0;
+}
+
+/*
+ * Returns number of streams allocated for use by this ns, or -1 on error.
+ */
+static int nvme_streams_allocate(struct nvme_ns *ns, unsigned int streams)
+{
+	struct nvme_command c;
+	union nvme_result res;
+	int ret;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_recv;
+	c.directive.nsid = cpu_to_le32(ns->ns_id);
+	c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+	c.directive.dtype = NVME_DIR_STREAMS;
+	c.directive.endir = streams;
+
+	ret = __nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, &res, NULL, 0, 0,
+					NVME_QID_ANY, 0, 0);
+	if (ret)
+		return -1;
+
+	return le32_to_cpu(res.u32) & 0xffff;
+}
+
+static int nvme_streams_deallocate(struct nvme_ns *ns)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_send;
+	c.directive.nsid = cpu_to_le32(ns->ns_id);
+	c.directive.doper = NVME_DIR_SND_ST_OP_REL_RSC;
+	c.directive.dtype = NVME_DIR_STREAMS;
+
+	return nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, NULL, 0);
+}
+
+static void nvme_write_hint_work(struct work_struct *work)
+{
+	struct nvme_ns *ns = container_of(work, struct nvme_ns, write_hint_work);
+	int ret, nr_streams;
+
+	if (ns->nr_streams)
+		return;
+
+	nr_streams = streams_per_ns;
+	if (nr_streams > ns->ctrl->nssa)
+		nr_streams = ns->ctrl->nssa;
+
+	ret = nvme_streams_allocate(ns, nr_streams);
+	if (ret <= 0)
+		goto err;
+
+	ns->nr_streams = ret;
+	dev_info(ns->ctrl->device, "successfully enabled %d streams\n", ret);
+	return;
+err:
+	dev_info(ns->ctrl->device, "failed enabling streams\n");
+	ns->ctrl->failed_streams = true;
+}
+
+static void nvme_configure_streams(struct nvme_ns *ns)
+{
+	/*
+	 * If we already called this function, we've either marked it
+	 * as a failure or set the number of streams.
+	 */
+	if (ns->ctrl->failed_streams)
+		return;
+	if (ns->nr_streams)
+		return;
+	schedule_work(&ns->write_hint_work);
+}
+
+static unsigned int nvme_get_write_stream(struct nvme_ns *ns,
+					  struct request *req)
+{
+	unsigned int streamid = 0;
+
+	if (req->cmd_flags & REQ_WRITE_SHORT)
+		streamid = 1;
+	else if (req->cmd_flags & REQ_WRITE_MEDIUM)
+		streamid = 2;
+	else if (req->cmd_flags & REQ_WRITE_LONG)
+		streamid = 3;
+	else if (req->cmd_flags & REQ_WRITE_EXTREME)
+		streamid = 4;
+
+	req->q->stream_writes[streamid] += blk_rq_bytes(req) >> 9;
+
+	if (streamid <= ns->nr_streams)
+		return streamid;
+
+	/* for now just round-robin, do something more clever later */
+	return (streamid % (ns->nr_streams + 1));
+}
+
 static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
@@ -351,6 +500,25 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
+	/*
+	 * If we support streams and they aren't enabled, do so now. Until
+	 * they are enabled, we won't flag the write with a stream. If we don't
+	 * support streams, just ignore the life time hint.
+	 */
+	if (req_op(req) == REQ_OP_WRITE && op_write_hint_valid(req->cmd_flags)) {
+		struct nvme_ctrl *ctrl = ns->ctrl;
+
+		if (ns->nr_streams) {
+			unsigned int stream = nvme_get_write_stream(ns, req);
+
+			if (stream) {
+				control |= NVME_RW_DTYPE_STREAMS;
+				dsmgmt |= (stream << 16);
+			}
+		} else if (ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)
+			nvme_configure_streams(ns);
+	}
+
 	if (ns->ms) {
 		switch (ns->pi_type) {
 		case NVME_NS_DPS_PI_TYPE3:
@@ -1650,6 +1818,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		dev_pm_qos_hide_latency_tolerance(ctrl->device);
 
 	nvme_configure_apst(ctrl);
+	nvme_probe_directives(ctrl);
 
 	ctrl->identified = true;
 
@@ -2049,6 +2218,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
 	nvme_set_queue_limits(ctrl, ns->queue);
 
+	INIT_WORK(&ns->write_hint_work, nvme_write_hint_work);
+
 	sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
 
 	if (nvme_revalidate_ns(ns, &id))
@@ -2105,6 +2276,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
+	flush_work(&ns->write_hint_work);
+
 	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
 		if (blk_get_integrity(ns->disk))
 			blk_integrity_unregister(ns->disk);
@@ -2112,6 +2285,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 					&nvme_ns_attr_group);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
+		if (ns->nr_streams)
+			nvme_streams_deallocate(ns);
 		del_gendisk(ns->disk);
 		blk_cleanup_queue(ns->queue);
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..918b6126d38b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -118,6 +118,7 @@ enum nvme_ctrl_state {
 struct nvme_ctrl {
 	enum nvme_ctrl_state state;
 	bool identified;
+	bool failed_streams;
 	spinlock_t lock;
 	const struct nvme_ctrl_ops *ops;
 	struct request_queue *admin_q;
@@ -147,6 +148,7 @@ struct nvme_ctrl {
 	u16 oncs;
 	u16 vid;
 	u16 oacs;
+	u16 nssa;
 	atomic_t abort_limit;
 	u8 event_limit;
 	u8 vwc;
@@ -192,6 +194,7 @@ struct nvme_ns {
 	u8 uuid[16];
 
 	unsigned ns_id;
+	unsigned nr_streams;
 	int lba_shift;
 	u16 ms;
 	bool ext;
@@ -203,6 +206,8 @@ struct nvme_ns {
 
 	u64 mode_select_num_blocks;
 	u32 mode_select_block_len;
+
+	struct work_struct write_hint_work;
 };
 
 struct nvme_ctrl_ops {
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
 	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
@@ -295,6 +296,19 @@ enum {
 };
 
 enum {
+	NVME_DIR_IDENTIFY		= 0x00,
+	NVME_DIR_STREAMS		= 0x01,
+	NVME_DIR_SND_ID_OP_ENABLE	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_ID	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_RSC	= 0x02,
+	NVME_DIR_RCV_ID_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_STATUS	= 0x02,
+	NVME_DIR_RCV_ST_OP_RESOURCE	= 0x03,
+	NVME_DIR_ENDIR			= 0x01,
+};
+
+enum {
 	NVME_NS_FEAT_THIN	= 1 << 0,
 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
 	NVME_NS_FLBAS_META_EXT	= 0x10,
@@ -535,6 +549,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_APP	= 1 << 11,
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
+	NVME_RW_DTYPE_STREAMS		= 1 << 4,
 };
 
 struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_directive_send	= 0x19,
+	nvme_admin_directive_recv	= 0x1a,
 	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
 	__u32			rsvd14[2];
 };
 
+struct nvme_directive_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__le32			numd;
+	__u8			doper;
+	__u8			dtype;
+	__le16			dspec;
+	__u8			endir;
+	__u8			tdtype;
+	__u16			rsvd15;
+
+	__u32			rsvd16[3];
+};
+
 /*
  * Fabrics subcommands.
  */
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
 	__u32			rsvd12[6];
 };
 
+struct streams_directive_params {
+	__u16	msl;
+	__u16	nssa;
+	__u16	nsso;
+	__u8	rsvd[10];
+	__u32	sws;
+	__u16	sgs;
+	__u16	nsa;
+	__u16	nso;
+	__u8	rsvd2[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
 		struct nvme_dbbuf dbbuf;
+		struct nvme_directive_cmd directive;
 	};
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 02/12] blk-mq: expose stream write stats through debugfs
  2017-06-15 16:42 ` [PATCH 02/12] blk-mq: expose stream write stats through debugfs Jens Axboe
@ 2017-06-16 16:38   ` Martin K. Petersen
  2017-06-16 16:41     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Martin K. Petersen @ 2017-06-16 16:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen


Jens,

> Useful to verify that things are working the way they should.
> Reading the file will return number of kb written to each
> stream. Writing the file will reset the statistics. No care
> is taken to ensure that we don't race on updates.
>
> Drivers will write to q->stream_writes[] if they handle a stream.

s/stream/write_lifetime_bucket/ or something like that.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 01/12] block: add support for carrying stream information in a bio
  2017-06-15 16:41 ` [PATCH 01/12] block: add support for carrying stream information in a bio Jens Axboe
@ 2017-06-16 16:39   ` Martin K. Petersen
  2017-06-16 16:42     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Martin K. Petersen @ 2017-06-16 16:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen


Jens,

> No functional changes in this patch, we just add four flags
> that will be used to denote a stream type, and ensure that we
> don't merge across different stream types.

More stream terminology...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 02/12] blk-mq: expose stream write stats through debugfs
  2017-06-16 16:38   ` Martin K. Petersen
@ 2017-06-16 16:41     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-16 16:41 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, linux-block, adilger, hch

On 06/16/2017 10:38 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> Useful to verify that things are working the way they should.
>> Reading the file will return number of kb written to each
>> stream. Writing the file will reset the statistics. No care
>> is taken to ensure that we don't race on updates.
>>
>> Drivers will write to q->stream_writes[] if they handle a stream.
> 
> s/stream/write_lifetime_bucket/ or something like that.

Yeah, it's the only piece left over. I'll make that change.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 01/12] block: add support for carrying stream information in a bio
  2017-06-16 16:39   ` Martin K. Petersen
@ 2017-06-16 16:42     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-16 16:42 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, linux-block, adilger, hch

On 06/16/2017 10:39 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> No functional changes in this patch, we just add four flags
>> that will be used to denote a stream type, and ensure that we
>> don't merge across different stream types.
> 
> More stream terminology...

Thanks, will fix up that too...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/12] fs: add fcntl() interface for setting/getting write life time hints
  2017-06-15 16:42 ` [PATCH 05/12] fs: add fcntl() interface for setting/getting " Jens Axboe
@ 2017-06-16 16:44   ` Martin K. Petersen
  2017-06-16 16:55     ` Jens Axboe
  2017-06-16 17:59     ` Christoph Hellwig
  0 siblings, 2 replies; 22+ messages in thread
From: Martin K. Petersen @ 2017-06-16 16:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen


Jens,

> We have a pwritev2(2) interface based on passing in flags. Add an
> fcntl interface for querying these flags, and for setting them
> as well:
>
> F_GET_WRITE_LIFE	Returns one of the valid write hint types,
> 			like WRITE_HINT_MEDIUM.
>
> F_SET_WRITE_LIFE	Pass in a WRITE_HINT_* type to set the
> 			write life time hint for this file/inode.
> 			Returns 0 on success, -1 otherwise.

It seems like overkill to have different fcntls for different
hints. And since we are expecting more, maybe these should be
F_{GET,SET}_HINT and then the individual flags can be
WRITE_LIFETIME_FOOBAR?

Otherwise OK with the fcntl approach.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 06/12] block: add helpers for setting/checking write hint validity
  2017-06-15 16:42 ` [PATCH 06/12] block: add helpers for setting/checking write hint validity Jens Axboe
@ 2017-06-16 16:47   ` Martin K. Petersen
  2017-06-16 16:53     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Martin K. Petersen @ 2017-06-16 16:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen


Jens,

> +static const unsigned int rwf_write_to_opf_flag[] = {
> +	0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
> +};

Minor nit: When I see WRITE_SHORT I instinctively think data corruption.

Can we make these REQ_LIFETIME_SHORT or something instead? It loses the
WRITE moniker which I'm not so keen on. But I'm not sure how we'd define
read lifetime...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 06/12] block: add helpers for setting/checking write hint validity
  2017-06-16 16:47   ` Martin K. Petersen
@ 2017-06-16 16:53     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-16 16:53 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, linux-block, adilger, hch

On 06/16/2017 10:47 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> +static const unsigned int rwf_write_to_opf_flag[] = {
>> +	0, REQ_WRITE_SHORT, REQ_WRITE_MEDIUM, REQ_WRITE_LONG, REQ_WRITE_EXTREME
>> +};
> 
> Minor nit: When I see WRITE_SHORT I instinctively think data corruption.
> 
> Can we make these REQ_LIFETIME_SHORT or something instead? It loses the
> WRITE moniker which I'm not so keen on. But I'm not sure how we'd define
> read lifetime...

I did have that same feeling when writing it... The good news is that v6
will just use the WRITE_HINT_* types everywhere, so this one is already
gone.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/12] fs: add fcntl() interface for setting/getting write life time hints
  2017-06-16 16:44   ` Martin K. Petersen
@ 2017-06-16 16:55     ` Jens Axboe
  2017-06-16 17:59     ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2017-06-16 16:55 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-fsdevel, linux-block, adilger, hch

On 06/16/2017 10:44 AM, Martin K. Petersen wrote:
> 
> Jens,
> 
>> We have a pwritev2(2) interface based on passing in flags. Add an
>> fcntl interface for querying these flags, and for setting them
>> as well:
>>
>> F_GET_WRITE_LIFE	Returns one of the valid write hint types,
>> 			like WRITE_HINT_MEDIUM.
>>
>> F_SET_WRITE_LIFE	Pass in a WRITE_HINT_* type to set the
>> 			write life time hint for this file/inode.
>> 			Returns 0 on success, -1 otherwise.
> 
> It seems like overkill to have different fcntls for different
> hints. And since we are expecting more, maybe these should be
> F_{GET,SET}_HINT and then the individual flags can be
> WRITE_LIFETIME_FOOBAR?
> 
> Otherwise OK with the fcntl approach.

OK, that's a useful suggestion. The hints are already of the
WRITE_HINT_* variant, so I don't think we need to change that. I'll
change the name.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 05/12] fs: add fcntl() interface for setting/getting write life time hints
  2017-06-16 16:44   ` Martin K. Petersen
  2017-06-16 16:55     ` Jens Axboe
@ 2017-06-16 17:59     ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2017-06-16 17:59 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Jens Axboe, linux-fsdevel, linux-block, adilger, hch

On Fri, Jun 16, 2017 at 12:44:09PM -0400, Martin K. Petersen wrote:
> It seems like overkill to have different fcntls for different
> hints. And since we are expecting more, maybe these should be
> F_{GET,SET}_HINT and then the individual flags can be
> WRITE_LIFETIME_FOOBAR?

That's what I was trying to explain earlier - have
F_{GET,SET}_HINT take a u16 (or maybe even a u32 or u64 with the
remainder reserved) and then take 3 bits for the write lifetime.
Which btw means we'd have another 3 possible values left if
we encode it as a value instead of as bits.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-06-16 17:59 UTC | newest]

Thread overview: 22+ messages
2017-06-15 16:41 [PATCHSET v5] Add support for write life time hints Jens Axboe
2017-06-15 16:41 ` [PATCH 01/12] block: add support for carrying stream information in a bio Jens Axboe
2017-06-16 16:39   ` Martin K. Petersen
2017-06-16 16:42     ` Jens Axboe
2017-06-15 16:42 ` [PATCH 02/12] blk-mq: expose stream write stats through debugfs Jens Axboe
2017-06-16 16:38   ` Martin K. Petersen
2017-06-16 16:41     ` Jens Axboe
2017-06-15 16:42 ` [PATCH 03/12] fs: add support for an inode to carry write hint related data Jens Axboe
2017-06-15 16:42 ` [PATCH 04/12] fs: add support for allowing applications to pass in write life time hints Jens Axboe
2017-06-15 16:42 ` [PATCH 05/12] fs: add fcntl() interface for setting/getting " Jens Axboe
2017-06-16 16:44   ` Martin K. Petersen
2017-06-16 16:55     ` Jens Axboe
2017-06-16 17:59     ` Christoph Hellwig
2017-06-15 16:42 ` [PATCH 06/12] block: add helpers for setting/checking write hint validity Jens Axboe
2017-06-16 16:47   ` Martin K. Petersen
2017-06-16 16:53     ` Jens Axboe
2017-06-15 16:42 ` [PATCH 07/12] fs: add O_DIRECT support for sending down bio stream information Jens Axboe
2017-06-15 16:42 ` [PATCH 08/12] fs: add support for buffered writeback to pass down write hints Jens Axboe
2017-06-15 16:42 ` [PATCH 09/12] ext4: add support for passing in write hints for buffered writes Jens Axboe
2017-06-15 16:42 ` [PATCH 10/12] xfs: " Jens Axboe
2017-06-15 16:42 ` [PATCH 11/12] btrfs: " Jens Axboe
2017-06-15 16:42 ` [PATCH 12/12] nvme: add support for streams and directives Jens Axboe
