linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v7] Add support for write life time hints
@ 2017-06-17 19:59 Jens Axboe
  2017-06-17 19:59 ` [PATCH 01/11] fs: add support for an inode to carry write hint related data Jens Axboe
                   ` (10 more replies)
  0 siblings, 11 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen

A new iteration of this patchset, previously known as write streams.
As before, this patchset aims at enabling applications split up
writes into separate streams, based on the perceived life time
of the data written. This is useful for a variety of reasons:

- For NVMe, this feature is ratified and released with the NVMe 1.3
  spec. Devices implementing Directives can expose multiple streams.
  Separating data written into streams based on life time can
  drastically reduce the write amplification. This helps device
  endurance, and increases performance. Testing just performed
  internally at Facebook with these patches showed up to a 25% reduction
  in NAND writes in a RocksDB setup.

- Software caching solutions can make more intelligent decisions
  on how and where to place data.

Contrary to previous patches, we're not exposing numeric stream values anymore.
I've previously advocated for just doing a set of hints that makes sense
instead. See the coverage from the LSFMM summit this year:

https://lwn.net/Articles/717755/

This patchset attempts to do that. We define 4 flags for the pwritev2
system call:

RWF_WRITE_LIFE_SHORT	Data written with this flag is expected to have
			a high overwrite rate, or life time.

RWF_WRITE_LIFE_MEDIUM	Longer life time than SHORT

RWF_WRITE_LIFE_LONG	Longer life time than MEDIUM

RWF_WRITE_LIFE_EXTREME	Longer life time than LONG

The idea is that these are relative values, so an application can
use them as they see fit. The underlying device can then place
data appropriately, or be free to ignore the hint. It's just a hint.

Similarly, to query and set these values on the side, there's now
an fcntl based interface. This exposes the WRITE_LIFE_* values to
userspace, and defines F_{GET,SET}_RW_HINT commands to get and set
them as well.

A branch based on current master can be pulled
from here:

git://git.kernel.dk/linux-block write-stream.7

Changes since v6:

- Rewrite NVMe write stream assignment
- Change NVMe stream assignment to be per-controller, not per-ns. Then
  we can use the same IDs across name spaces, and we don't have to do
  lazy setup of streams.
- If streams are enabled on nvme, set io min/opt and discard
  granularity based on the stream params reported.
- Fixup F_SET_RW_HINT definition, it was 20, should have been 12.

Changes since v5:

- Change enum write_hint to enum rw_hint.
- Change fcntl() interface to be read/write generic
- Bring enum rw_hint all the way to bio/request
- Change references to streams in changelogs and debugfs interface
- Rebase to master to resolve blkdev.h conflict
- Reshuffle patches so the WRITE_LIFE_* hints and type come first. Allowed
  me to merge two block patches as well.

Changes since v4:

- Add enum write_hint and the WRITE_HINT_* values. This is what we
  use internally (until transformed to req/bio flags), and what is
  exposed to user space with the fcntl() interface. Maps directly
  to the RWF_WRITE_LIFE_* values.
- Add fcntl() interface for getting/setting hint values.
- Get rid of inode ->i_write_hint, encode the 3 bits of hint info
  in the inode flags intead.
- Allow a write with no hint to clear the old hint. Previously we
  only changed the hint if a new valid hint was given, not if no
  hint was passed in.
- Shrink flag space grabbed from 4 to 3 bits for RWF_* and the inode
  flags.

Changes since v3:

- Change any naming of stream ID to write hint.
- Various little API changes, suggested by Christoph
- Cleanup the NVMe bits, dump the debug info.
- Change NVMe to lazily allocate the streams.
- Various NVMe error handling improvements and command checking.

Changes since v2:

- Get rid of bio->bi_stream and replace with four request/bio flags.
  These map directly to the RWF_WRITE_* flags that the user passes in.
- Cleanup the NVMe stream setting.
- Drivers now responsible for updating the queue stream write counter,
  as they determine what stream to map a given flag to.

Changes since v1:

- Guard queue stream stats to ensure we don't mess up memory, if
  bio_stream() ever were to return a larger value than we support.
- NVMe: ensure we set the stream modulo the name space defined count.
- Cleanup the RWF_ and IOCB_ flags. Set aside 4 bits, and just store
  the stream value in there. This makes the passing of stream ID from
  RWF_ space to IOCB_ (and IOCB_ to bio) more efficient, and cleans it
  up in general.
- Kill the block internal definitions of the stream type, we don't need
  them anymore. See above.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 01/11] fs: add support for an inode to carry write hint related data
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-19  6:26   ` Christoph Hellwig
  2017-06-17 19:59 ` [PATCH 02/11] block: add support for write hints in a bio Jens Axboe
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

No functional changes in this patch, just in preparation for
allowing applications to pass in hints about data life times
for writes. Set aside 3 bits for carrying hint information
in the inode flags.

Adds the public hints as well, which are:

WRITE_LIFE_NONE		No hints about write life time
WRITE_LIFE_SHORT	Data written has a short life time
WRITE_LIFE_MEDIUM	Data written has a medium life time
WRITE_LIFE_LONG		Data written has a long life time
WRITE_LIFE_EXTREME	Data written has an extremely long life tim

Helpers are defined to store these values in flags, by passing in
the shift that's appropriate for the given use case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/inode.c              | 11 +++++++++++
 include/linux/fs.h      | 29 +++++++++++++++++++++++++++++
 include/uapi/linux/fs.h | 13 +++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..defb015a2c6d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2120,3 +2120,14 @@ struct timespec current_time(struct inode *inode)
 	return timespec_trunc(now, inode->i_sb->s_time_gran);
 }
 EXPORT_SYMBOL(current_time);
+
+void inode_set_write_hint(struct inode *inode, enum rw_hint hint)
+{
+	unsigned int flags = write_hint_to_mask(hint, S_WRITE_LIFE_SHIFT);
+
+	if (flags != mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
+		inode_lock(inode);
+		inode_set_flags(inode, flags, S_WRITE_LIFE_MASK);
+		inode_unlock(inode);
+	}
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 803e5a9b2654..472c83156606 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1828,6 +1828,14 @@ struct super_operations {
 #endif
 
 /*
+ * Expected life time hint of a write for this inode. This uses the
+ * WRITE_LIFE_* encoding, we just need to define the shift. We need
+ * 3 bits for this. Next S_* value is 131072, bit 17.
+ */
+#define S_WRITE_LIFE_MASK	0x1c000	/* bits 14..16 */
+#define S_WRITE_LIFE_SHIFT	14	/* 16384, next bit */
+
+/*
  * Note that nosuid etc flags are inode-specific: setting some file-system
  * flags just means all the inodes inherit those flags by default. It might be
  * possible to override it selectively if you really wanted to with some
@@ -1873,6 +1881,26 @@ static inline bool HAS_UNMAPPED_ID(struct inode *inode)
 	return !uid_valid(inode->i_uid) || !gid_valid(inode->i_gid);
 }
 
+static inline unsigned int write_hint_to_mask(enum rw_hint hint,
+					      unsigned int shift)
+{
+	return hint << shift;
+}
+
+static inline enum rw_hint mask_to_write_hint(unsigned int mask,
+					      unsigned int shift)
+{
+	return (mask >> shift) & 0x7;
+}
+
+static inline unsigned int inode_write_hint(struct inode *inode)
+{
+	if (inode)
+		return mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+
+	return 0;
+}
+
 /*
  * Inode state bits.  Protected by inode->i_lock
  *
@@ -2757,6 +2785,7 @@ extern struct inode *new_inode(struct super_block *sb);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_privs(struct file *);
+extern void inode_set_write_hint(struct inode *inode, enum rw_hint hint);
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
 static inline void insert_inode_hash(struct inode *inode)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..8fb3b5a6e1ec 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -356,6 +356,19 @@ struct fscrypt_key {
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * Write life time hint values.
+ */
+enum rw_hint {
+	WRITE_LIFE_NONE = 0,
+	WRITE_LIFE_SHORT,
+	WRITE_LIFE_MEDIUM,
+	WRITE_LIFE_LONG,
+	WRITE_LIFE_EXTREME,
+};
+
+#define RW_HINT_MASK		0x7	/* 3 bits */
+
 /* flags for preadv2/pwritev2: */
 #define RWF_HIPRI			0x00000001 /* high priority request, poll if possible */
 #define RWF_DSYNC			0x00000002 /* per-IO O_DSYNC */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 02/11] block: add support for write hints in a bio
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
  2017-06-17 19:59 ` [PATCH 01/11] fs: add support for an inode to carry write hint related data Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-17 19:59 ` [PATCH 03/11] blk-mq: expose stream write hints through debugfs Jens Axboe
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

No functional changes in this patch, we just set aside 3 bits
in the bio/request flags, which can be used to hold a WRITE_HINT_*
life time hint.

Ensure that we don't merge requests that have different life time
hints assigned to them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c         | 16 ++++++++++++++++
 include/linux/blk_types.h | 15 +++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..7d299df3b12b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,14 @@ static struct request *attempt_merge(struct request_queue *q,
 		return NULL;
 
 	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+	    (next->cmd_flags & REQ_WRITE_LIFE_MASK))
+		return NULL;
+
+	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
 	 * will have updated segment counts, update sector
@@ -811,6 +819,14 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !blk_write_same_mergeable(rq->bio, bio))
 		return false;
 
+	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+	    (bio->bi_opf & REQ_WRITE_LIFE_MASK))
+		return false;
+
 	return true;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..0dd8d569d52f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -7,6 +7,7 @@
 
 #include <linux/types.h>
 #include <linux/bvec.h>
+#include <linux/fs.h>
 
 struct bio_set;
 struct bio;
@@ -201,6 +202,9 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+	__REQ_WRITE_HINT_SHIFT,	/* 3 bits for life time hint */
+	__REQ_WRITE_HINT_PAD1,
+	__REQ_WRITE_HINT_PAD2,
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
@@ -221,6 +225,12 @@ enum req_flag_bits {
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
+#define REQ_WRITE_SHORT		(WRITE_LIFE_SHORT << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_MEDIUM	(WRITE_LIFE_MEDIUM << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_LONG		(WRITE_LIFE_LONG << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_EXTREME	(WRITE_LIFE_EXTREME << __REQ_WRITE_HINT_SHIFT)
+
+#define REQ_WRITE_LIFE_MASK	(0x7 << __REQ_WRITE_HINT_SHIFT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 
@@ -312,4 +322,9 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+static inline unsigned int write_hint_to_opf(enum rw_hint hint)
+{
+	return hint << __REQ_WRITE_HINT_SHIFT;
+}
+
 #endif /* __LINUX_BLK_TYPES_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 03/11] blk-mq: expose stream write hints through debugfs
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
  2017-06-17 19:59 ` [PATCH 01/11] fs: add support for an inode to carry write hint related data Jens Axboe
  2017-06-17 19:59 ` [PATCH 02/11] block: add support for write hints in a bio Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-17 19:59 ` [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Useful to verify that things are working the way they should.
Reading the file will return number of kb written with each
write hint. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Drivers will write to q->write_hints[] if they handle a given
write hint.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-debugfs.c | 24 ++++++++++++++++++++++++
 include/linux/blkdev.h |  3 +++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 803aed4d7221..b069a76f7fc7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
 	}
 }
 
+static int queue_write_hint_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	int i;
+
+	for (i = 0; i < BLK_MAX_WRITE_HINTS; i++)
+		seq_printf(m, "hint%d: %llu\n", i, q->write_hints[i]);
+
+	return 0;
+}
+
+static ssize_t queue_write_hint_store(void *data, const char __user *buf,
+				      size_t count, loff_t *ppos)
+{
+	struct request_queue *q = data;
+	int i;
+
+	for (i = 0; i < BLK_MAX_WRITE_HINTS; i++)
+		q->write_hints[i] = 0;
+
+	return count;
+}
+
 static int queue_poll_stat_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
@@ -656,6 +679,7 @@ const struct file_operations blk_mq_debugfs_fops = {
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{"poll_stat", 0400, queue_poll_stat_show},
 	{"state", 0600, queue_state_show, queue_state_write},
+	{"write_hints", 0600, queue_write_hint_show, queue_write_hint_store},
 	{},
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b74a3edcb3da..ab36068f6129 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -588,6 +588,9 @@ struct request_queue {
 	void			*rq_alloc_data;
 
 	struct work_struct	release_work;
+
+#define BLK_MAX_WRITE_HINTS	5
+	u64			write_hints[BLK_MAX_WRITE_HINTS];
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (2 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 03/11] blk-mq: expose stream write hints through debugfs Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-19  6:27   ` Christoph Hellwig
  2017-06-17 19:59 ` [PATCH 05/11] fs: add fcntl() interface for setting/getting " Jens Axboe
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Add four flags for the pwritev2(2) system call, allowing an application
to give the kernel a hint about what on-media life times can be
expected from a given write.

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Set aside 3 bits in the iocb flags structure to carry this information
over from the pwritev2 RWF_WRITE_LIFE_* flags.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/read_write.c         | 12 +++++++++++-
 include/linux/fs.h      | 12 ++++++++++++
 include/uapi/linux/fs.h | 10 ++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 19d4d88fa285..975fe1d46a59 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -675,10 +675,11 @@ EXPORT_SYMBOL(iov_shorten);
 static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 		loff_t *ppos, int type, int flags)
 {
+	struct inode *inode = file_inode(filp);
 	struct kiocb kiocb;
 	ssize_t ret;
 
-	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
+	if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
 		return -EOPNOTSUPP;
 
 	init_sync_kiocb(&kiocb, filp);
@@ -688,6 +689,15 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter,
 		kiocb.ki_flags |= IOCB_DSYNC;
 	if (flags & RWF_SYNC)
 		kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+	if ((flags & RWF_WRITE_LIFE_MASK) ||
+	    mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
+		enum rw_hint hint;
+
+		hint = mask_to_write_hint(flags, RWF_WRITE_LIFE_SHIFT);
+
+		inode_set_write_hint(inode, hint);
+		kiocb.ki_flags |= write_hint_to_mask(hint, IOCB_WRITE_LIFE_SHIFT);
+	}
 	kiocb.ki_pos = *ppos;
 
 	if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 472c83156606..a024b32259bf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -269,6 +269,12 @@ struct writeback_control;
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 
+/*
+ * Steal 3 bits for stream information, this allows 8 valid streams
+ */
+#define IOCB_WRITE_LIFE_SHIFT	7
+#define IOCB_WRITE_LIFE_MASK	(BIT(7) | BIT(8) | BIT(9))
+
 struct kiocb {
 	struct file		*ki_filp;
 	loff_t			ki_pos;
@@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 	};
 }
 
+static inline int iocb_write_hint(const struct kiocb *iocb)
+{
+	return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
+			IOCB_WRITE_LIFE_SHIFT;
+}
+
 /*
  * "descriptor" for what we're up to with a read.
  * This allows us to use the same read code yet
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 8fb3b5a6e1ec..0d9d331d3b61 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -374,4 +374,14 @@ enum rw_hint {
 #define RWF_DSYNC			0x00000002 /* per-IO O_DSYNC */
 #define RWF_SYNC			0x00000004 /* per-IO O_SYNC */
 
+/*
+ * Data life time write flags, steal 3 bits for that
+ */
+#define RWF_WRITE_LIFE_SHIFT		4
+#define RWF_WRITE_LIFE_MASK		0x00000070 /* 3 bits of write hints */
+#define RWF_WRITE_LIFE_SHORT		(WRITE_LIFE_SHORT << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_MEDIUM		(WRITE_LIFE_MEDIUM << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_LONG		(WRITE_LIFE_LONG << RWF_WRITE_LIFE_SHIFT)
+#define RWF_WRITE_LIFE_EXTREME		(WRITE_LIFE_EXTREME << RWF_WRITE_LIFE_SHIFT)
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 05/11] fs: add fcntl() interface for setting/getting write life time hints
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (3 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-19  6:28   ` Christoph Hellwig
  2017-06-17 19:59 ` [PATCH 06/11] fs: add O_DIRECT support for sending down " Jens Axboe
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

We have a pwritev2(2) interface based on passing in flags. Add an
fcntl interface for querying these flags, and also for setting them
as well:

F_GET_RW_HINT		Returns the read/write hint set. Right now it
			will be one of the WRITE_LIFE_* values.

F_SET_RW_HINT		Pass in rw_hint type to set the read/write hint.
			Only WRITE_LIFE_* hints are currently supported.
			Returns 0 on succes, -1 otherwise.

Sample program testing/implementing basic setting/getting of write
hints is below.

/*
 * writehint.c: check or set a file/inode write hint
 */

static char *str[] = { "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT",
			"WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG",
			"WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	int hint = -1, fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: dev <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	if (argc > 2)
		hint = atoi(argv[2]);

	if (hint == -1) {
		ret = fcntl(fd, F_GET_RW_HINT);
		if (ret < 0) {
			perror("fcntl: F_GET_RW_HINT");
			return 3;
		}
		hint = ret;
	} else {
		ret = fcntl(fd, F_SET_RW_HINT, hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	printf("%s: %shint %s\n", argv[1], hint != -1 ? "set " : "", str[hint]);
	close(fd);
	return 0;
}

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fcntl.c                 | 38 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fcntl.h |  6 ++++++
 2 files changed, 44 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f4e7267d117f..417ce336c875 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -243,6 +243,40 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 }
 #endif
 
+long fcntl_rw_hint(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct inode *inode = file_inode(file);
+	long ret;
+
+	switch (cmd) {
+	case F_GET_RW_HINT:
+		ret = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+		break;
+	case F_SET_RW_HINT: {
+		enum rw_hint hint = arg;
+
+		switch (hint) {
+		case WRITE_LIFE_NONE:
+		case WRITE_LIFE_SHORT:
+		case WRITE_LIFE_MEDIUM:
+		case WRITE_LIFE_LONG:
+		case WRITE_LIFE_EXTREME:
+			inode_set_write_hint(inode, hint);
+			ret = 0;
+			break;
+		default:
+			ret = -EINVAL;
+		}
+		break;
+		}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		struct file *filp)
 {
@@ -337,6 +371,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GET_SEALS:
 		err = shmem_fcntl(filp, cmd, arg);
 		break;
+	case F_GET_RW_HINT:
+	case F_SET_RW_HINT:
+		err = fcntl_rw_hint(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 813afd6eee71..f1a0fbc5bff1 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,12 @@
 /* (1U << 31) is reserved for signed error codes */
 
 /*
+ * Set/Get write life time hints
+ */
+#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
+#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 06/11] fs: add O_DIRECT support for sending down write life time hints
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (4 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 05/11] fs: add fcntl() interface for setting/getting " Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-19  6:28   ` Christoph Hellwig
  2017-06-17 19:59 ` [PATCH 07/11] fs: add support for buffered writeback to pass down write hints Jens Axboe
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c     | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 519599dddd36..b9222a9d285e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 			should_dirty = true;
 	} else {
 		bio.bi_opf = dio_bio_write_op(iocb);
+		bio.bi_opf |= write_hint_to_opf(iocb_write_hint(iocb));
 		task_io_account_write(ret);
 	}
 
@@ -374,6 +375,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio_set_pages_dirty(bio);
 		} else {
 			bio->bi_opf = dio_bio_write_op(iocb);
+			bio->bi_opf |= write_hint_to_opf(iocb_write_hint(iocb));
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index a04ebea77de8..1253b059eae6 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	bio->bi_opf |= write_hint_to_opf(iocb_write_hint(dio->iocb));
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4b10892967a5..e8639861bd80 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+			bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
 			task_io_account_write(bio->bi_iter.bi_size);
 		} else {
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 07/11] fs: add support for buffered writeback to pass down write hints
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (5 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 06/11] fs: add O_DIRECT support for sending down " Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-17 19:59 ` [PATCH 08/11] ext4: add support for passing in write hints for buffered writes Jens Axboe
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/buffer.c | 14 +++++++++-----
 fs/mpage.c  |  1 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..187e54fb0382 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-			 struct writeback_control *wbc);
+			 unsigned int stream, struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+					inode_write_hint(inode), wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+					inode_write_hint(inode), wbc);
 			nr_underway++;
 		}
 		bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
 }
 
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-			 struct writeback_control *wbc)
+			 unsigned int write_hint, struct writeback_control *wbc)
 {
 	struct bio *bio;
 
@@ -3134,6 +3136,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 		op_flags |= REQ_META;
 	if (buffer_prio(bh))
 		op_flags |= REQ_PRIO;
+
+	op_flags |= write_hint_to_opf(write_hint);
 	bio_set_op_attrs(bio, op, op_flags);
 
 	submit_bio(bio);
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 
 int submit_bh(int op, int op_flags, struct buffer_head *bh)
 {
-	return submit_bh_wbc(op, op_flags, bh, NULL);
+	return submit_bh_wbc(op, op_flags, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index baff8f820c29..e84dcaee6ec7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -614,6 +614,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 			goto confused;
 
 		wbc_init_bio(wbc, bio);
+		bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
 	}
 
 	/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 08/11] ext4: add support for passing in write hints for buffered writes
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (6 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 07/11] fs: add support for buffered writeback to pass down write hints Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-17 19:59 ` [PATCH 09/11] xfs: " Jens Axboe
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ext4/page-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1a82138ba739..fbc04a8cb35a 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -349,6 +349,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
 	if (bio) {
 		int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
 				  REQ_SYNC : 0;
+		io_op_flags |= write_hint_to_opf(inode_write_hint(io->io_end->inode));
 		bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
 		submit_bio(io->io_bio);
 	}
@@ -396,6 +397,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
 		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;
+		io->io_bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
 	}
 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 09/11] xfs: add support for passing in write hints for buffered writes
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (7 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 08/11] ext4: add support for passing in write hints for buffered writes Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-17 19:59 ` [PATCH 10/11] btrfs: " Jens Axboe
  2017-06-17 19:59 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
  10 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_aops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 09af0f7cd55e..040789c51930 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -505,6 +505,7 @@ xfs_submit_ioend(
 		return status;
 	}
 
+	ioend->io_bio->bi_opf |= write_hint_to_opf(inode_write_hint(ioend->io_inode));
 	submit_bio(ioend->io_bio);
 	return 0;
 }
@@ -564,6 +565,7 @@ xfs_chain_bio(
 	bio_chain(ioend->io_bio, new);
 	bio_get(ioend->io_bio);		/* for xfs_destroy_ioend */
 	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
+	ioend->io_bio->bi_opf |= write_hint_to_opf(inode_write_hint(ioend->io_inode));
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 10/11] btrfs: add support for passing in write hints for buffered writes
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (8 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 09/11] xfs: " Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-17 19:59 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
  10 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d3619e010005..92e75e4014ea 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2826,6 +2826,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 	bio_add_page(bio, page, page_size, offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
+	op_flags |= write_hint_to_opf(inode_write_hint(page->mapping->host));
 	bio_set_op_attrs(bio, op, op_flags);
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/11] nvme: add support for streams and directives
  2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
                   ` (9 preceding siblings ...)
  2017-06-17 19:59 ` [PATCH 10/11] btrfs: " Jens Axboe
@ 2017-06-17 19:59 ` Jens Axboe
  2017-06-19  6:35   ` Christoph Hellwig
  10 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-17 19:59 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

This adds support for Directives in NVMe, particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that it the device can do so most effiently. If an application is
managing and writing data with different life times, mixing differently
retentioned data onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and life
time of the device.

We default to allocating 4 streams, controller wide, so we can use them
on all name spaces. This is configurable with the 'streams' module
parameter. If a write stream is set in a write, flag is as such before
sending it to the device.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/core.c | 170 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h |   4 ++
 include/linux/nvme.h     |  48 +++++++++++++
 3 files changed, 217 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 903d5813023a..637e9514b406 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -65,6 +65,10 @@ static bool force_apst;
 module_param(force_apst, bool, 0644);
 MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
 
+static char streams = 4;
+module_param(streams, byte, 0644);
+MODULE_PARM_DESC(streams, "if available, allocate this many streams");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -330,10 +334,125 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 
 	return BLK_MQ_RQ_QUEUE_OK;
 }
+/*
+
+ * Returns number of streams allocated for use by, or -1 on error.
+ */
+static int nvme_streams_allocate(struct nvme_ctrl *ctrl, unsigned int nstreams)
+{
+	struct nvme_command c;
+	union nvme_result res;
+	int ret;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_recv;
+	c.directive.nsid = cpu_to_le32(0xffffffff);
+	c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+	c.directive.dtype = NVME_DIR_STREAMS;
+	c.directive.endir = nstreams;
+
+	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res, NULL, 0, 0,
+					NVME_QID_ANY, 0, 0);
+	if (ret)
+		return -1;
+
+	return le32_to_cpu(res.u32) & 0xffff;
+}
+
+static int nvme_enable_streams(struct nvme_ctrl *ctrl)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+
+	c.directive.opcode = nvme_admin_directive_send;
+	c.directive.nsid = cpu_to_le32(0xffffffff);
+	c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+	c.directive.dtype = NVME_DIR_IDENTIFY;
+	c.directive.tdtype = NVME_DIR_STREAMS;
+	c.directive.endir = NVME_DIR_ENDIR;
+
+	return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_get_stream_params(struct nvme_ctrl *ctrl,
+				  struct streams_directive_params *s, u32 nsid)
+{
+	struct nvme_command c;
+
+	memset(&c, 0, sizeof(c));
+	memset(s, 0, sizeof(*s));
+
+	c.directive.opcode = nvme_admin_directive_recv;
+	c.directive.nsid = cpu_to_le32(nsid);
+	c.directive.numd = sizeof(*s);
+	c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+	c.directive.dtype = NVME_DIR_STREAMS;
+
+	return nvme_submit_sync_cmd(ctrl->admin_q, &c, s, sizeof(*s));
+}
+
+static int nvme_setup_directives(struct nvme_ctrl *ctrl)
+{
+	struct streams_directive_params s;
+	int ret;
+
+	if (!(ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+		return 0;
+	if (!streams)
+		return 0;
+
+	ret = nvme_enable_streams(ctrl);
+	if (ret)
+		return ret;
+
+	ret = nvme_get_stream_params(ctrl, &s, 0xffffffff);
+	if (ret)
+		return ret;
+
+	ctrl->nssa = le16_to_cpu(s.nssa);
+
+	ret = nvme_streams_allocate(ctrl, min_t(unsigned, streams, ctrl->nssa));
+	if (ret < 0)
+		return ret;
+
+	ctrl->nr_streams = ret;
+	dev_info(ctrl->device, "successfully enabled %d streams\n", ret);
+	return 0;
+}
+
+/*
+ * Check if 'req' has a write hint associated with it. If it does, assign
+ * a valid namespace stream to the write. If we haven't setup streams yet,
+ * kick off configuration and ignore the hints until that has completed.
+ */
+static void nvme_assign_write_stream(struct nvme_ctrl *ctrl,
+				     struct request *req, u16 *control,
+				     u32 *dsmgmt)
+{
+	enum rw_hint streamid;
+
+	streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
+			>> __REQ_WRITE_HINT_SHIFT;
+	if (streamid == WRITE_LIFE_NONE)
+		return;
+
+	/* for now just round-robin, do something more clever later */
+	if (streamid > ctrl->nr_streams)
+		streamid = (streamid % ctrl->nr_streams) + 1;
+
+	if (streamid < ARRAY_SIZE(req->q->write_hints))
+		req->q->write_hints[streamid] += blk_rq_bytes(req) >> 9;
+
+	*control |= NVME_RW_DTYPE_STREAMS;
+	*dsmgmt |= streamid << 16;
+}
 
 static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *cmnd)
 {
+	struct nvme_ctrl *ctrl = ns->ctrl;
 	u16 control = 0;
 	u32 dsmgmt = 0;
 
@@ -351,6 +470,9 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
 	cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
 	cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
+	if (req_op(req) == REQ_OP_WRITE && ctrl->nr_streams)
+		nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);
+
 	if (ns->ms) {
 		switch (ns->pi_type) {
 		case NVME_NS_DPS_PI_TYPE3:
@@ -985,14 +1107,23 @@ static void nvme_init_integrity(struct nvme_ns *ns)
 
 static void nvme_config_discard(struct nvme_ns *ns)
 {
-	struct nvme_ctrl *ctrl = ns->ctrl;
 	u32 logical_block_size = queue_logical_block_size(ns->queue);
+	struct nvme_ctrl *ctrl = ns->ctrl;
 
 	BUILD_BUG_ON(PAGE_SIZE / sizeof(struct nvme_dsm_range) <
 			NVME_DSM_MAX_RANGES);
 
-	ns->queue->limits.discard_alignment = logical_block_size;
-	ns->queue->limits.discard_granularity = logical_block_size;
+	if (ctrl->nr_streams && ns->sws && ns->sgs) {
+		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
+
+		ns->queue->limits.discard_alignment = sz;
+		ns->queue->limits.discard_granularity = sz;
+	} else {
+		u32 logical_block_size = queue_logical_block_size(ns->queue);
+
+		ns->queue->limits.discard_alignment = logical_block_size;
+		ns->queue->limits.discard_granularity = logical_block_size;
+	}
 	blk_queue_max_discard_sectors(ns->queue, UINT_MAX);
 	blk_queue_max_discard_segments(ns->queue, NVME_DSM_MAX_RANGES);
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, ns->queue);
@@ -1024,6 +1155,7 @@ static int nvme_revalidate_ns(struct nvme_ns *ns, struct nvme_id_ns **id)
 static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 {
 	struct nvme_ns *ns = disk->private_data;
+	struct nvme_ctrl *ctrl = ns->ctrl;
 	u16 bs;
 
 	/*
@@ -1037,7 +1169,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 
 	blk_mq_freeze_queue(disk->queue);
 
-	if (ns->ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
+	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
 	blk_queue_logical_block_size(ns->queue, bs);
 	if (ns->ms && !blk_get_integrity(disk) && !ns->ext)
@@ -1047,7 +1179,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	else
 		set_capacity(disk, le64_to_cpup(&id->nsze) << (ns->lba_shift - 9));
 
-	if (ns->ctrl->oncs & NVME_CTRL_ONCS_DSM)
+	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
 		nvme_config_discard(ns);
 	blk_mq_unfreeze_queue(disk->queue);
 }
@@ -1650,6 +1782,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		dev_pm_qos_hide_latency_tolerance(ctrl->device);
 
 	nvme_configure_apst(ctrl);
+	nvme_setup_directives(ctrl);
 
 	ctrl->identified = true;
 
@@ -2019,6 +2152,32 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	return ret;
 }
 
+static int nvme_setup_streams_ns(struct nvme_ctrl *ctrl, struct nvme_ns *ns)
+{
+	struct streams_directive_params s;
+	int ret;
+
+	if (!ctrl->nr_streams)
+		return 0;
+
+	ret = nvme_get_stream_params(ctrl, &s, ns->ns_id);
+	if (ret)
+		return ret;
+
+	ns->sws = le32_to_cpu(s.sws);
+	ns->sgs = le16_to_cpu(s.sgs);
+
+	if (ns->sws) {
+		unsigned int bs = 1 << ns->lba_shift;
+
+		blk_queue_io_min(ns->queue, bs * ns->sws);
+		if (ns->sgs)
+			blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);
+	}
+
+	return 0;
+}
+
 static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 {
 	struct nvme_ns *ns;
@@ -2048,6 +2207,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
 	nvme_set_queue_limits(ctrl, ns->queue);
+	nvme_setup_streams_ns(ctrl, ns);
 
 	sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->instance);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..533f86acd961 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -147,6 +147,8 @@ struct nvme_ctrl {
 	u16 oncs;
 	u16 vid;
 	u16 oacs;
+	u16 nssa;
+	u16 nr_streams;
 	atomic_t abort_limit;
 	u8 event_limit;
 	u8 vwc;
@@ -194,6 +196,8 @@ struct nvme_ns {
 	unsigned ns_id;
 	int lba_shift;
 	u16 ms;
+	u16 sgs;
+	u32 sws;
 	bool ext;
 	u8 pi_type;
 	unsigned long flags;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..8b2f5b140134 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -245,6 +245,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
 	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 7,
 };
 
@@ -295,6 +296,19 @@ enum {
 };
 
 enum {
+	NVME_DIR_IDENTIFY		= 0x00,
+	NVME_DIR_STREAMS		= 0x01,
+	NVME_DIR_SND_ID_OP_ENABLE	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_ID	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_RSC	= 0x02,
+	NVME_DIR_RCV_ID_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_STATUS	= 0x02,
+	NVME_DIR_RCV_ST_OP_RESOURCE	= 0x03,
+	NVME_DIR_ENDIR			= 0x01,
+};
+
+enum {
 	NVME_NS_FEAT_THIN	= 1 << 0,
 	NVME_NS_FLBAS_LBA_MASK	= 0xf,
 	NVME_NS_FLBAS_META_EXT	= 0x10,
@@ -535,6 +549,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_APP	= 1 << 11,
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
+	NVME_RW_DTYPE_STREAMS		= 1 << 4,
 };
 
 struct nvme_dsm_cmd {
@@ -604,6 +619,8 @@ enum nvme_admin_opcode {
 	nvme_admin_download_fw		= 0x11,
 	nvme_admin_ns_attach		= 0x15,
 	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_directive_send	= 0x19,
+	nvme_admin_directive_recv	= 0x1a,
 	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
@@ -756,6 +773,24 @@ struct nvme_get_log_page_command {
 	__u32			rsvd14[2];
 };
 
+struct nvme_directive_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__le32			numd;
+	__u8			doper;
+	__u8			dtype;
+	__le16			dspec;
+	__u8			endir;
+	__u8			tdtype;
+	__u16			rsvd15;
+
+	__u32			rsvd16[3];
+};
+
 /*
  * Fabrics subcommands.
  */
@@ -886,6 +921,18 @@ struct nvme_dbbuf {
 	__u32			rsvd12[6];
 };
 
+struct streams_directive_params {
+	__u16	msl;
+	__u16	nssa;
+	__u16	nsso;
+	__u8	rsvd[10];
+	__u32	sws;
+	__u16	sgs;
+	__u16	nsa;
+	__u16	nso;
+	__u8	rsvd2[6];
+};
+
 struct nvme_command {
 	union {
 		struct nvme_common_command common;
@@ -906,6 +953,7 @@ struct nvme_command {
 		struct nvmf_property_set_command prop_set;
 		struct nvmf_property_get_command prop_get;
 		struct nvme_dbbuf dbbuf;
+		struct nvme_directive_cmd directive;
 	};
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] fs: add support for an inode to carry write hint related data
  2017-06-17 19:59 ` [PATCH 01/11] fs: add support for an inode to carry write hint related data Jens Axboe
@ 2017-06-19  6:26   ` Christoph Hellwig
  2017-06-19 14:55     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19  6:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen

> +/*
> + * Write life time hint values.
> + */
> +enum rw_hint {
> +	WRITE_LIFE_NONE = 0,
> +	WRITE_LIFE_SHORT,
> +	WRITE_LIFE_MEDIUM,
> +	WRITE_LIFE_LONG,
> +	WRITE_LIFE_EXTREME,
> +};
> +
> +#define RW_HINT_MASK		0x7	/* 3 bits */

FYI, exposing enums in a uapi is always a bit problematic, due to
different ABI rules.  It might be better to make these explicit defines
at least for the uapi.

Btw, I think it might make sense to merge this with patch 5.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-17 19:59 ` [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
@ 2017-06-19  6:27   ` Christoph Hellwig
  2017-06-19 14:56     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19  6:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen

On Sat, Jun 17, 2017 at 01:59:47PM -0600, Jens Axboe wrote:
> Add four flags for the pwritev2(2) system call, allowing an application
> to give the kernel a hint about what on-media life times can be
> expected from a given write.
> 
> The intent is for these values to be relative to each other, no
> absolute meaning should be attached to these flag names.
> 
> Set aside 3 bits in the iocb flags structure to carry this information
> over from the pwritev2 RWF_WRITE_LIFE_* flags.

What is the strong use case for the per-I/O flags?  I'd much rather
stick to fcntl only for now if we can.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/11] fs: add fcntl() interface for setting/getting write life time hints
  2017-06-17 19:59 ` [PATCH 05/11] fs: add fcntl() interface for setting/getting " Jens Axboe
@ 2017-06-19  6:28   ` Christoph Hellwig
  2017-06-19 14:57     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19  6:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen

On Sat, Jun 17, 2017 at 01:59:48PM -0600, Jens Axboe wrote:
> We have a pwritev2(2) interface based on passing in flags. Add an
> fcntl interface for querying these flags, and also for setting them
> as well:
> 
> F_GET_RW_HINT		Returns the read/write hint set. Right now it
> 			will be one of the WRITE_LIFE_* values.
> 
> F_SET_RW_HINT		Pass in rw_hint type to set the read/write hint.
> 			Only WRITE_LIFE_* hints are currently supported.
> 			Returns 0 on succes, -1 otherwise.
> 
> Sample program testing/implementing basic setting/getting of write
> hints is below.
> 
> /*
>  * writehint.c: check or set a file/inode write hint
>  */
> 
> static char *str[] = { "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT",
> 			"WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG",
> 			"WRITE_LIFE_EXTREME" };
> 
> int main(int argc, char *argv[])
> {
> 	int hint = -1, fd, ret;
> 
> 	if (argc < 2) {
> 		fprintf(stderr, "%s: dev <hint>\n", argv[0]);
> 		return 1;
> 	}
> 
> 	fd = open(argv[1], O_RDONLY);
> 	if (fd < 0) {
> 		perror("open");
> 		return 2;
> 	}
> 
> 	if (argc > 2)
> 		hint = atoi(argv[2]);
> 
> 	if (hint == -1) {
> 		ret = fcntl(fd, F_GET_RW_HINT);
> 		if (ret < 0) {
> 			perror("fcntl: F_GET_RW_HINT");
> 			return 3;
> 		}
> 		hint = ret;
> 	} else {
> 		ret = fcntl(fd, F_SET_RW_HINT, hint);
> 		if (ret < 0) {
> 			perror("fcntl: F_SET_RW_HINT");
> 			return 4;
> 		}
> 	}
> 
> 	printf("%s: %shint %s\n", argv[1], hint != -1 ? "set " : "", str[hint]);
> 	close(fd);
> 	return 0;
> }
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/fcntl.c                 | 38 ++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fcntl.h |  6 ++++++
>  2 files changed, 44 insertions(+)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index f4e7267d117f..417ce336c875 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -243,6 +243,40 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
>  }
>  #endif
>  
> +long fcntl_rw_hint(struct file *file, unsigned int cmd, unsigned long arg)

The unsigned long arg is a little annoying because it will de different
size for 32-bit vs 64-vit architectures.

How about just doing a get_user and use a fixed-size 64-bit value?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/11] fs: add O_DIRECT support for sending down write life time hints
  2017-06-17 19:59 ` [PATCH 06/11] fs: add O_DIRECT support for sending down " Jens Axboe
@ 2017-06-19  6:28   ` Christoph Hellwig
  2017-06-19 14:57     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19  6:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen

On Sat, Jun 17, 2017 at 01:59:49PM -0600, Jens Axboe wrote:
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/block_dev.c | 2 ++
>  fs/direct-io.c | 2 ++
>  fs/iomap.c     | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 519599dddd36..b9222a9d285e 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  			should_dirty = true;
>  	} else {
>  		bio.bi_opf = dio_bio_write_op(iocb);
> +		bio.bi_opf |= write_hint_to_opf(iocb_write_hint(iocb));

Move this into dio_bio_write_op instead of duplicating it?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] nvme: add support for streams and directives
  2017-06-17 19:59 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
@ 2017-06-19  6:35   ` Christoph Hellwig
  2017-06-19 15:04     ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19  6:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, linux-block, adilger, hch, martin.petersen

Can you add linux-nvme for the next repost?

As said before I think we should rely on implicit streams allocation,
as that will make the whole patch a lot simpler, and it solves the issue
that your current patch will take away your 4 streams from the general
pool on every controller that supports streams, which isn't optimal.

> +	streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
> +			>> __REQ_WRITE_HINT_SHIFT;

Can we add a helper to convert the REQ_* flags back to the hint next
to where we have the helper to do the reverse one?

> +	if (streamid == WRITE_LIFE_NONE)
> +		return;
> +
> +	/* for now just round-robin, do something more clever later */
> +	if (streamid > ctrl->nr_streams)
> +		streamid = (streamid % ctrl->nr_streams) + 1;

I thought you were going to fix that to do more intelligent collapsing?

> -	ns->queue->limits.discard_granularity = logical_block_size;
> +	if (ctrl->nr_streams && ns->sws && ns->sgs) {
> +		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
> +
> +		ns->queue->limits.discard_alignment = sz;

I don't think this is correct:

"Data that is aligned to and in multiples of the Stream Write Size (SWS)
 provides optimal performance of the write commands to the controller.
 The Stream Granularity Size indicates the size of the media that is prepared
 as a unit for future allocation for write commands and is a multiple of the
 Stream Write Size. The controller may allocate and group together a stream
 in Stream Granularity Size (SGS) units. Refer to Figure 293."

So I think the ns->sws should go away.

> +		ns->queue->limits.discard_granularity = sz;
> +	} else {
> +		u32 logical_block_size = queue_logical_block_size(ns->queue);

I think we already have a logical_block_size with the same value in
scope here.

> +
> +	ns->sws = le32_to_cpu(s.sws);
> +	ns->sgs = le16_to_cpu(s.sgs);
> +
> +	if (ns->sws) {
> +		unsigned int bs = 1 << ns->lba_shift;
> +
> +		blk_queue_io_min(ns->queue, bs * ns->sws);
> +		if (ns->sgs)
> +			blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);

drop the multiplication with ns->sws here as well.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/11] fs: add support for an inode to carry write hint related data
  2017-06-19  6:26   ` Christoph Hellwig
@ 2017-06-19 14:55     ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 14:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 12:26 AM, Christoph Hellwig wrote:
>> +/*
>> + * Write life time hint values.
>> + */
>> +enum rw_hint {
>> +	WRITE_LIFE_NONE = 0,
>> +	WRITE_LIFE_SHORT,
>> +	WRITE_LIFE_MEDIUM,
>> +	WRITE_LIFE_LONG,
>> +	WRITE_LIFE_EXTREME,
>> +};
>> +
>> +#define RW_HINT_MASK		0x7	/* 3 bits */
> 
> FYI, exposing enums in a uapi is always a bit problematic, due to
> different ABI rules.  It might be better to make these explicit defines
> at least for the uapi.
> 
> Btw, I think it might make sense to merge this with patch 5.

OK, I'll change the uapi to be defines.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19  6:27   ` Christoph Hellwig
@ 2017-06-19 14:56     ` Jens Axboe
  2017-06-19 16:02       ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 14:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 12:27 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 01:59:47PM -0600, Jens Axboe wrote:
>> Add four flags for the pwritev2(2) system call, allowing an application
>> to give the kernel a hint about what on-media life times can be
>> expected from a given write.
>>
>> The intent is for these values to be relative to each other, no
>> absolute meaning should be attached to these flag names.
>>
>> Set aside 3 bits in the iocb flags structure to carry this information
>> over from the pwritev2 RWF_WRITE_LIFE_* flags.
> 
> What is the strong use case for the per-I/O flags?  I'd much rather
> stick to fcntl only for now if we can.

Fine, I guess I should just have dusted off the 2 year old patchset,
as that was _exactly_ what that did.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/11] fs: add fcntl() interface for setting/getting write life time hints
  2017-06-19  6:28   ` Christoph Hellwig
@ 2017-06-19 14:57     ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 14:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 12:28 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 01:59:48PM -0600, Jens Axboe wrote:
>> We have a pwritev2(2) interface based on passing in flags. Add an
>> fcntl interface for querying these flags, and also for setting them
>> as well:
>>
>> F_GET_RW_HINT		Returns the read/write hint set. Right now it
>> 			will be one of the WRITE_LIFE_* values.
>>
>> F_SET_RW_HINT		Pass in rw_hint type to set the read/write hint.
>> 			Only WRITE_LIFE_* hints are currently supported.
>> 			Returns 0 on succes, -1 otherwise.
>>
>> Sample program testing/implementing basic setting/getting of write
>> hints is below.
>>
>> /*
>>  * writehint.c: check or set a file/inode write hint
>>  */
>>
>> static char *str[] = { "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT",
>> 			"WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG",
>> 			"WRITE_LIFE_EXTREME" };
>>
>> int main(int argc, char *argv[])
>> {
>> 	int hint = -1, fd, ret;
>>
>> 	if (argc < 2) {
>> 		fprintf(stderr, "%s: dev <hint>\n", argv[0]);
>> 		return 1;
>> 	}
>>
>> 	fd = open(argv[1], O_RDONLY);
>> 	if (fd < 0) {
>> 		perror("open");
>> 		return 2;
>> 	}
>>
>> 	if (argc > 2)
>> 		hint = atoi(argv[2]);
>>
>> 	if (hint == -1) {
>> 		ret = fcntl(fd, F_GET_RW_HINT);
>> 		if (ret < 0) {
>> 			perror("fcntl: F_GET_RW_HINT");
>> 			return 3;
>> 		}
>> 		hint = ret;
>> 	} else {
>> 		ret = fcntl(fd, F_SET_RW_HINT, hint);
>> 		if (ret < 0) {
>> 			perror("fcntl: F_SET_RW_HINT");
>> 			return 4;
>> 		}
>> 	}
>>
>> 	printf("%s: %shint %s\n", argv[1], hint != -1 ? "set " : "", str[hint]);
>> 	close(fd);
>> 	return 0;
>> }
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/fcntl.c                 | 38 ++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/fcntl.h |  6 ++++++
>>  2 files changed, 44 insertions(+)
>>
>> diff --git a/fs/fcntl.c b/fs/fcntl.c
>> index f4e7267d117f..417ce336c875 100644
>> --- a/fs/fcntl.c
>> +++ b/fs/fcntl.c
>> @@ -243,6 +243,40 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
>>  }
>>  #endif
>>  
>> +long fcntl_rw_hint(struct file *file, unsigned int cmd, unsigned long arg)
> 
> The unsigned long arg is a little annoying because it will de different
> size for 32-bit vs 64-vit architectures.
> 
> How about just doing a get_user and use a fixed-size 64-bit value?

It's just passing it in from fcnt(), my intent was for it to be a u32 on the
uapi side. But we can make it a 64-bit type, since we're going away from
the enum and making it a proper type.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/11] fs: add O_DIRECT support for sending down write life time hints
  2017-06-19  6:28   ` Christoph Hellwig
@ 2017-06-19 14:57     ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 14:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 12:28 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 01:59:49PM -0600, Jens Axboe wrote:
>> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/block_dev.c | 2 ++
>>  fs/direct-io.c | 2 ++
>>  fs/iomap.c     | 1 +
>>  3 files changed, 5 insertions(+)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 519599dddd36..b9222a9d285e 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>>  			should_dirty = true;
>>  	} else {
>>  		bio.bi_opf = dio_bio_write_op(iocb);
>> +		bio.bi_opf |= write_hint_to_opf(iocb_write_hint(iocb));
> 
> Move this into dio_bio_write_op instead of duplicating it?

Good cleanup, will do.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 11/11] nvme: add support for streams and directives
  2017-06-19  6:35   ` Christoph Hellwig
@ 2017-06-19 15:04     ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 15:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 12:35 AM, Christoph Hellwig wrote:
> Can you add linux-nvme for the next repost?
> 
> As said before I think we should rely on implicit streams allocation,
> as that will make the whole patch a lot simpler, and it solves the issue
> that your current patch will take away your 4 streams from the general
> pool on every controller that supports streams, which isn't optimal.

The only thing it'll change in the patch is the removal of
nvme_streams_allocate(), which is 20 lines of code. So I don't
think it'll simplify things a lot. The patch is already very simple.
But if we don't need to allocate the streams, then of course it should
just go.

>> +	streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
>> +			>> __REQ_WRITE_HINT_SHIFT;
> 
> Can we add a helper to convert the REQ_* flags back to the hint next
> to where we have the helper to do the reverse one?

OK, will od.

>> +	if (streamid == WRITE_LIFE_NONE)
>> +		return;
>> +
>> +	/* for now just round-robin, do something more clever later */
>> +	if (streamid > ctrl->nr_streams)
>> +		streamid = (streamid % ctrl->nr_streams) + 1;
> 
> I thought you were going to fix that to do more intelligent collapsing?

Sure, if you want me to do that now, I can. I propose:

- With 4 streams, direct mapping.
- With 3 streams, collapse two longest life times.
- With 2 streams, collapse short+medium, and long+extreme.
- With 1 stream, don't use streams.

We could potentially still use streams with just 1 stream, since we
have the default of no stream as well. But I don't think it makes
sense at that point.

> 
>> -	ns->queue->limits.discard_granularity = logical_block_size;
>> +	if (ctrl->nr_streams && ns->sws && ns->sgs) {
>> +		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
>> +
>> +		ns->queue->limits.discard_alignment = sz;
> 
> I don't think this is correct:
> 
> "Data that is aligned to and in multiples of the Stream Write Size (SWS)
>  provides optimal performance of the write commands to the controller.
>  The Stream Granularity Size indicates the size of the media that is prepared
>  as a unit for future allocation for write commands and is a multiple of the
>  Stream Write Size. The controller may allocate and group together a stream
>  in Stream Granularity Size (SGS) units. Refer to Figure 293."
> 
> So I think the ns->sws should go away.

The SGS value is in units of SWS, however.

>> +		ns->queue->limits.discard_granularity = sz;
>> +	} else {
>> +		u32 logical_block_size = queue_logical_block_size(ns->queue);
> 
> I think we already have a logical_block_size with the same value in
> scope here.

Oops yes.

>> +	ns->sws = le32_to_cpu(s.sws);
>> +	ns->sgs = le16_to_cpu(s.sgs);
>> +
>> +	if (ns->sws) {
>> +		unsigned int bs = 1 << ns->lba_shift;
>> +
>> +		blk_queue_io_min(ns->queue, bs * ns->sws);
>> +		if (ns->sgs)
>> +			blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);
> 
> drop the multiplication with ns->sws here as well.

See above.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19 14:56     ` Jens Axboe
@ 2017-06-19 16:02       ` Jens Axboe
  2017-06-19 18:58         ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 16:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 08:56 AM, Jens Axboe wrote:
> On 06/19/2017 12:27 AM, Christoph Hellwig wrote:
>> On Sat, Jun 17, 2017 at 01:59:47PM -0600, Jens Axboe wrote:
>>> Add four flags for the pwritev2(2) system call, allowing an application
>>> to give the kernel a hint about what on-media life times can be
>>> expected from a given write.
>>>
>>> The intent is for these values to be relative to each other, no
>>> absolute meaning should be attached to these flag names.
>>>
>>> Set aside 3 bits in the iocb flags structure to carry this information
>>> over from the pwritev2 RWF_WRITE_LIFE_* flags.
>>
>> What is the strong use case for the per-I/O flags?  I'd much rather
>> stick to fcntl only for now if we can.
> 
> Fine, I guess I should just have dusted off the 2 year old patchset,
> as that was _exactly_ what that did.

Actually, one good use case is O_DIRECT on a block device. Since I'm
not a huge fan of having per-call hints that is only useful for a
single case, how about we add the hints to the struct file as well?
For buffered IO, just grab it from the inode. If we have a file
available, then that overrides the per-inode setting.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19 16:02       ` Jens Axboe
@ 2017-06-19 18:58         ` Christoph Hellwig
  2017-06-19 19:00           ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-19 18:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger, martin.petersen

On Mon, Jun 19, 2017 at 10:02:09AM -0600, Jens Axboe wrote:
> Actually, one good use case is O_DIRECT on a block device. Since I'm
> not a huge fan of having per-call hints that is only useful for a
> single case, how about we add the hints to the struct file as well?
> For buffered IO, just grab it from the inode. If we have a file
> available, then that overrides the per-inode setting.

Even for buffered I/O per-fіle would seem more useful to be honest.
For the buffer_head based file systems this could even be done fairly
easily.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19 18:58         ` Christoph Hellwig
@ 2017-06-19 19:00           ` Jens Axboe
  2017-06-19 19:10             ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 19:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 12:58 PM, Christoph Hellwig wrote:
> On Mon, Jun 19, 2017 at 10:02:09AM -0600, Jens Axboe wrote:
>> Actually, one good use case is O_DIRECT on a block device. Since I'm
>> not a huge fan of having per-call hints that is only useful for a
>> single case, how about we add the hints to the struct file as well?
>> For buffered IO, just grab it from the inode. If we have a file
>> available, then that overrides the per-inode setting.
> 
> Even for buffered I/O per-fіle would seem more useful to be honest.
> For the buffer_head based file systems this could even be done fairly
> easily.

If I add the per-file hint as well, then anywhere that has the file should
just grab it from there. Unless not set, then grab from inode.

That does raise an issue with the NONE hint being 0. We can tell right now
if NONE was set, or nothing was set. This becomes a problem if we want the
file hint to override the inode hint. Should probably just bump the values
up by one, so that NONE is 1, SHORT is 2, etc.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19 19:00           ` Jens Axboe
@ 2017-06-19 19:10             ` Jens Axboe
  2017-06-19 20:33               ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 19:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 01:00 PM, Jens Axboe wrote:
> On 06/19/2017 12:58 PM, Christoph Hellwig wrote:
>> On Mon, Jun 19, 2017 at 10:02:09AM -0600, Jens Axboe wrote:
>>> Actually, one good use case is O_DIRECT on a block device. Since I'm
>>> not a huge fan of having per-call hints that is only useful for a
>>> single case, how about we add the hints to the struct file as well?
>>> For buffered IO, just grab it from the inode. If we have a file
>>> available, then that overrides the per-inode setting.
>>
>> Even for buffered I/O per-fіle would seem more useful to be honest.
>> For the buffer_head based file systems this could even be done fairly
>> easily.
> 
> If I add the per-file hint as well, then anywhere that has the file should
> just grab it from there. Unless not set, then grab from inode.
> 
> That does raise an issue with the NONE hint being 0. We can tell right now
> if NONE was set, or nothing was set. This becomes a problem if we want the
> file hint to override the inode hint. Should probably just bump the values
> up by one, so that NONE is 1, SHORT is 2, etc.

Actually, we don't have to, as long as the file inherits the inode mask.
Then we can just use the file hint if it differs from the inode hint.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19 19:10             ` Jens Axboe
@ 2017-06-19 20:33               ` Jens Axboe
  2017-06-20  2:06                 ` Jens Axboe
  2017-06-20  8:57                 ` Christoph Hellwig
  0 siblings, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-19 20:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 01:10 PM, Jens Axboe wrote:
> On 06/19/2017 01:00 PM, Jens Axboe wrote:
>> On 06/19/2017 12:58 PM, Christoph Hellwig wrote:
>>> On Mon, Jun 19, 2017 at 10:02:09AM -0600, Jens Axboe wrote:
>>>> Actually, one good use case is O_DIRECT on a block device. Since I'm
>>>> not a huge fan of having per-call hints that is only useful for a
>>>> single case, how about we add the hints to the struct file as well?
>>>> For buffered IO, just grab it from the inode. If we have a file
>>>> available, then that overrides the per-inode setting.
>>>
>>> Even for buffered I/O per-fіle would seem more useful to be honest.
>>> For the buffer_head based file systems this could even be done fairly
>>> easily.
>>
>> If I add the per-file hint as well, then anywhere that has the file should
>> just grab it from there. Unless not set, then grab from inode.
>>
>> That does raise an issue with the NONE hint being 0. We can tell right now
>> if NONE was set, or nothing was set. This becomes a problem if we want the
>> file hint to override the inode hint. Should probably just bump the values
>> up by one, so that NONE is 1, SHORT is 2, etc.
> 
> Actually, we don't have to, as long as the file inherits the inode mask.
> Then we can just use the file hint if it differs from the inode hint.

That doesn't work, in case it's cleared, or for checking whether it has
been set or not. Oh well, I added a NOT_SET variant for this. See below
for an incremental that adds support for file write hints as well. Use
the file write hint, if we have it, otherwise use the inode provided
one.

Setting hints on a file propagates to the inode, only if the inode doesn't
currently have a hint set.


diff --git a/fs/fcntl.c b/fs/fcntl.c
index 113b78c11631..34ca821767a0 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -247,12 +247,16 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd,
 			  u64 __user *ptr)
 {
 	struct inode *inode = file_inode(file);
+	u64 old_hint, hint;
 	long ret = 0;
-	u64 hint;
 
 	switch (cmd) {
 	case F_GET_RW_HINT:
-		hint = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+		if (file->f_write_hint != WRITE_LIFE_NOT_SET)
+			hint = file->f_write_hint;
+		else
+			hint = mask_to_write_hint(inode->i_flags,
+							S_WRITE_LIFE_SHIFT);
 		if (put_user(hint, ptr))
 			ret = -EFAULT;
 		break;
@@ -267,7 +271,15 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd,
 		case WRITE_LIFE_MEDIUM:
 		case WRITE_LIFE_LONG:
 		case WRITE_LIFE_EXTREME:
-			inode_set_write_hint(inode, hint);
+			spin_lock(&file->f_lock);
+			file->f_write_hint = hint;
+			spin_unlock(&file->f_lock);
+
+			/* Only propagate hint to inode, if no hint is set */
+			old_hint = mask_to_write_hint(inode->i_flags,
+							S_WRITE_LIFE_SHIFT);
+			if (old_hint == WRITE_LIFE_NOT_SET)
+				inode_set_write_hint(inode, hint);
 			ret = 0;
 			break;
 		default:
diff --git a/fs/inode.c b/fs/inode.c
index defb015a2c6d..e4a4e123d52b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -134,7 +134,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
-	inode->i_flags = 0;
+	inode->i_flags = S_WRITE_LIFE_MASK;
 	atomic_set(&inode->i_count, 1);
 	inode->i_op = &empty_iops;
 	inode->i_fop = &no_open_fops;
diff --git a/fs/open.c b/fs/open.c
index cd0c5be8d012..3fe0c4aa7d27 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -759,6 +759,7 @@ static int do_dentry_open(struct file *f,
 	     likely(f->f_op->write || f->f_op->write_iter))
 		f->f_mode |= FMODE_CAN_WRITE;
 
+	f->f_write_hint = WRITE_LIFE_NOT_SET;
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8720251cc153..e81bdb8ec189 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -859,6 +859,7 @@ struct file {
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
+	unsigned int		f_write_hint;
 	atomic_long_t		f_count;
 	unsigned int 		f_flags;
 	fmode_t			f_mode;
@@ -1902,7 +1903,9 @@ enum rw_hint {
 	WRITE_LIFE_SHORT = RWH_WRITE_LIFE_SHORT,
 	WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
 	WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
-	WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME
+	WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
+
+	WRITE_LIFE_NOT_SET = 7,
 };
 
 static inline unsigned int write_hint_to_mask(enum rw_hint hint,
@@ -1917,12 +1920,25 @@ static inline enum rw_hint mask_to_write_hint(unsigned int mask,
 	return (mask >> shift) & 0x7;
 }
 
-static inline unsigned int inode_write_hint(struct inode *inode)
+static inline enum rw_hint inode_write_hint(struct inode *inode)
 {
-	if (inode)
-		return mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+	enum rw_hint ret = WRITE_LIFE_NONE;
 
-	return 0;
+	if (inode) {
+		ret = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+		if (ret == WRITE_LIFE_NOT_SET)
+			ret = WRITE_LIFE_NONE;
+	}
+
+	return ret;
+}
+
+static inline enum rw_hint file_write_hint(struct file *file)
+{
+	if (file->f_write_hint != WRITE_LIFE_NOT_SET)
+		return file->f_write_hint;
+
+	return inode_write_hint(file_inode(file));
 }
 
 /*
@@ -3097,9 +3113,7 @@ static inline bool io_is_direct(struct file *filp)
 
 static inline int iocb_flags(struct file *file)
 {
-	struct inode *inode = file_inode(file);
 	int res = 0;
-
 	if (file->f_flags & O_APPEND)
 		res |= IOCB_APPEND;
 	if (io_is_direct(file))
@@ -3108,13 +3122,8 @@ static inline int iocb_flags(struct file *file)
 		res |= IOCB_DSYNC;
 	if (file->f_flags & __O_SYNC)
 		res |= IOCB_SYNC;
-	if (mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
-		enum rw_hint hint;
-
-		hint = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
-		res |= write_hint_to_mask(hint, IOCB_WRITE_LIFE_SHIFT);
-	}
 
+	res |= write_hint_to_mask(file->f_write_hint, IOCB_WRITE_LIFE_SHIFT);
 	return res;
 }
 

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19 20:33               ` Jens Axboe
@ 2017-06-20  2:06                 ` Jens Axboe
  2017-06-20  8:57                 ` Christoph Hellwig
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-20  2:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/19/2017 02:33 PM, Jens Axboe wrote:
> On 06/19/2017 01:10 PM, Jens Axboe wrote:
>> On 06/19/2017 01:00 PM, Jens Axboe wrote:
>>> On 06/19/2017 12:58 PM, Christoph Hellwig wrote:
>>>> On Mon, Jun 19, 2017 at 10:02:09AM -0600, Jens Axboe wrote:
>>>>> Actually, one good use case is O_DIRECT on a block device. Since I'm
>>>>> not a huge fan of having per-call hints that is only useful for a
>>>>> single case, how about we add the hints to the struct file as well?
>>>>> For buffered IO, just grab it from the inode. If we have a file
>>>>> available, then that overrides the per-inode setting.
>>>>
>>>> Even for buffered I/O per-fіle would seem more useful to be honest.
>>>> For the buffer_head based file systems this could even be done fairly
>>>> easily.
>>>
>>> If I add the per-file hint as well, then anywhere that has the file should
>>> just grab it from there. Unless not set, then grab from inode.
>>>
>>> That does raise an issue with the NONE hint being 0. We can tell right now
>>> if NONE was set, or nothing was set. This becomes a problem if we want the
>>> file hint to override the inode hint. Should probably just bump the values
>>> up by one, so that NONE is 1, SHORT is 2, etc.
>>
>> Actually, we don't have to, as long as the file inherits the inode mask.
>> Then we can just use the file hint if it differs from the inode hint.
> 
> That doesn't work, in case it's cleared, or for checking whether it has
> been set or not. Oh well, I added a NOT_SET variant for this. See below
> for an incremental that adds support for file write hints as well. Use
> the file write hint, if we have it, otherwise use the inode provided
> one.
> 
> Setting hints on a file propagates to the inode, only if the inode doesn't
> currently have a hint set.

I didn't like the special 0x7 for NOT_SET since it hard codes the fact
that we currently use 3 bits, and since we have to initialize the inode
flags to that weird value. Since I sent out a v8 already, I'll just
point at the current branch;

http://git.kernel.dk/cgit/linux-block/log/?h=write-stream

The main change is using 0 as the NOT_SET value, and shifting everything
up by one. I think this is better, since we can also use that value to
clear hints down to the inode. Additionally, if we add more hints in
the future, it's much saner to retain '0' as the NOT_SET value, rather
than having a strange magic value for it.

Let me know what you think. As far as I'm concerned, the core API should
be ready now. For the NVMe bits, I'm fine with removing the stream
allocation.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-19 20:33               ` Jens Axboe
  2017-06-20  2:06                 ` Jens Axboe
@ 2017-06-20  8:57                 ` Christoph Hellwig
  2017-06-20 12:43                   ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-20  8:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger, martin.petersen

On Mon, Jun 19, 2017 at 02:33:41PM -0600, Jens Axboe wrote:
> That doesn't work, in case it's cleared, or for checking whether it has
> been set or not. Oh well, I added a NOT_SET variant for this. See below
> for an incremental that adds support for file write hints as well. Use
> the file write hint, if we have it, otherwise use the inode provided
> one.
> 
> Setting hints on a file propagates to the inode, only if the inode doesn't
> currently have a hint set.

Question:  IFF we can get the per-file hints to propagate to the
writeback code (which I think is pretty easy for ext4/xfs/bdev, not
sure about btrfs) do we even need the per-inode ones at all?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-20  8:57                 ` Christoph Hellwig
@ 2017-06-20 12:43                   ` Jens Axboe
  2017-06-20 12:43                     ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-20 12:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/20/2017 02:57 AM, Christoph Hellwig wrote:
> On Mon, Jun 19, 2017 at 02:33:41PM -0600, Jens Axboe wrote:
>> That doesn't work, in case it's cleared, or for checking whether it has
>> been set or not. Oh well, I added a NOT_SET variant for this. See below
>> for an incremental that adds support for file write hints as well. Use
>> the file write hint, if we have it, otherwise use the inode provided
>> one.
>>
>> Setting hints on a file propagates to the inode, only if the inode doesn't
>> currently have a hint set.
> 
> Question:  IFF we can get the per-file hints to propagate to the
> writeback code (which I think is pretty easy for ext4/xfs/bdev, not
> sure about btrfs) do we even need the per-inode ones at all?

How is that possible, the file could be long gone by the time
the writeback kicks in?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-20 12:43                   ` Jens Axboe
@ 2017-06-20 12:43                     ` Christoph Hellwig
  2017-06-20 12:45                       ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-20 12:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger, martin.petersen

On Tue, Jun 20, 2017 at 06:43:10AM -0600, Jens Axboe wrote:
> > Question:  IFF we can get the per-file hints to propagate to the
> > writeback code (which I think is pretty easy for ext4/xfs/bdev, not
> > sure about btrfs) do we even need the per-inode ones at all?
> 
> How is that possible, the file could be long gone by the time
> the writeback kicks in?

each time a page/buffer is dirtied we can propagate the hint.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-20 12:43                     ` Christoph Hellwig
@ 2017-06-20 12:45                       ` Jens Axboe
  2017-06-20 12:47                         ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-20 12:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/20/2017 06:43 AM, Christoph Hellwig wrote:
> On Tue, Jun 20, 2017 at 06:43:10AM -0600, Jens Axboe wrote:
>>> Question:  IFF we can get the per-file hints to propagate to the
>>> writeback code (which I think is pretty easy for ext4/xfs/bdev, not
>>> sure about btrfs) do we even need the per-inode ones at all?
>>
>> How is that possible, the file could be long gone by the time
>> the writeback kicks in?
> 
> each time a page/buffer is dirtied we can propagate the hint.

To the page? Where do you want to find room for that?


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-20 12:45                       ` Jens Axboe
@ 2017-06-20 12:47                         ` Christoph Hellwig
  2017-06-20 12:51                           ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-20 12:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger, martin.petersen

On Tue, Jun 20, 2017 at 06:45:45AM -0600, Jens Axboe wrote:
> To the page? Where do you want to find room for that?

In the page we don't.  In the buffer_head we have plenty (as will
the replacement for it if I ever get time to get back to that)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-20 12:47                         ` Christoph Hellwig
@ 2017-06-20 12:51                           ` Jens Axboe
  2017-06-20 12:56                             ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2017-06-20 12:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/20/2017 06:47 AM, Christoph Hellwig wrote:
> On Tue, Jun 20, 2017 at 06:45:45AM -0600, Jens Axboe wrote:
>> To the page? Where do you want to find room for that?
> 
> In the page we don't.  In the buffer_head we have plenty (as will
> the replacement for it if I ever get time to get back to that)

That sounds fragile... Is it really worth it to avoid stealing three
bits in the inode? Do we have buffer_heads attached everywhere, that
sounds surprising to me.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-20 12:51                           ` Jens Axboe
@ 2017-06-20 12:56                             ` Christoph Hellwig
  2017-06-20 13:00                               ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2017-06-20 12:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel, linux-block, adilger, martin.petersen

On Tue, Jun 20, 2017 at 06:51:34AM -0600, Jens Axboe wrote:
> That sounds fragile... Is it really worth it to avoid stealing three
> bits in the inode? Do we have buffer_heads attached everywhere, that
> sounds surprising to me.

I'm not concerned about saving a few bits - we'll use those elsewhere
anyway.  I'm concerned about settings on one file descriptor having
an effect on other fds that doesn't look intended.  It's not the
end of the world, as multiple writers to the same file should
better be aware of each other, but the semantics of setting these
on a per-inode level still stink.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
  2017-06-20 12:56                             ` Christoph Hellwig
@ 2017-06-20 13:00                               ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-20 13:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-block, adilger, martin.petersen

On 06/20/2017 06:56 AM, Christoph Hellwig wrote:
> On Tue, Jun 20, 2017 at 06:51:34AM -0600, Jens Axboe wrote:
>> That sounds fragile... Is it really worth it to avoid stealing three
>> bits in the inode? Do we have buffer_heads attached everywhere, that
>> sounds surprising to me.
> 
> I'm not concerned about saving a few bits - we'll use those elsewhere
> anyway.  I'm concerned about settings on one file descriptor having
> an effect on other fds that doesn't look intended.  It's not the
> end of the world, as multiple writers to the same file should
> better be aware of each other, but the semantics of setting these
> on a per-inode level still stink.

For O_DIRECT writes, the file level hints work perfectly fine. For
buffered writes, we need some way to defer retrieving this hint to
when the flush happens. Seems very sad and dated to rely on
storing these in buffer_heads, which we've been trying to kill
for more than a decade.

The per inode hint isn't perfect for multiple buffered writers in
a single file. While I think it'd be great to have something bullet
proof there, I also don't think it's a troublesome use cases for
streams.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 02/11] block: add support for write hints in a bio
  2017-06-16 17:24 [PATCHSET v6] Add support for write life time hints Jens Axboe
@ 2017-06-16 17:24 ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2017-06-16 17:24 UTC (permalink / raw)
  To: linux-fsdevel, linux-block; +Cc: adilger, hch, martin.petersen, Jens Axboe

No functional changes in this patch, we just set aside 3 bits
in the bio/request flags, which can be used to hold a WRITE_HINT_*
life time hint.

Ensure that we don't merge requests that have different life time
hints assigned to them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-merge.c         | 16 ++++++++++++++++
 include/linux/blk_types.h | 15 +++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..7d299df3b12b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -693,6 +693,14 @@ static struct request *attempt_merge(struct request_queue *q,
 		return NULL;
 
 	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if ((req->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+	    (next->cmd_flags & REQ_WRITE_LIFE_MASK))
+		return NULL;
+
+	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
 	 * will have updated segment counts, update sector
@@ -811,6 +819,14 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !blk_write_same_mergeable(rq->bio, bio))
 		return false;
 
+	/*
+	 * Don't allow merge of different streams, or for a stream with
+	 * non-stream IO.
+	 */
+	if ((rq->cmd_flags & REQ_WRITE_LIFE_MASK) !=
+	    (bio->bi_opf & REQ_WRITE_LIFE_MASK))
+		return false;
+
 	return true;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 61339bc44400..0dd8d569d52f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -7,6 +7,7 @@
 
 #include <linux/types.h>
 #include <linux/bvec.h>
+#include <linux/fs.h>
 
 struct bio_set;
 struct bio;
@@ -201,6 +202,9 @@ enum req_flag_bits {
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
+	__REQ_WRITE_HINT_SHIFT,	/* 3 bits for life time hint */
+	__REQ_WRITE_HINT_PAD1,
+	__REQ_WRITE_HINT_PAD2,
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
@@ -221,6 +225,12 @@ enum req_flag_bits {
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
+#define REQ_WRITE_SHORT		(WRITE_LIFE_SHORT << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_MEDIUM	(WRITE_LIFE_MEDIUM << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_LONG		(WRITE_LIFE_LONG << __REQ_WRITE_HINT_SHIFT)
+#define REQ_WRITE_EXTREME	(WRITE_LIFE_EXTREME << __REQ_WRITE_HINT_SHIFT)
+
+#define REQ_WRITE_LIFE_MASK	(0x7 << __REQ_WRITE_HINT_SHIFT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 
@@ -312,4 +322,9 @@ struct blk_rq_stat {
 	u64 batch;
 };
 
+static inline unsigned int write_hint_to_opf(enum rw_hint hint)
+{
+	return hint << __REQ_WRITE_HINT_SHIFT;
+}
+
 #endif /* __LINUX_BLK_TYPES_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-06-20 13:00 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-17 19:59 [PATCHSET v7] Add support for write life time hints Jens Axboe
2017-06-17 19:59 ` [PATCH 01/11] fs: add support for an inode to carry write hint related data Jens Axboe
2017-06-19  6:26   ` Christoph Hellwig
2017-06-19 14:55     ` Jens Axboe
2017-06-17 19:59 ` [PATCH 02/11] block: add support for write hints in a bio Jens Axboe
2017-06-17 19:59 ` [PATCH 03/11] blk-mq: expose stream write hints through debugfs Jens Axboe
2017-06-17 19:59 ` [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints Jens Axboe
2017-06-19  6:27   ` Christoph Hellwig
2017-06-19 14:56     ` Jens Axboe
2017-06-19 16:02       ` Jens Axboe
2017-06-19 18:58         ` Christoph Hellwig
2017-06-19 19:00           ` Jens Axboe
2017-06-19 19:10             ` Jens Axboe
2017-06-19 20:33               ` Jens Axboe
2017-06-20  2:06                 ` Jens Axboe
2017-06-20  8:57                 ` Christoph Hellwig
2017-06-20 12:43                   ` Jens Axboe
2017-06-20 12:43                     ` Christoph Hellwig
2017-06-20 12:45                       ` Jens Axboe
2017-06-20 12:47                         ` Christoph Hellwig
2017-06-20 12:51                           ` Jens Axboe
2017-06-20 12:56                             ` Christoph Hellwig
2017-06-20 13:00                               ` Jens Axboe
2017-06-17 19:59 ` [PATCH 05/11] fs: add fcntl() interface for setting/getting " Jens Axboe
2017-06-19  6:28   ` Christoph Hellwig
2017-06-19 14:57     ` Jens Axboe
2017-06-17 19:59 ` [PATCH 06/11] fs: add O_DIRECT support for sending down " Jens Axboe
2017-06-19  6:28   ` Christoph Hellwig
2017-06-19 14:57     ` Jens Axboe
2017-06-17 19:59 ` [PATCH 07/11] fs: add support for buffered writeback to pass down write hints Jens Axboe
2017-06-17 19:59 ` [PATCH 08/11] ext4: add support for passing in write hints for buffered writes Jens Axboe
2017-06-17 19:59 ` [PATCH 09/11] xfs: " Jens Axboe
2017-06-17 19:59 ` [PATCH 10/11] btrfs: " Jens Axboe
2017-06-17 19:59 ` [PATCH 11/11] nvme: add support for streams and directives Jens Axboe
2017-06-19  6:35   ` Christoph Hellwig
2017-06-19 15:04     ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2017-06-16 17:24 [PATCHSET v6] Add support for write life time hints Jens Axboe
2017-06-16 17:24 ` [PATCH 02/11] block: add support for write hints in a bio Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).