linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] fs,ext4,jbd2: Specifying write-hint for Ext4 journal
       [not found] <CGME20181210125245epcas2p202bc4336e91a55cee3786ed6abffea05@epcas2p2.samsung.com>
@ 2018-12-10 12:50 ` Kanchan Joshi
       [not found]   ` <CGME20181210125251epcas1p3023db72cc08d6b2f899141d551d53f61@epcas1p3.samsung.com>
       [not found]   ` <CGME20181210125256epcas1p1e4142cfefb9b21dfb8dad927fbd49143@epcas1p1.samsung.com>
  0 siblings, 2 replies; 15+ messages in thread
From: Kanchan Joshi @ 2018-12-10 12:50 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: tytso, adilger.kernel, jack, viro, darrick.wong, axboe,
	jrdr.linux, ebiggers, jooyoung.hwang, chur.lee, prakash.v,
	Kanchan Joshi

For NAND-based SSDs, mixing of data with different life-time reduces            
efficiency of garbage-collection. During FS operations, series of journal       
updates will follow/precede series of data/meta updates, causing intermixing    
inside SSD. By passing a write-hint (a.k.a stream) with journal, its writes     
can be isolated from other meta/data writes, leading to performance/endurance   
benefit on multi-stream SSD.                                                    
This is described in greater detail (along with results) in this "FAST 2018"    
paper - https://www.usenix.org/system/files/conference/fast18/fast18-rho.pdf    
                                                                                
This patch is split into two parts. First patch introduces APIs to send         
write-hint with buffer-head. Second one implement "journal_writehint" mount     
option (inspired from "journal_ioprio") in Ext4/JBD2.

Kanchan Joshi (2):
  fs: introduce APIs to enable sending write-hint with buffer-head
  fs/ext4,jbd2: Add support for passing write-hint with journal.

 fs/buffer.c                 | 21 +++++++++++++++++++++
 fs/ext4/super.c             | 33 +++++++++++++++++++++++++++------
 fs/jbd2/commit.c            | 11 +++++++----
 fs/jbd2/journal.c           |  2 +-
 fs/jbd2/revoke.c            |  2 +-
 include/linux/buffer_head.h |  3 +++
 include/linux/jbd2.h        |  7 +++++++
 7 files changed, 67 insertions(+), 12 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] fs: introduce APIs to enable sending write-hint with buffer-head
       [not found]   ` <CGME20181210125251epcas1p3023db72cc08d6b2f899141d551d53f61@epcas1p3.samsung.com>
@ 2018-12-10 12:50     ` Kanchan Joshi
  2018-12-10 13:49       ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Kanchan Joshi @ 2018-12-10 12:50 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: tytso, adilger.kernel, jack, viro, darrick.wong, axboe,
	jrdr.linux, ebiggers, jooyoung.hwang, chur.lee, prakash.v,
	Kanchan Joshi

submit_bh and write_dirty_buffer do not take write-hint as
parameter. This patch introduces variants which do.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/buffer.c                 | 21 +++++++++++++++++++++
 include/linux/buffer_head.h |  3 +++
 2 files changed, 24 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1286c2b..60c8867 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3094,6 +3094,13 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
 }
 EXPORT_SYMBOL(submit_bh);
 
+int submit_bh_write_hint(int op, int op_flags, struct buffer_head *bh,
+			 enum rw_hint hint)
+{
+	return submit_bh_wbc(op, op_flags, bh, hint, NULL);
+}
+EXPORT_SYMBOL(submit_bh_write_hint);
+
 /**
  * ll_rw_block: low-level access to block devices (DEPRECATED)
  * @op: whether to %READ or %WRITE
@@ -3162,6 +3169,20 @@ void write_dirty_buffer(struct buffer_head *bh, int op_flags)
 }
 EXPORT_SYMBOL(write_dirty_buffer);
 
+void write_dirty_buffer_write_hint(struct buffer_head *bh, int op_flags,
+				   enum rw_hint hint)
+{
+	lock_buffer(bh);
+	if (!test_clear_buffer_dirty(bh)) {
+		unlock_buffer(bh);
+		return;
+	}
+	bh->b_end_io = end_buffer_write_sync;
+	get_bh(bh);
+	submit_bh_wbc(REQ_OP_WRITE, op_flags, bh, hint, NULL);
+}
+EXPORT_SYMBOL(write_dirty_buffer_write_hint);
+
 /*
  * For a data-integrity writeout, we need to wait upon any in-progress I/O
  * and then start new I/O and then wait upon it.  The caller must have a ref on
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73ef7..ac9f111 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -200,7 +200,10 @@ void ll_rw_block(int, int, int, struct buffer_head * bh[]);
 int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, int op_flags);
 void write_dirty_buffer(struct buffer_head *bh, int op_flags);
+void write_dirty_buffer_write_hint(struct buffer_head *bh, int op_flags,
+				   enum rw_hint hint);
 int submit_bh(int, int, struct buffer_head *);
+int submit_bh_write_hint(int, int, struct buffer_head *, enum rw_hint hint);
 void write_boundary_block(struct block_device *bdev,
 			sector_t bblock, unsigned blocksize);
 int bh_uptodate_or_lock(struct buffer_head *bh);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
       [not found]   ` <CGME20181210125256epcas1p1e4142cfefb9b21dfb8dad927fbd49143@epcas1p1.samsung.com>
@ 2018-12-10 12:50     ` Kanchan Joshi
  2018-12-10 14:12       ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Kanchan Joshi @ 2018-12-10 12:50 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel
  Cc: tytso, adilger.kernel, jack, viro, darrick.wong, axboe,
	jrdr.linux, ebiggers, jooyoung.hwang, chur.lee, prakash.v,
	Kanchan Joshi

This patch introduces "j_writehint" in JBD2 journal,
which is set based by Ext4 depending on "journal_writehint"
mount option (inspired from "journal_ioprio").

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 fs/ext4/super.c      | 33 +++++++++++++++++++++++++++------
 fs/jbd2/commit.c     | 11 +++++++----
 fs/jbd2/journal.c    |  2 +-
 fs/jbd2/revoke.c     |  2 +-
 include/linux/jbd2.h |  7 +++++++
 5 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 53ff6c2..0283760 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1413,7 +1413,7 @@ enum {
 	Opt_nowarn_on_error, Opt_mblk_io_submit,
 	Opt_lazytime, Opt_nolazytime, Opt_debug_want_extra_isize,
 	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
-	Opt_inode_readahead_blks, Opt_journal_ioprio,
+	Opt_inode_readahead_blks, Opt_journal_ioprio, Opt_journal_writehint,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
@@ -1489,6 +1489,7 @@ static const match_table_t tokens = {
 	{Opt_noblock_validity, "noblock_validity"},
 	{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
 	{Opt_journal_ioprio, "journal_ioprio=%u"},
+	{Opt_journal_writehint, "journal_writehint=%u"},
 	{Opt_auto_da_alloc, "auto_da_alloc=%u"},
 	{Opt_auto_da_alloc, "auto_da_alloc"},
 	{Opt_noauto_da_alloc, "noauto_da_alloc"},
@@ -1677,6 +1678,7 @@ static const struct mount_opts {
 	{Opt_journal_dev, 0, MOPT_NO_EXT2 | MOPT_GTE0},
 	{Opt_journal_path, 0, MOPT_NO_EXT2 | MOPT_STRING},
 	{Opt_journal_ioprio, 0, MOPT_NO_EXT2 | MOPT_GTE0},
+	{Opt_journal_writehint, 0, MOPT_NO_EXT2 | MOPT_GTE0},
 	{Opt_data_journal, EXT4_MOUNT_JOURNAL_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
 	{Opt_data_ordered, EXT4_MOUNT_ORDERED_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
 	{Opt_data_writeback, EXT4_MOUNT_WRITEBACK_DATA,
@@ -1718,7 +1720,8 @@ static const struct mount_opts {
 
 static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			    substring_t *args, unsigned long *journal_devnum,
-			    unsigned int *journal_ioprio, int is_remount)
+			    unsigned int *journal_ioprio,
+			    enum rw_hint *journal_writehint, int is_remount)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	const struct mount_opts *m;
@@ -1897,6 +1900,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		}
 		*journal_ioprio =
 			IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, arg);
+	} else if (token == Opt_journal_writehint) {
+		if (arg < WRITE_LIFE_NOT_SET || arg > WRITE_LIFE_EXTREME) {
+			ext4_msg(sb, KERN_ERR, "Invalid journal write hint"
+				 " (must be 0-5)");
+			return -1;
+		}
+		*journal_writehint = arg;
 	} else if (token == Opt_test_dummy_encryption) {
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
 		sbi->s_mount_flags |= EXT4_MF_TEST_DUMMY_ENCRYPTION;
@@ -1970,6 +1980,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 static int parse_options(char *options, struct super_block *sb,
 			 unsigned long *journal_devnum,
 			 unsigned int *journal_ioprio,
+			 enum rw_hint *journal_writehint,
 			 int is_remount)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1990,7 +2001,8 @@ static int parse_options(char *options, struct super_block *sb,
 		args[0].to = args[0].from = NULL;
 		token = match_token(p, tokens, args);
 		if (handle_mount_opt(sb, p, token, args, journal_devnum,
-				     journal_ioprio, is_remount) < 0)
+				     journal_ioprio, journal_writehint,
+				     is_remount) < 0)
 			return 0;
 	}
 #ifdef CONFIG_QUOTA
@@ -3534,6 +3546,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	int err = 0;
 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
 	ext4_group_t first_not_zeroed;
+	enum rw_hint journal_writehint = WRITE_LIFE_NOT_SET;
 
 	if ((data && !orig_data) || !sbi)
 		goto out_free_base;
@@ -3694,7 +3707,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		if (!s_mount_opts)
 			goto failed_mount;
 		if (!parse_options(s_mount_opts, sb, &journal_devnum,
-				   &journal_ioprio, 0)) {
+				   &journal_ioprio, &journal_writehint, 0)) {
 			ext4_msg(sb, KERN_WARNING,
 				 "failed to parse options in superblock: %s",
 				 s_mount_opts);
@@ -3703,7 +3716,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	}
 	sbi->s_def_mount_opt = sbi->s_mount_opt;
 	if (!parse_options((char *) data, sb, &journal_devnum,
-			   &journal_ioprio, 0))
+			   &journal_ioprio, &journal_writehint, 0))
 		goto failed_mount;
 
 	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
@@ -4265,6 +4278,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
+	sbi->s_journal->j_writehint = journal_writehint;
+
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
 
 no_journal:
@@ -5120,6 +5135,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	int enable_quota = 0;
 	ext4_group_t g;
 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+	enum rw_hint journal_writehint = WRITE_LIFE_NOT_SET;
 	int err = 0;
 #ifdef CONFIG_QUOTA
 	int i, j;
@@ -5158,7 +5174,11 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	if (sbi->s_journal && sbi->s_journal->j_task->io_context)
 		journal_ioprio = sbi->s_journal->j_task->io_context->ioprio;
 
-	if (!parse_options(data, sb, NULL, &journal_ioprio, 1)) {
+	if (sbi->s_journal)
+		journal_writehint = sbi->s_journal->j_writehint;
+
+	if (!parse_options(data, sb, NULL, &journal_ioprio,
+			   &journal_writehint, 1)) {
 		err = -EINVAL;
 		goto restore_opts;
 	}
@@ -5221,6 +5241,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	if (sbi->s_journal) {
 		ext4_init_journal_params(sb, sbi->s_journal);
 		set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
+		sbi->s_journal->j_writehint = journal_writehint;
 	}
 
 	if (*flags & SB_LAZYTIME)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 150cc03..8710afe 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -153,10 +153,12 @@ static int journal_submit_commit_record(journal_t *journal,
 
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_has_feature_async_commit(journal))
-		ret = submit_bh(REQ_OP_WRITE,
-			REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
+		ret = submit_bh_write_hint(REQ_OP_WRITE,
+			REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh,
+			journal->j_writehint);
 	else
-		ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+		ret = submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
+					journal->j_writehint);
 
 	*cbh = bh;
 	return ret;
@@ -708,7 +710,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				clear_buffer_dirty(bh);
 				set_buffer_uptodate(bh);
 				bh->b_end_io = journal_end_buffer_io_sync;
-				submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+				submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
+						   journal->j_writehint);
 			}
 			cond_resched();
 			stats.run.rs_blocks_logged += bufs;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 8ef6b6d..f72ee4b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1384,7 +1384,7 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
 	jbd2_superblock_csum_set(journal, sb);
 	get_bh(bh);
 	bh->b_end_io = end_buffer_write_sync;
-	ret = submit_bh(REQ_OP_WRITE, write_flags, bh);
+	ret = submit_bh_write_hint(REQ_OP_WRITE, write_flags, bh, journal->j_writehint);
 	wait_on_buffer(bh);
 	if (buffer_write_io_error(bh)) {
 		clear_buffer_write_io_error(bh);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index a1143e5..d4fd178 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -642,7 +642,7 @@ static void flush_descriptor(journal_t *journal,
 	set_buffer_jwrite(descriptor);
 	BUFFER_TRACE(descriptor, "write");
 	set_buffer_dirty(descriptor);
-	write_dirty_buffer(descriptor, REQ_SYNC);
+	write_dirty_buffer_write_hint(descriptor, REQ_SYNC, journal->j_writehint);
 }
 #endif
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index b708e51..0735ab0 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1138,6 +1138,13 @@ struct journal_s
 	 */
 	__u32 j_csum_seed;
 
+	/**
+	 * @j_writehint:
+	 *
+	 * write hint for journal (set by fs).
+	 */
+	enum rw_hint j_writehint;
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	/**
 	 * @j_trans_commit_map:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fs: introduce APIs to enable sending write-hint with buffer-head
  2018-12-10 12:50     ` [PATCH 1/2] fs: introduce APIs to enable sending write-hint with buffer-head Kanchan Joshi
@ 2018-12-10 13:49       ` Jan Kara
  2018-12-11 11:57         ` Kanchan Joshi
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2018-12-10 13:49 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: linux-ext4, linux-fsdevel, tytso, adilger.kernel, jack, viro,
	darrick.wong, axboe, jrdr.linux, ebiggers, jooyoung.hwang,
	chur.lee, prakash.v

On Mon 10-12-18 18:20:03, Kanchan Joshi wrote:
> submit_bh and write_dirty_buffer do not take write-hint as
> parameter. This patch introduces variants which do.
> 
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>

...

> @@ -3162,6 +3169,20 @@ void write_dirty_buffer(struct buffer_head *bh, int op_flags)
>  }
>  EXPORT_SYMBOL(write_dirty_buffer);
>  
> +void write_dirty_buffer_write_hint(struct buffer_head *bh, int op_flags,
> +				   enum rw_hint hint)
> +{
> +	lock_buffer(bh);
> +	if (!test_clear_buffer_dirty(bh)) {
> +		unlock_buffer(bh);
> +		return;
> +	}
> +	bh->b_end_io = end_buffer_write_sync;
> +	get_bh(bh);
> +	submit_bh_wbc(REQ_OP_WRITE, op_flags, bh, hint, NULL);
> +}
> +EXPORT_SYMBOL(write_dirty_buffer_write_hint);
> +

Please implement write_dirty_buffer() as a call to
write_dirty_buffer_write_hint() so that we don't unnecessarily duplicate
the code. Otherwise the patch looks good.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-10 12:50     ` [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal Kanchan Joshi
@ 2018-12-10 14:12       ` Jan Kara
  2018-12-10 15:17         ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2018-12-10 14:12 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: linux-ext4, linux-fsdevel, tytso, adilger.kernel, jack, viro,
	darrick.wong, axboe, jrdr.linux, ebiggers, jooyoung.hwang,
	chur.lee, prakash.v

On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
> This patch introduces "j_writehint" in JBD2 journal,
> which is set based by Ext4 depending on "journal_writehint"
> mount option (inspired from "journal_ioprio").

Thanks for the patch! It would be good to provide the explanation you have
in the cover letter in this patch as well so that it gets recorded in git
logs.

Also I don't like the fact that users have to set the hint via a mount
option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
fs.h are generally supposed to be used by userspace so it's difficult to
pick anything if we don't know what the userspace is going to do. I'd argue
it's even difficult for the sysadmin to pick any good value even if he
actually knows that he might benefit from setting some. Jens, is there
some reasonable way for fs to automatically pick some stream value for its
journal?

								Honza

> 
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  fs/ext4/super.c      | 33 +++++++++++++++++++++++++++------
>  fs/jbd2/commit.c     | 11 +++++++----
>  fs/jbd2/journal.c    |  2 +-
>  fs/jbd2/revoke.c     |  2 +-
>  include/linux/jbd2.h |  7 +++++++
>  5 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 53ff6c2..0283760 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1413,7 +1413,7 @@ enum {
>  	Opt_nowarn_on_error, Opt_mblk_io_submit,
>  	Opt_lazytime, Opt_nolazytime, Opt_debug_want_extra_isize,
>  	Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
> -	Opt_inode_readahead_blks, Opt_journal_ioprio,
> +	Opt_inode_readahead_blks, Opt_journal_ioprio, Opt_journal_writehint,
>  	Opt_dioread_nolock, Opt_dioread_lock,
>  	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
>  	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> @@ -1489,6 +1489,7 @@ static const match_table_t tokens = {
>  	{Opt_noblock_validity, "noblock_validity"},
>  	{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
>  	{Opt_journal_ioprio, "journal_ioprio=%u"},
> +	{Opt_journal_writehint, "journal_writehint=%u"},
>  	{Opt_auto_da_alloc, "auto_da_alloc=%u"},
>  	{Opt_auto_da_alloc, "auto_da_alloc"},
>  	{Opt_noauto_da_alloc, "noauto_da_alloc"},
> @@ -1677,6 +1678,7 @@ static const struct mount_opts {
>  	{Opt_journal_dev, 0, MOPT_NO_EXT2 | MOPT_GTE0},
>  	{Opt_journal_path, 0, MOPT_NO_EXT2 | MOPT_STRING},
>  	{Opt_journal_ioprio, 0, MOPT_NO_EXT2 | MOPT_GTE0},
> +	{Opt_journal_writehint, 0, MOPT_NO_EXT2 | MOPT_GTE0},
>  	{Opt_data_journal, EXT4_MOUNT_JOURNAL_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
>  	{Opt_data_ordered, EXT4_MOUNT_ORDERED_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
>  	{Opt_data_writeback, EXT4_MOUNT_WRITEBACK_DATA,
> @@ -1718,7 +1720,8 @@ static const struct mount_opts {
>  
>  static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>  			    substring_t *args, unsigned long *journal_devnum,
> -			    unsigned int *journal_ioprio, int is_remount)
> +			    unsigned int *journal_ioprio,
> +			    enum rw_hint *journal_writehint, int is_remount)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
>  	const struct mount_opts *m;
> @@ -1897,6 +1900,13 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>  		}
>  		*journal_ioprio =
>  			IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, arg);
> +	} else if (token == Opt_journal_writehint) {
> +		if (arg < WRITE_LIFE_NOT_SET || arg > WRITE_LIFE_EXTREME) {
> +			ext4_msg(sb, KERN_ERR, "Invalid journal write hint"
> +				 " (must be 0-5)");
> +			return -1;
> +		}
> +		*journal_writehint = arg;
>  	} else if (token == Opt_test_dummy_encryption) {
>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
>  		sbi->s_mount_flags |= EXT4_MF_TEST_DUMMY_ENCRYPTION;
> @@ -1970,6 +1980,7 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>  static int parse_options(char *options, struct super_block *sb,
>  			 unsigned long *journal_devnum,
>  			 unsigned int *journal_ioprio,
> +			 enum rw_hint *journal_writehint,
>  			 int is_remount)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -1990,7 +2001,8 @@ static int parse_options(char *options, struct super_block *sb,
>  		args[0].to = args[0].from = NULL;
>  		token = match_token(p, tokens, args);
>  		if (handle_mount_opt(sb, p, token, args, journal_devnum,
> -				     journal_ioprio, is_remount) < 0)
> +				     journal_ioprio, journal_writehint,
> +				     is_remount) < 0)
>  			return 0;
>  	}
>  #ifdef CONFIG_QUOTA
> @@ -3534,6 +3546,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	int err = 0;
>  	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
>  	ext4_group_t first_not_zeroed;
> +	enum rw_hint journal_writehint = WRITE_LIFE_NOT_SET;
>  
>  	if ((data && !orig_data) || !sbi)
>  		goto out_free_base;
> @@ -3694,7 +3707,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  		if (!s_mount_opts)
>  			goto failed_mount;
>  		if (!parse_options(s_mount_opts, sb, &journal_devnum,
> -				   &journal_ioprio, 0)) {
> +				   &journal_ioprio, &journal_writehint, 0)) {
>  			ext4_msg(sb, KERN_WARNING,
>  				 "failed to parse options in superblock: %s",
>  				 s_mount_opts);
> @@ -3703,7 +3716,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	}
>  	sbi->s_def_mount_opt = sbi->s_mount_opt;
>  	if (!parse_options((char *) data, sb, &journal_devnum,
> -			   &journal_ioprio, 0))
> +			   &journal_ioprio, &journal_writehint, 0))
>  		goto failed_mount;
>  
>  	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
> @@ -4265,6 +4278,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  
>  	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
>  
> +	sbi->s_journal->j_writehint = journal_writehint;
> +
>  	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
>  
>  no_journal:
> @@ -5120,6 +5135,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
>  	int enable_quota = 0;
>  	ext4_group_t g;
>  	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> +	enum rw_hint journal_writehint = WRITE_LIFE_NOT_SET;
>  	int err = 0;
>  #ifdef CONFIG_QUOTA
>  	int i, j;
> @@ -5158,7 +5174,11 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
>  	if (sbi->s_journal && sbi->s_journal->j_task->io_context)
>  		journal_ioprio = sbi->s_journal->j_task->io_context->ioprio;
>  
> -	if (!parse_options(data, sb, NULL, &journal_ioprio, 1)) {
> +	if (sbi->s_journal)
> +		journal_writehint = sbi->s_journal->j_writehint;
> +
> +	if (!parse_options(data, sb, NULL, &journal_ioprio,
> +			   &journal_writehint, 1)) {
>  		err = -EINVAL;
>  		goto restore_opts;
>  	}
> @@ -5221,6 +5241,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
>  	if (sbi->s_journal) {
>  		ext4_init_journal_params(sb, sbi->s_journal);
>  		set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
> +		sbi->s_journal->j_writehint = journal_writehint;
>  	}
>  
>  	if (*flags & SB_LAZYTIME)
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 150cc03..8710afe 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -153,10 +153,12 @@ static int journal_submit_commit_record(journal_t *journal,
>  
>  	if (journal->j_flags & JBD2_BARRIER &&
>  	    !jbd2_has_feature_async_commit(journal))
> -		ret = submit_bh(REQ_OP_WRITE,
> -			REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
> +		ret = submit_bh_write_hint(REQ_OP_WRITE,
> +			REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh,
> +			journal->j_writehint);
>  	else
> -		ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
> +		ret = submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
> +					journal->j_writehint);
>  
>  	*cbh = bh;
>  	return ret;
> @@ -708,7 +710,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  				clear_buffer_dirty(bh);
>  				set_buffer_uptodate(bh);
>  				bh->b_end_io = journal_end_buffer_io_sync;
> -				submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
> +				submit_bh_write_hint(REQ_OP_WRITE, REQ_SYNC, bh,
> +						   journal->j_writehint);
>  			}
>  			cond_resched();
>  			stats.run.rs_blocks_logged += bufs;
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 8ef6b6d..f72ee4b 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1384,7 +1384,7 @@ static int jbd2_write_superblock(journal_t *journal, int write_flags)
>  	jbd2_superblock_csum_set(journal, sb);
>  	get_bh(bh);
>  	bh->b_end_io = end_buffer_write_sync;
> -	ret = submit_bh(REQ_OP_WRITE, write_flags, bh);
> +	ret = submit_bh_write_hint(REQ_OP_WRITE, write_flags, bh, journal->j_writehint);
>  	wait_on_buffer(bh);
>  	if (buffer_write_io_error(bh)) {
>  		clear_buffer_write_io_error(bh);
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index a1143e5..d4fd178 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -642,7 +642,7 @@ static void flush_descriptor(journal_t *journal,
>  	set_buffer_jwrite(descriptor);
>  	BUFFER_TRACE(descriptor, "write");
>  	set_buffer_dirty(descriptor);
> -	write_dirty_buffer(descriptor, REQ_SYNC);
> +	write_dirty_buffer_write_hint(descriptor, REQ_SYNC, journal->j_writehint);
>  }
>  #endif
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index b708e51..0735ab0 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1138,6 +1138,13 @@ struct journal_s
>  	 */
>  	__u32 j_csum_seed;
>  
> +	/**
> +	 * @j_writehint:
> +	 *
> +	 * write hint for journal (set by fs).
> +	 */
> +	enum rw_hint j_writehint;
> +
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  	/**
>  	 * @j_trans_commit_map:
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-10 14:12       ` Jan Kara
@ 2018-12-10 15:17         ` Jens Axboe
  2018-12-10 15:41           ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2018-12-10 15:17 UTC (permalink / raw)
  To: Jan Kara, Kanchan Joshi
  Cc: linux-ext4, linux-fsdevel, tytso, adilger.kernel, jack, viro,
	darrick.wong, jrdr.linux, ebiggers, jooyoung.hwang, chur.lee,
	prakash.v

On 12/10/18 7:12 AM, Jan Kara wrote:
> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
>> This patch introduces "j_writehint" in JBD2 journal,
>> which is set based by Ext4 depending on "journal_writehint"
>> mount option (inspired from "journal_ioprio").
> 
> Thanks for the patch! It would be good to provide the explanation you have
> in the cover letter in this patch as well so that it gets recorded in git
> logs.
> 
> Also I don't like the fact that users have to set the hint via a mount
> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
> fs.h are generally supposed to be used by userspace so it's difficult to
> pick anything if we don't know what the userspace is going to do. I'd argue
> it's even difficult for the sysadmin to pick any good value even if he
> actually knows that he might benefit from setting some. Jens, is there
> some reasonable way for fs to automatically pick some stream value for its
> journal?

I think we have two options here:

1) It's _probably_ safe to assume that journal data is short lived. While
   hints are all relative to the specific use case, the size of the journal
   compared to the rest of the drive is most likely very small. Hence a
   default of WRITE_LIFE_SHORT is probably a good idea.

2) We add a specific internal life time hint for fs journals.

#2 makes the most sense to me, but requires a bit more work...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-10 15:17         ` Jens Axboe
@ 2018-12-10 15:41           ` Jan Kara
  2018-12-10 15:44             ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2018-12-10 15:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Kanchan Joshi, linux-ext4, linux-fsdevel, tytso,
	adilger.kernel, jack, viro, darrick.wong, jrdr.linux, ebiggers,
	jooyoung.hwang, chur.lee, prakash.v

On Mon 10-12-18 08:17:18, Jens Axboe wrote:
> On 12/10/18 7:12 AM, Jan Kara wrote:
> > On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
> >> This patch introduces "j_writehint" in JBD2 journal,
> >> which is set based by Ext4 depending on "journal_writehint"
> >> mount option (inspired from "journal_ioprio").
> > 
> > Thanks for the patch! It would be good to provide the explanation you have
> > in the cover letter in this patch as well so that it gets recorded in git
> > logs.
> > 
> > Also I don't like the fact that users have to set the hint via a mount
> > option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
> > fs.h are generally supposed to be used by userspace so it's difficult to
> > pick anything if we don't know what the userspace is going to do. I'd argue
> > it's even difficult for the sysadmin to pick any good value even if he
> > actually knows that he might benefit from setting some. Jens, is there
> > some reasonable way for fs to automatically pick some stream value for its
> > journal?
> 
> I think we have two options here:
> 
> 1) It's _probably_ safe to assume that journal data is short lived. While
>    hints are all relative to the specific use case, the size of the journal
>    compared to the rest of the drive is most likely very small. Hence a
>    default of WRITE_LIFE_SHORT is probably a good idea.

That's what I was thinking as well. But there are some exceptions like
heavy DB load where there's very little of metadata modified (and thus
almost no journal IO) compared to the DB data. So journal blocks may have
actually longer life time than data blocks. OTOH if there's little journal
IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
is probably a good default anyway.

> 2) We add a specific internal life time hint for fs journals.
> 
> #2 makes the most sense to me, but requires a bit more work...

Yeah, #2 would look more natural to me but I guess it needs some mapping to
what the drive offers, doesn't it?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-10 15:41           ` Jan Kara
@ 2018-12-10 15:44             ` Jens Axboe
  2018-12-11  4:07               ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2018-12-10 15:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kanchan Joshi, linux-ext4, linux-fsdevel, tytso, adilger.kernel,
	jack, viro, darrick.wong, jrdr.linux, ebiggers, jooyoung.hwang,
	chur.lee, prakash.v

On 12/10/18 8:41 AM, Jan Kara wrote:
> On Mon 10-12-18 08:17:18, Jens Axboe wrote:
>> On 12/10/18 7:12 AM, Jan Kara wrote:
>>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
>>>> This patch introduces "j_writehint" in JBD2 journal,
>>>> which is set based by Ext4 depending on "journal_writehint"
>>>> mount option (inspired from "journal_ioprio").
>>>
>>> Thanks for the patch! It would be good to provide the explanation you have
>>> in the cover letter in this patch as well so that it gets recorded in git
>>> logs.
>>>
>>> Also I don't like the fact that users have to set the hint via a mount
>>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
>>> fs.h are generally supposed to be used by userspace so it's difficult to
>>> pick anything if we don't know what the userspace is going to do. I'd argue
>>> it's even difficult for the sysadmin to pick any good value even if he
>>> actually knows that he might benefit from setting some. Jens, is there
>>> some reasonable way for fs to automatically pick some stream value for its
>>> journal?
>>
>> I think we have two options here:
>>
>> 1) It's _probably_ safe to assume that journal data is short lived. While
>>    hints are all relative to the specific use case, the size of the journal
>>    compared to the rest of the drive is most likely very small. Hence a
>>    default of WRITE_LIFE_SHORT is probably a good idea.
> 
> That's what I was thinking as well. But there are some exceptions like
> heavy DB load where there's very little of metadata modified (and thus
> almost no journal IO) compared to the DB data. So journal blocks may have
> actually longer life time than data blocks. OTOH if there's little journal
> IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
> is probably a good default anyway.

Right, that's my probably, it would definitely not work for all cases. But
it only really fails if two uses of the same life time ends up being vastly
different. It doesn't matter if LIFE_SHORT ends up being the longest life
time of them all.

>> 2) We add a specific internal life time hint for fs journals.
>>
>> #2 makes the most sense to me, but requires a bit more work...
> 
> Yeah, #2 would look more natural to me but I guess it needs some mapping to
> what the drive offers, doesn't it?

We only used 4 streams, drives generally offer a lot more. So we can expand
it quite easily.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-10 15:44             ` Jens Axboe
@ 2018-12-11  4:07               ` Dave Chinner
  2018-12-11  5:07                 ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2018-12-11  4:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Kanchan Joshi, linux-ext4, linux-fsdevel, tytso,
	adilger.kernel, jack, viro, darrick.wong, jrdr.linux, ebiggers,
	jooyoung.hwang, chur.lee, prakash.v

On Mon, Dec 10, 2018 at 08:44:32AM -0700, Jens Axboe wrote:
> On 12/10/18 8:41 AM, Jan Kara wrote:
> > On Mon 10-12-18 08:17:18, Jens Axboe wrote:
> >> On 12/10/18 7:12 AM, Jan Kara wrote:
> >>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
> >>>> This patch introduces "j_writehint" in JBD2 journal,
> >>>> which is set based by Ext4 depending on "journal_writehint"
> >>>> mount option (inspired from "journal_ioprio").
> >>>
> >>> Thanks for the patch! It would be good to provide the explanation you have
> >>> in the cover letter in this patch as well so that it gets recorded in git
> >>> logs.
> >>>
> >>> Also I don't like the fact that users have to set the hint via a mount
> >>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
> >>> fs.h are generally supposed to be used by userspace so it's difficult to
> >>> pick anything if we don't know what the userspace is going to do. I'd argue
> >>> it's even difficult for the sysadmin to pick any good value even if he
> >>> actually knows that he might benefit from setting some. Jens, is there
> >>> some reasonable way for fs to automatically pick some stream value for its
> >>> journal?
> >>
> >> I think we have two options here:
> >>
> >> 1) It's _probably_ safe to assume that journal data is short lived. While
> >>    hints are all relative to the specific use case, the size of the journal
> >>    compared to the rest of the drive is most likely very small. Hence a
> >>    default of WRITE_LIFE_SHORT is probably a good idea.
> > 
> > That's what I was thinking as well. But there are some exceptions like
> > heavy DB load where there's very little of metadata modified (and thus
> > almost no journal IO) compared to the DB data. So journal blocks may have
> > actually longer life time than data blocks. OTOH if there's little journal
> > IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
> > is probably a good default anyway.
> 
> Right, that's my probably, it would definitely not work for all cases. But
> it only really fails if two uses of the same life time ends up being vastly
> different. It doesn't matter if LIFE_SHORT ends up being the longest life
> time of them all.
> 
> >> 2) We add a specific internal life time hint for fs journals.
> >>
> >> #2 makes the most sense to me, but requires a bit more work...
> > 
> > Yeah, #2 would look more natural to me but I guess it needs some mapping to
> > what the drive offers, doesn't it?
> 
> We only used 4 streams, drives generally offer a lot more. So we can expand
> it quite easily.

Can we get the number of stream supported from the drive? If we can
get at this at mount time, we can use high numbers down for internal
filesystem stuff, and low numbers up for user data (as already
defined by the fcntl interface).

If the hardware doesn't support streams or doesn't support any more
than the userspace interface covers, then it is probably best not to
use hints at all for metadata...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-11  4:07               ` Dave Chinner
@ 2018-12-11  5:07                 ` Jens Axboe
  2018-12-11 13:53                   ` Kanchan Joshi
  2018-12-11 21:54                   ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Jens Axboe @ 2018-12-11  5:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Kanchan Joshi, linux-ext4, linux-fsdevel, tytso,
	adilger.kernel, jack, viro, darrick.wong, jrdr.linux, ebiggers,
	jooyoung.hwang, chur.lee, prakash.v

On 12/10/18 9:07 PM, Dave Chinner wrote:
> On Mon, Dec 10, 2018 at 08:44:32AM -0700, Jens Axboe wrote:
>> On 12/10/18 8:41 AM, Jan Kara wrote:
>>> On Mon 10-12-18 08:17:18, Jens Axboe wrote:
>>>> On 12/10/18 7:12 AM, Jan Kara wrote:
>>>>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
>>>>>> This patch introduces "j_writehint" in JBD2 journal,
>>>>>> which is set based by Ext4 depending on "journal_writehint"
>>>>>> mount option (inspired from "journal_ioprio").
>>>>>
>>>>> Thanks for the patch! It would be good to provide the explanation you have
>>>>> in the cover letter in this patch as well so that it gets recorded in git
>>>>> logs.
>>>>>
>>>>> Also I don't like the fact that users have to set the hint via a mount
>>>>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
>>>>> fs.h are generally supposed to be used by userspace so it's difficult to
>>>>> pick anything if we don't know what the userspace is going to do. I'd argue
>>>>> it's even difficult for the sysadmin to pick any good value even if he
>>>>> actually knows that he might benefit from setting some. Jens, is there
>>>>> some reasonable way for fs to automatically pick some stream value for its
>>>>> journal?
>>>>
>>>> I think we have two options here:
>>>>
>>>> 1) It's _probably_ safe to assume that journal data is short lived. While
>>>>    hints are all relative to the specific use case, the size of the journal
>>>>    compared to the rest of the drive is most likely very small. Hence a
>>>>    default of WRITE_LIFE_SHORT is probably a good idea.
>>>
>>> That's what I was thinking as well. But there are some exceptions like
>>> heavy DB load where there's very little of metadata modified (and thus
>>> almost no journal IO) compared to the DB data. So journal blocks may have
>>> actually longer life time than data blocks. OTOH if there's little journal
>>> IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
>>> is probably a good default anyway.
>>
>> Right, that's my probably, it would definitely not work for all cases. But
>> it only really fails if two uses of the same life time ends up being vastly
>> different. It doesn't matter if LIFE_SHORT ends up being the longest life
>> time of them all.
>>
>>>> 2) We add a specific internal life time hint for fs journals.
>>>>
>>>> #2 makes the most sense to me, but requires a bit more work...
>>>
>>> Yeah, #2 would look more natural to me but I guess it needs some mapping to
>>> what the drive offers, doesn't it?
>>
>> We only used 4 streams, drives generally offer a lot more. So we can expand
>> it quite easily.
> 
> Can we get the number of stream supported from the drive? If we can
> get at this at mount time, we can use high numbers down for internal
> filesystem stuff, and low numbers up for user data (as already
> defined by the fcntl interface).
> 
> If the hardware doesn't support streams or doesn't support any more
> than the userspace interface covers, then it is probably best not to
> use hints at all for metadata...

Yes, we query these values. For instance, if we can't get the current
number of streams we support (4), then we don't use them. We currently
don't export this anywhere for the kernel to see, but that could be
rectified. In terms of values, the NVMe stream space is 16-bits, so
we could allocate from 65535 and down. There are no restrictions on
ordering, so it'd be perfectly fine to use your suggestion of top down
for the kernel.

In terms of hardware support, we assign a number of streams per
namespace, and there's a fixed number of concurrently open streams per
drive. We can add reservations, for instance 8, for each namespace.
That'll give you the 4 user streams, and 4 for the kernel, 65535..65532.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fs: introduce APIs to enable sending write-hint with buffer-head
  2018-12-10 13:49       ` Jan Kara
@ 2018-12-11 11:57         ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2018-12-11 11:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, tytso, adilger.kernel, jack, viro,
	darrick.wong, axboe, jrdr.linux, ebiggers, jooyoung.hwang,
	chur.lee, prakash.v

Thank you for review and suggestion.
I will make that change in V2, once things become clear for second patch
of this set.

On Monday 10 December 2018 07:19 PM, Jan Kara wrote:
> On Mon 10-12-18 18:20:03, Kanchan Joshi wrote:
>> submit_bh and write_dirty_buffer do not take write-hint as
>> parameter. This patch introduces variants which do.
>>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> 
> ...
> 
>> @@ -3162,6 +3169,20 @@ void write_dirty_buffer(struct buffer_head *bh, int op_flags)
>>   }
>>   EXPORT_SYMBOL(write_dirty_buffer);
>>   
>> +void write_dirty_buffer_write_hint(struct buffer_head *bh, int op_flags,
>> +				   enum rw_hint hint)
>> +{
>> +	lock_buffer(bh);
>> +	if (!test_clear_buffer_dirty(bh)) {
>> +		unlock_buffer(bh);
>> +		return;
>> +	}
>> +	bh->b_end_io = end_buffer_write_sync;
>> +	get_bh(bh);
>> +	submit_bh_wbc(REQ_OP_WRITE, op_flags, bh, hint, NULL);
>> +}
>> +EXPORT_SYMBOL(write_dirty_buffer_write_hint);
>> +
> 
> Please implement write_dirty_buffer() as a call to
> write_dirty_buffer_write_hint() so that we don't unnecessarily duplicate
> the code. Otherwise the patch looks good.
> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-11  5:07                 ` Jens Axboe
@ 2018-12-11 13:53                   ` Kanchan Joshi
  2018-12-11 21:54                   ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2018-12-11 13:53 UTC (permalink / raw)
  To: Jens Axboe, Dave Chinner
  Cc: Jan Kara, linux-ext4, linux-fsdevel, tytso, adilger.kernel, jack,
	viro, darrick.wong, jrdr.linux, ebiggers, jooyoung.hwang,
	chur.lee, prakash.v


On Tuesday 11 December 2018 10:37 AM, Jens Axboe wrote:
> On 12/10/18 9:07 PM, Dave Chinner wrote:
>> On Mon, Dec 10, 2018 at 08:44:32AM -0700, Jens Axboe wrote:
>>> On 12/10/18 8:41 AM, Jan Kara wrote:
>>>> On Mon 10-12-18 08:17:18, Jens Axboe wrote:
>>>>> On 12/10/18 7:12 AM, Jan Kara wrote:
>>>>>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
>>>>>>> This patch introduces "j_writehint" in JBD2 journal,
>>>>>>> which is set based by Ext4 depending on "journal_writehint"
>>>>>>> mount option (inspired from "journal_ioprio").
>>>>>>
>>>>>> Thanks for the patch! It would be good to provide the explanation you have
>>>>>> in the cover letter in this patch as well so that it gets recorded in git
>>>>>> logs.
>>>>>>
>>>>>> Also I don't like the fact that users have to set the hint via a mount
>>>>>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
>>>>>> fs.h are generally supposed to be used by userspace so it's difficult to
>>>>>> pick anything if we don't know what the userspace is going to do. I'd argue
>>>>>> it's even difficult for the sysadmin to pick any good value even if he
>>>>>> actually knows that he might benefit from setting some. Jens, is there
>>>>>> some reasonable way for fs to automatically pick some stream value for its
>>>>>> journal?
>>>>>
>>>>> I think we have two options here:
>>>>>
>>>>> 1) It's _probably_ safe to assume that journal data is short lived. While
>>>>>     hints are all relative to the specific use case, the size of the journal
>>>>>     compared to the rest of the drive is most likely very small. Hence a
>>>>>     default of WRITE_LIFE_SHORT is probably a good idea.
>>>>
>>>> That's what I was thinking as well. But there are some exceptions like
>>>> heavy DB load where there's very little of metadata modified (and thus
>>>> almost no journal IO) compared to the DB data. So journal blocks may have
>>>> actually longer life time than data blocks. OTOH if there's little journal
>>>> IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
>>>> is probably a good default anyway.
>>>
>>> Right, that's my probably, it would definitely not work for all cases. But
>>> it only really fails if two uses of the same life time ends up being vastly
>>> different. It doesn't matter if LIFE_SHORT ends up being the longest life
>>> time of them all.
>>>
>>>>> 2) We add a specific internal life time hint for fs journals.
>>>>>
>>>>> #2 makes the most sense to me, but requires a bit more work...
>>>>
>>>> Yeah, #2 would look more natural to me but I guess it needs some mapping to
>>>> what the drive offers, doesn't it?
>>>
>>> We only used 4 streams, drives generally offer a lot more. So we can expand
>>> it quite easily.
>>
>> Can we get the number of stream supported from the drive? If we can
>> get at this at mount time, we can use high numbers down for internal
>> filesystem stuff, and low numbers up for user data (as already
>> defined by the fcntl interface).
>>
>> If the hardware doesn't support streams or doesn't support any more
>> than the userspace interface covers, then it is probably best not to
>> use hints at all for metadata...
> 
> Yes, we query these values. For instance, if we can't get the current
> number of streams we support (4), then we don't use them. We currently
> don't export this anywhere for the kernel to see, but that could be
> rectified. In terms of values, the NVMe stream space is 16-bits, so
> we could allocate from 65535 and down. There are no restrictions on
> ordering, so it'd be perfectly fine to use your suggestion of top down
> for the kernel.
> 
> In terms of hardware support, we assign a number of streams per
> namespace, and there's a fixed number of concurrently open streams per
> drive. We can add reservations, for instance 8, for each namespace.
> That'll give you the 4 user streams, and 4 for the kernel, 65535..65532.

Jens,

Currently hints from two or more independent user-space applications may 
accidentally fall into same stream as there is no arbiter. By keeping 
two sets of write-hints, user-space does not get in the way of 
kernel-space. But I wonder about multiple users of this ,kernel-private, 
hint-set (say multiple FS) - who gets to use what, and whether we need 
arbiter for that. And that makes me remember your earlier work when 
streams were streams (i.e. plain numbers) and block layer managed a 
bitmap for allocation. OTOH, I doubt likelihood of in-kernel 
stream-collision in real-world.

Thanks,

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-11  5:07                 ` Jens Axboe
  2018-12-11 13:53                   ` Kanchan Joshi
@ 2018-12-11 21:54                   ` Dave Chinner
  2018-12-12 22:21                     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2018-12-11 21:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Kanchan Joshi, linux-ext4, linux-fsdevel, tytso,
	adilger.kernel, jack, viro, darrick.wong, jrdr.linux, ebiggers,
	jooyoung.hwang, chur.lee, prakash.v

On Mon, Dec 10, 2018 at 10:07:48PM -0700, Jens Axboe wrote:
> On 12/10/18 9:07 PM, Dave Chinner wrote:
> > On Mon, Dec 10, 2018 at 08:44:32AM -0700, Jens Axboe wrote:
> >> On 12/10/18 8:41 AM, Jan Kara wrote:
> >>> On Mon 10-12-18 08:17:18, Jens Axboe wrote:
> >>>> On 12/10/18 7:12 AM, Jan Kara wrote:
> >>>>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
> >>>>>> This patch introduces "j_writehint" in JBD2 journal,
> >>>>>> which is set based by Ext4 depending on "journal_writehint"
> >>>>>> mount option (inspired from "journal_ioprio").
> >>>>>
> >>>>> Thanks for the patch! It would be good to provide the explanation you have
> >>>>> in the cover letter in this patch as well so that it gets recorded in git
> >>>>> logs.
> >>>>>
> >>>>> Also I don't like the fact that users have to set the hint via a mount
> >>>>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
> >>>>> fs.h are generally supposed to be used by userspace so it's difficult to
> >>>>> pick anything if we don't know what the userspace is going to do. I'd argue
> >>>>> it's even difficult for the sysadmin to pick any good value even if he
> >>>>> actually knows that he might benefit from setting some. Jens, is there
> >>>>> some reasonable way for fs to automatically pick some stream value for its
> >>>>> journal?
> >>>>
> >>>> I think we have two options here:
> >>>>
> >>>> 1) It's _probably_ safe to assume that journal data is short lived. While
> >>>>    hints are all relative to the specific use case, the size of the journal
> >>>>    compared to the rest of the drive is most likely very small. Hence a
> >>>>    default of WRITE_LIFE_SHORT is probably a good idea.
> >>>
> >>> That's what I was thinking as well. But there are some exceptions like
> >>> heavy DB load where there's very little of metadata modified (and thus
> >>> almost no journal IO) compared to the DB data. So journal blocks may have
> >>> actually longer life time than data blocks. OTOH if there's little journal
> >>> IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
> >>> is probably a good default anyway.
> >>
> >> Right, that's my probably, it would definitely not work for all cases. But
> >> it only really fails if two uses of the same life time ends up being vastly
> >> different. It doesn't matter if LIFE_SHORT ends up being the longest life
> >> time of them all.
> >>
> >>>> 2) We add a specific internal life time hint for fs journals.
> >>>>
> >>>> #2 makes the most sense to me, but requires a bit more work...
> >>>
> >>> Yeah, #2 would look more natural to me but I guess it needs some mapping to
> >>> what the drive offers, doesn't it?
> >>
> >> We only used 4 streams, drives generally offer a lot more. So we can expand
> >> it quite easily.
> > 
> > Can we get the number of stream supported from the drive? If we can
> > get at this at mount time, we can use high numbers down for internal
> > filesystem stuff, and low numbers up for user data (as already
> > defined by the fcntl interface).
> > 
> > If the hardware doesn't support streams or doesn't support any more
> > than the userspace interface covers, then it is probably best not to
> > use hints at all for metadata...
> 
> Yes, we query these values. For instance, if we can't get the current
> number of streams we support (4), then we don't use them. We currently
> don't export this anywhere for the kernel to see, but that could be
> rectified. In terms of values, the NVMe stream space is 16-bits, so
> we could allocate from 65535 and down. There are no restrictions on
> ordering, so it'd be perfectly fine to use your suggestion of top down
> for the kernel.

That sounds perfect - right now I can't see a need for more than 4
streams for XFS filesystem metadata (log, directory blocks, inodes,
and internal btree blocks) as all the file data extent tree blocks
whould inherit the data lifetime hint as the two of them are closely
related....

> In terms of hardware support, we assign a number of streams per
> namespace, and there's a fixed number of concurrently open streams per
> drive. We can add reservations, for instance 8, for each namespace.
> That'll give you the 4 user streams, and 4 for the kernel, 65535..65532.

*nod*

That also allows us to extend the "kernel reserved" space in future
if we need to.

How do we find out if the device has the required number of streams
to enable us to use this? Or can we just always set the hint on
filesytem metadata IO and the devices will ignore it if they don't
support it? The latter would be ideal from a filesystem POV - zero
conf and just does the right thing when the hardware supports it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-11 21:54                   ` Dave Chinner
@ 2018-12-12 22:21                     ` Dave Chinner
  2018-12-19 14:10                       ` Kanchan Joshi
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2018-12-12 22:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Kanchan Joshi, linux-ext4, linux-fsdevel, tytso,
	adilger.kernel, jack, viro, darrick.wong, jrdr.linux, ebiggers,
	jooyoung.hwang, chur.lee, prakash.v

On Wed, Dec 12, 2018 at 08:54:44AM +1100, Dave Chinner wrote:
> On Mon, Dec 10, 2018 at 10:07:48PM -0700, Jens Axboe wrote:
> > On 12/10/18 9:07 PM, Dave Chinner wrote:
> > > On Mon, Dec 10, 2018 at 08:44:32AM -0700, Jens Axboe wrote:
> > >> On 12/10/18 8:41 AM, Jan Kara wrote:
> > >>> On Mon 10-12-18 08:17:18, Jens Axboe wrote:
> > >>>> On 12/10/18 7:12 AM, Jan Kara wrote:
> > >>>>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
> > >>>>>> This patch introduces "j_writehint" in JBD2 journal,
> > >>>>>> which is set based by Ext4 depending on "journal_writehint"
> > >>>>>> mount option (inspired from "journal_ioprio").
> > >>>>>
> > >>>>> Thanks for the patch! It would be good to provide the explanation you have
> > >>>>> in the cover letter in this patch as well so that it gets recorded in git
> > >>>>> logs.
> > >>>>>
> > >>>>> Also I don't like the fact that users have to set the hint via a mount
> > >>>>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
> > >>>>> fs.h are generally supposed to be used by userspace so it's difficult to
> > >>>>> pick anything if we don't know what the userspace is going to do. I'd argue
> > >>>>> it's even difficult for the sysadmin to pick any good value even if he
> > >>>>> actually knows that he might benefit from setting some. Jens, is there
> > >>>>> some reasonable way for fs to automatically pick some stream value for its
> > >>>>> journal?
> > >>>>
> > >>>> I think we have two options here:
> > >>>>
> > >>>> 1) It's _probably_ safe to assume that journal data is short lived. While
> > >>>>    hints are all relative to the specific use case, the size of the journal
> > >>>>    compared to the rest of the drive is most likely very small. Hence a
> > >>>>    default of WRITE_LIFE_SHORT is probably a good idea.
> > >>>
> > >>> That's what I was thinking as well. But there are some exceptions like
> > >>> heavy DB load where there's very little of metadata modified (and thus
> > >>> almost no journal IO) compared to the DB data. So journal blocks may have
> > >>> actually longer life time than data blocks. OTOH if there's little journal
> > >>> IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
> > >>> is probably a good default anyway.
> > >>
> > >> Right, that's my probably, it would definitely not work for all cases. But
> > >> it only really fails if two uses of the same life time ends up being vastly
> > >> different. It doesn't matter if LIFE_SHORT ends up being the longest life
> > >> time of them all.
> > >>
> > >>>> 2) We add a specific internal life time hint for fs journals.
> > >>>>
> > >>>> #2 makes the most sense to me, but requires a bit more work...
> > >>>
> > >>> Yeah, #2 would look more natural to me but I guess it needs some mapping to
> > >>> what the drive offers, doesn't it?
> > >>
> > >> We only used 4 streams, drives generally offer a lot more. So we can expand
> > >> it quite easily.
> > > 
> > > Can we get the number of stream supported from the drive? If we can
> > > get at this at mount time, we can use high numbers down for internal
> > > filesystem stuff, and low numbers up for user data (as already
> > > defined by the fcntl interface).
> > > 
> > > If the hardware doesn't support streams or doesn't support any more
> > > than the userspace interface covers, then it is probably best not to
> > > use hints at all for metadata...
> > 
> > Yes, we query these values. For instance, if we can't get the current
> > number of streams we support (4), then we don't use them. We currently
> > don't export this anywhere for the kernel to see, but that could be
> > rectified. In terms of values, the NVMe stream space is 16-bits, so
> > we could allocate from 65535 and down. There are no restrictions on
> > ordering, so it'd be perfectly fine to use your suggestion of top down
> > for the kernel.
> 
> That sounds perfect - right now I can't see a need for more than 4
> streams for XFS filesystem metadata (log, directory blocks, inodes,
> and internal btree blocks) as all the file data extent tree blocks
> whould inherit the data lifetime hint as the two of them are closely
> related....
> 
> > In terms of hardware support, we assign a number of streams per
> > namespace, and there's a fixed number of concurrently open streams per
> > drive. We can add reservations, for instance 8, for each namespace.
> > That'll give you the 4 user streams, and 4 for the kernel, 65535..65532.
> 
> *nod*
> 
> That also allows us to extend the "kernel reserved" space in future
> if we need to.
> 
> How do we find out if the device has the required number of streams
> to enable us to use this? Or can we just always set the hint on
> filesytem metadata IO and the devices will ignore it if they don't
> support it? The latter would be ideal from a filesystem POV - zero
> conf and just does the right thing when the hardware supports it...

Having looked at the block/nvme streams implementation yesterday,
it looks to me like this entails a complete rewrite of the
block layer streams functionality to enable this. There's not much
in the way of functionality there right now.

FWIW, I've got quite a few different nvme SSDs here (including
several enterprise drives), but none of them support streams. Which
means I can't test and validate anything to do with streams right
now. I'm guessing streams support is highly vendor specifici and
there are very few devices that support it right now?

At which point I have to ask: just how well tested is this
functionality? I really don't want to have XFS users exposed to yet
another round of "SSD firmware bugs corrupt XFS filesystems" because
nobody is using or testing this stuff outside of benchmarketing
exercises....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal.
  2018-12-12 22:21                     ` Dave Chinner
@ 2018-12-19 14:10                       ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2018-12-19 14:10 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe
  Cc: Jan Kara, linux-ext4, linux-fsdevel, tytso, adilger.kernel, jack,
	viro, darrick.wong, jrdr.linux, ebiggers, jooyoung.hwang,
	chur.lee, prakash.v

Hi Dave, Jens,

If it sounds okay to you, can I prepare/send v2 of the patchset which 
introduces four additional hints for in-kernel use?
Something like this -

@@ -291,6 +291,11 @@ enum rw_hint {
         WRITE_LIFE_MEDIUM       = RWH_WRITE_LIFE_MEDIUM,
         WRITE_LIFE_LONG         = RWH_WRITE_LIFE_LONG,
         WRITE_LIFE_EXTREME      = RWH_WRITE_LIFE_EXTREME,
+/* below ones are meant for in-kernel use */
+       KERN_WRITE_LIFE_SHORT,
+       KERN_WRITE_LIFE_MEDIUM,
+       KERN_WRITE_LIFE_LONG,
+       KERN_WRITE_LIFE_EXTREME
  };

And there will be no additional mount-option in FS. Rather it will be 
like what Dave mentioned - "FS always sets the hint on filesytem 
metadata IO and the devices will ignore it if they don't support it".

Thanks,
Kanchan


On Thursday 13 December 2018 03:51 AM, Dave Chinner wrote:
> On Wed, Dec 12, 2018 at 08:54:44AM +1100, Dave Chinner wrote:
>> On Mon, Dec 10, 2018 at 10:07:48PM -0700, Jens Axboe wrote:
>>> On 12/10/18 9:07 PM, Dave Chinner wrote:
>>>> On Mon, Dec 10, 2018 at 08:44:32AM -0700, Jens Axboe wrote:
>>>>> On 12/10/18 8:41 AM, Jan Kara wrote:
>>>>>> On Mon 10-12-18 08:17:18, Jens Axboe wrote:
>>>>>>> On 12/10/18 7:12 AM, Jan Kara wrote:
>>>>>>>> On Mon 10-12-18 18:20:04, Kanchan Joshi wrote:
>>>>>>>>> This patch introduces "j_writehint" in JBD2 journal,
>>>>>>>>> which is set based by Ext4 depending on "journal_writehint"
>>>>>>>>> mount option (inspired from "journal_ioprio").
>>>>>>>>
>>>>>>>> Thanks for the patch! It would be good to provide the explanation you have
>>>>>>>> in the cover letter in this patch as well so that it gets recorded in git
>>>>>>>> logs.
>>>>>>>>
>>>>>>>> Also I don't like the fact that users have to set the hint via a mount
>>>>>>>> option for this to be enabled. OTOH the WRITE_FILE_<foo> hints defined in
>>>>>>>> fs.h are generally supposed to be used by userspace so it's difficult to
>>>>>>>> pick anything if we don't know what the userspace is going to do. I'd argue
>>>>>>>> it's even difficult for the sysadmin to pick any good value even if he
>>>>>>>> actually knows that he might benefit from setting some. Jens, is there
>>>>>>>> some reasonable way for fs to automatically pick some stream value for its
>>>>>>>> journal?
>>>>>>>
>>>>>>> I think we have two options here:
>>>>>>>
>>>>>>> 1) It's _probably_ safe to assume that journal data is short lived. While
>>>>>>>     hints are all relative to the specific use case, the size of the journal
>>>>>>>     compared to the rest of the drive is most likely very small. Hence a
>>>>>>>     default of WRITE_LIFE_SHORT is probably a good idea.
>>>>>>
>>>>>> That's what I was thinking as well. But there are some exceptions like
>>>>>> heavy DB load where there's very little of metadata modified (and thus
>>>>>> almost no journal IO) compared to the DB data. So journal blocks may have
>>>>>> actually longer life time than data blocks. OTOH if there's little journal
>>>>>> IO there's no big benefit in specifying a stream for it so WRITE_LIFE_SHORT
>>>>>> is probably a good default anyway.
>>>>>
>>>>> Right, that's my probably, it would definitely not work for all cases. But
>>>>> it only really fails if two uses of the same life time ends up being vastly
>>>>> different. It doesn't matter if LIFE_SHORT ends up being the longest life
>>>>> time of them all.
>>>>>
>>>>>>> 2) We add a specific internal life time hint for fs journals.
>>>>>>>
>>>>>>> #2 makes the most sense to me, but requires a bit more work...
>>>>>>
>>>>>> Yeah, #2 would look more natural to me but I guess it needs some mapping to
>>>>>> what the drive offers, doesn't it?
>>>>>
>>>>> We only used 4 streams, drives generally offer a lot more. So we can expand
>>>>> it quite easily.
>>>>
>>>> Can we get the number of stream supported from the drive? If we can
>>>> get at this at mount time, we can use high numbers down for internal
>>>> filesystem stuff, and low numbers up for user data (as already
>>>> defined by the fcntl interface).
>>>>
>>>> If the hardware doesn't support streams or doesn't support any more
>>>> than the userspace interface covers, then it is probably best not to
>>>> use hints at all for metadata...
>>>
>>> Yes, we query these values. For instance, if we can't get the current
>>> number of streams we support (4), then we don't use them. We currently
>>> don't export this anywhere for the kernel to see, but that could be
>>> rectified. In terms of values, the NVMe stream space is 16-bits, so
>>> we could allocate from 65535 and down. There are no restrictions on
>>> ordering, so it'd be perfectly fine to use your suggestion of top down
>>> for the kernel.
>>
>> That sounds perfect - right now I can't see a need for more than 4
>> streams for XFS filesystem metadata (log, directory blocks, inodes,
>> and internal btree blocks) as all the file data extent tree blocks
>> whould inherit the data lifetime hint as the two of them are closely
>> related....
>>
>>> In terms of hardware support, we assign a number of streams per
>>> namespace, and there's a fixed number of concurrently open streams per
>>> drive. We can add reservations, for instance 8, for each namespace.
>>> That'll give you the 4 user streams, and 4 for the kernel, 65535..65532.
>>
>> *nod*
>>
>> That also allows us to extend the "kernel reserved" space in future
>> if we need to.
>>
>> How do we find out if the device has the required number of streams
>> to enable us to use this? Or can we just always set the hint on
>> filesytem metadata IO and the devices will ignore it if they don't
>> support it? The latter would be ideal from a filesystem POV - zero
>> conf and just does the right thing when the hardware supports it...
> 
> Having looked at the block/nvme streams implementation yesterday,
> it looks to me like this entails a complete rewrite of the
> block layer streams functionality to enable this. There's not much
> in the way of functionality there right now.
> 
> FWIW, I've got quite a few different nvme SSDs here (including
> several enterprise drives), but none of them support streams. Which
> means I can't test and validate anything to do with streams right
> now. I'm guessing streams support is highly vendor specifici and
> there are very few devices that support it right now?
> 
> At which point I have to ask: just how well tested is this
> functionality? I really don't want to have XFS users exposed to yet
> another round of "SSD firmware bugs corrupt XFS filesystems" because
> nobody is using or testing this stuff outside of benchmarketing
> exercises....
> 
> Cheers,
> 
> Dave.
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-12-19 14:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20181210125245epcas2p202bc4336e91a55cee3786ed6abffea05@epcas2p2.samsung.com>
2018-12-10 12:50 ` [PATCH 0/2] fs,ext4,jbd2: Specifying write-hint for Ext4 journal Kanchan Joshi
     [not found]   ` <CGME20181210125251epcas1p3023db72cc08d6b2f899141d551d53f61@epcas1p3.samsung.com>
2018-12-10 12:50     ` [PATCH 1/2] fs: introduce APIs to enable sending write-hint with buffer-head Kanchan Joshi
2018-12-10 13:49       ` Jan Kara
2018-12-11 11:57         ` Kanchan Joshi
     [not found]   ` <CGME20181210125256epcas1p1e4142cfefb9b21dfb8dad927fbd49143@epcas1p1.samsung.com>
2018-12-10 12:50     ` [PATCH 2/2] fs/ext4,jbd2: Add support for passing write-hint with journal Kanchan Joshi
2018-12-10 14:12       ` Jan Kara
2018-12-10 15:17         ` Jens Axboe
2018-12-10 15:41           ` Jan Kara
2018-12-10 15:44             ` Jens Axboe
2018-12-11  4:07               ` Dave Chinner
2018-12-11  5:07                 ` Jens Axboe
2018-12-11 13:53                   ` Kanchan Joshi
2018-12-11 21:54                   ` Dave Chinner
2018-12-12 22:21                     ` Dave Chinner
2018-12-19 14:10                       ` Kanchan Joshi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).