* [PATCH v2 0/3] XFS real-time device tweaks
@ 2017-09-02 22:41 Richard Wareing
  2017-09-02 22:41 ` [PATCH v2 1/3] fs/xfs: Add rtdisable option Richard Wareing
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Richard Wareing @ 2017-09-02 22:41 UTC (permalink / raw)
  To: linux-xfs; +Cc: Richard Wareing, david, darrick.wong

Taking another run at this based on all the feedback (much appreciated). 

- Replaced rtdefault with rtdisable; this yields similar operational
benefits when combined with the existing mkfs-time setting of the inheritance
flag on the root directory.  Allows temporarily disabling real-time allocation
without having to walk the entire FS to remove flags (which could be time consuming).
I still don't think it's super obvious to an admin that the real-time flag was put
there at mkfs time (vs. rtdefault being in the mount flags), but this gets me
half of what I'm after.
- The rtstatfs flag is removed; instead, per Dave's suggestion, we look for the
inheritance flag on the directory inode and, if it is set, fill the statfs struct
with the real-time block info, otherwise with the data device block info (a quick
userspace sketch follows this list).  Open to making this behavior a flag if folks
are worried it might be a jarring change for those used to the old behavior
(i.e. data device info no matter what).
- rtfallocmin: no changes, need to think more about this.  Still a pretty big
fan of this option for reasons already stated; at least until a more elegant
solution such as preferred AGs (we'd need a tunable size for the "preferred"
AG, since our SSD partitions are a fraction of the size of a normal AG) can
be implemented.  The only other idea I have is to make a new ioctl, e.g.
"norealtime", which causes the RT bits to stay cleared regardless of the
inheritance bits on the containing directory.  This would allow the
"steering" of files to the data device (e.g. SSD); this is probably a safer
design than defaulting to SSD and steering to the HDD via the realtime ioctl.
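
For illustration, a minimal userspace sketch of the statfs change above (patch
2/3): statfs() on a directory carrying the real-time inheritance flag should
report the real-time device's block counts.  The mount point path below is
made up.

#include <stdio.h>
#include <sys/statfs.h>

int main(void)
{
        struct statfs st;

        /* directory assumed to have the rtinherit flag set */
        if (statfs("/mnt/xfs/rt-data", &st) != 0) {
                perror("statfs");
                return 1;
        }
        printf("blocks=%llu free=%llu bsize=%ld\n",
               (unsigned long long)st.f_blocks,
               (unsigned long long)st.f_bfree,
               (long)st.f_bsize);
        return 0;
}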

Richard Wareing (3):
  fs/xfs: Add rtdisable option
  fs/xfs: Add real-time device support to statfs
  fs/xfs: Add rtfallocmin mount option

 fs/xfs/xfs_file.c  | 16 ++++++++++++++++
 fs/xfs/xfs_inode.c |  6 ++++--
 fs/xfs/xfs_ioctl.c |  7 +++++--
 fs/xfs/xfs_mount.h |  2 ++
 fs/xfs/xfs_super.c | 33 ++++++++++++++++++++++++++++++++-
 5 files changed, 59 insertions(+), 5 deletions(-)

-- 
2.9.3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/3] fs/xfs: Add rtdisable option
  2017-09-02 22:41 [PATCH v2 0/3] XFS real-time device tweaks Richard Wareing
@ 2017-09-02 22:41 ` Richard Wareing
  2017-09-02 22:41 ` [PATCH v2 2/3] fs/xfs: Add real-time device support to statfs Richard Wareing
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Richard Wareing @ 2017-09-02 22:41 UTC (permalink / raw)
  To: linux-xfs; +Cc: Richard Wareing, david, darrick.wong

- Adds an rtdisable mount option to ignore any real-time inheritance flag
  set on directories.  Designed to be used as a "kill-switch" for
  this behavior, negating the need to walk the FS to kill flags.

Signed-off-by: Richard Wareing <rwareing@fb.com>
---
 fs/xfs/xfs_inode.c |  6 ++++--
 fs/xfs/xfs_ioctl.c |  7 +++++--
 fs/xfs/xfs_mount.h |  1 +
 fs/xfs/xfs_super.c | 14 +++++++++++++-
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ec9826c..dc53731 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -878,7 +878,8 @@ xfs_ialloc(
 			uint		di_flags = 0;
 
 			if (S_ISDIR(mode)) {
-				if (pip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)
+				if (!(mp->m_flags & XFS_MOUNT_RTDISABLE) &&
+					pip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)
 					di_flags |= XFS_DIFLAG_RTINHERIT;
 				if (pip->i_d.di_flags & XFS_DIFLAG_EXTSZINHERIT) {
 					di_flags |= XFS_DIFLAG_EXTSZINHERIT;
@@ -887,7 +888,8 @@ xfs_ialloc(
 				if (pip->i_d.di_flags & XFS_DIFLAG_PROJINHERIT)
 					di_flags |= XFS_DIFLAG_PROJINHERIT;
 			} else if (S_ISREG(mode)) {
-				if (pip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)
+				if (!(mp->m_flags & XFS_MOUNT_RTDISABLE) &&
+					pip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)
 					di_flags |= XFS_DIFLAG_REALTIME;
 				if (pip->i_d.di_flags & XFS_DIFLAG_EXTSZINHERIT) {
 					di_flags |= XFS_DIFLAG_EXTSIZE;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6190697..5a6d45d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -937,6 +937,7 @@ xfs_set_diflags(
 	struct xfs_inode	*ip,
 	unsigned int		xflags)
 {
+	struct xfs_mount        *mp = ip->i_mount;
 	unsigned int		di_flags;
 	uint64_t		di_flags2;
 
@@ -957,7 +958,8 @@ xfs_set_diflags(
 	if (xflags & FS_XFLAG_FILESTREAM)
 		di_flags |= XFS_DIFLAG_FILESTREAM;
 	if (S_ISDIR(VFS_I(ip)->i_mode)) {
-		if (xflags & FS_XFLAG_RTINHERIT)
+		if (!(mp->m_flags & XFS_MOUNT_RTDISABLE) &&
+			xflags & FS_XFLAG_RTINHERIT)
 			di_flags |= XFS_DIFLAG_RTINHERIT;
 		if (xflags & FS_XFLAG_NOSYMLINKS)
 			di_flags |= XFS_DIFLAG_NOSYMLINKS;
@@ -966,7 +968,8 @@ xfs_set_diflags(
 		if (xflags & FS_XFLAG_PROJINHERIT)
 			di_flags |= XFS_DIFLAG_PROJINHERIT;
 	} else if (S_ISREG(VFS_I(ip)->i_mode)) {
-		if (xflags & FS_XFLAG_REALTIME)
+		if (!(mp->m_flags & XFS_MOUNT_RTDISABLE) &&
+			xflags & FS_XFLAG_REALTIME)
 			di_flags |= XFS_DIFLAG_REALTIME;
 		if (xflags & FS_XFLAG_EXTSIZE)
 			di_flags |= XFS_DIFLAG_EXTSIZE;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9fa312a..8016ddb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -243,6 +243,7 @@ typedef struct xfs_mount {
 						   allocator */
 #define XFS_MOUNT_NOATTR2	(1ULL << 25)	/* disable use of attr2 format */
 
+#define XFS_MOUNT_RTDISABLE    (1ULL << 61)    /* Ignore RT flags */
 #define XFS_MOUNT_DAX		(1ULL << 62)	/* TEST ONLY! */
 
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575..4dbf95c 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -83,7 +83,7 @@ enum {
 	Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, Opt_prjquota,
 	Opt_uquota, Opt_gquota, Opt_pquota,
 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
-	Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
+	Opt_discard, Opt_nodiscard, Opt_dax, Opt_rtdisable, Opt_err,
 };
 
 static const match_table_t tokens = {
@@ -133,6 +133,9 @@ static const match_table_t tokens = {
 
 	{Opt_dax,	"dax"},		/* Enable direct access to bdev pages */
 
+#ifdef CONFIG_XFS_RT
+	{Opt_rtdisable, "rtdisable"},   /* Ignore real-time flags */
+#endif
 	/* Deprecated mount options scheduled for removal */
 	{Opt_barrier,	"barrier"},	/* use writer barriers for log write and
 					 * unwritten extent conversion */
@@ -367,6 +370,11 @@ xfs_parseargs(
 		case Opt_nodiscard:
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
 			break;
+#ifdef CONFIG_XFS_RT
+		case Opt_rtdisable:
+			mp->m_flags |= XFS_MOUNT_RTDISABLE;
+			break;
+#endif
 #ifdef CONFIG_FS_DAX
 		case Opt_dax:
 			mp->m_flags |= XFS_MOUNT_DAX;
@@ -492,6 +500,10 @@ xfs_showargs(
 		{ XFS_MOUNT_DISCARD,		",discard" },
 		{ XFS_MOUNT_SMALL_INUMS,	",inode32" },
 		{ XFS_MOUNT_DAX,		",dax" },
+#ifdef CONFIG_XFS_RT
+		{ XFS_MOUNT_RTDISABLE,          ",rtdisable" },
+#endif
+
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/3] fs/xfs: Add real-time device support to statfs
  2017-09-02 22:41 [PATCH v2 0/3] XFS real-time device tweaks Richard Wareing
  2017-09-02 22:41 ` [PATCH v2 1/3] fs/xfs: Add rtdisable option Richard Wareing
@ 2017-09-02 22:41 ` Richard Wareing
  2017-09-03  8:49   ` Christoph Hellwig
  2017-09-02 22:41 ` [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option Richard Wareing
  2017-09-03  8:56 ` [PATCH v2 0/3] XFS real-time device tweaks Christoph Hellwig
  3 siblings, 1 reply; 17+ messages in thread
From: Richard Wareing @ 2017-09-02 22:41 UTC (permalink / raw)
  To: linux-xfs; +Cc: Richard Wareing, david, darrick.wong

- Reports real-time device free blocks in statfs calls if the
inheritance bit is set on the directory's inode.  This is a bit more
intuitive, especially for use cases where a much larger device is
used for the real-time device.

Signed-off-by: Richard Wareing <rwareing@fb.com>
---
 fs/xfs/xfs_super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4dbf95c..a1d6968 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1148,6 +1148,12 @@ xfs_fs_statfs(
 	    ((mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))) ==
 			      (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))
 		xfs_qm_statvfs(ip, statp);
+	if ((ip->i_d.di_flags & XFS_DIFLAG_RTINHERIT) &&
+		(mp->m_rtdev_targp != NULL)) {
+		statp->f_blocks = sbp->sb_rblocks;
+		statp->f_bfree = sbp->sb_frextents * sbp->sb_rextsize -
+			mp->m_alloc_set_aside;
+	}
 	return 0;
 }
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option
  2017-09-02 22:41 [PATCH v2 0/3] XFS real-time device tweaks Richard Wareing
  2017-09-02 22:41 ` [PATCH v2 1/3] fs/xfs: Add rtdisable option Richard Wareing
  2017-09-02 22:41 ` [PATCH v2 2/3] fs/xfs: Add real-time device support to statfs Richard Wareing
@ 2017-09-02 22:41 ` Richard Wareing
  2017-09-03  8:50   ` Christoph Hellwig
  2017-09-03  8:56 ` [PATCH v2 0/3] XFS real-time device tweaks Christoph Hellwig
  3 siblings, 1 reply; 17+ messages in thread
From: Richard Wareing @ 2017-09-02 22:41 UTC (permalink / raw)
  To: linux-xfs; +Cc: Richard Wareing, david, darrick.wong

- Gates real-time block device fallocations to rtfallocmin bytes
- Use case: Allows developers to send files to the SSD with ease simply
  by fallocating them; if they are below rtfallocmin, XFS will allocate the
  blocks from the non-RT device (e.g. an SSD)
- Useful to automagically store small files on the SSD vs. the RT device
  (HDD) for tiered XFS setups without having to rely on XFS-specific
  ioctl calls.  Userland tools such as rsync can also use fallocation
  behavior to migrate files between the SSD and RT (HDD) device without
  modification (e.g. w/ --preallocate flag).

Signed-off-by: Richard Wareing <rwareing@fb.com>
---
 fs/xfs/xfs_file.c  | 16 ++++++++++++++++
 fs/xfs/xfs_mount.h |  1 +
 fs/xfs/xfs_super.c | 15 ++++++++++++++-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5fb5a09..a29f6e8 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -730,6 +730,7 @@ xfs_file_fallocate(
 {
 	struct inode		*inode = file_inode(file);
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
 	long			error;
 	enum xfs_prealloc_flags	flags = 0;
 	uint			iolock = XFS_IOLOCK_EXCL;
@@ -749,6 +750,21 @@ xfs_file_fallocate(
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 	iolock |= XFS_MMAPLOCK_EXCL;
 
+	/*
+	 * If fallocating a file < rtfallocmin store it on the non RT device.
+	 * In a tiered storage setup, this device might be a device suitable
+	 * for better small file storage/performance (e.g. SSD).
+	 */
+	if (mp->m_rtdev_targp && mp->m_rtfallocmin && !offset &&
+			!inode->i_size) {
+		if (len >= mp->m_rtfallocmin) {
+			ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
+		/* Clear flag if inheritence or rtdefault is being used */
+		} else {
+			ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
+		}
+	}
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8016ddb..f5593bb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -84,6 +84,7 @@ typedef struct xfs_mount {
 	char			*m_fsname;	/* filesystem name */
 	int			m_fsname_len;	/* strlen of fs name */
 	char			*m_rtname;	/* realtime device name */
+	int			m_rtfallocmin;  /* Min size for RT fallocate */
 	char			*m_logname;	/* external log device name */
 	int			m_bsize;	/* fs logical block size */
 	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a1d6968..01a2ab4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -83,7 +83,8 @@ enum {
 	Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota, Opt_prjquota,
 	Opt_uquota, Opt_gquota, Opt_pquota,
 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
-	Opt_discard, Opt_nodiscard, Opt_dax, Opt_rtdisable, Opt_err,
+	Opt_discard, Opt_nodiscard, Opt_dax, Opt_rtdisable, Opt_rtfallocmin,
+	Opt_err,
 };
 
 static const match_table_t tokens = {
@@ -135,6 +136,9 @@ static const match_table_t tokens = {
 
 #ifdef CONFIG_XFS_RT
 	{Opt_rtdisable, "rtdisable"},   /* Ignore real-time flags */
+	{Opt_rtfallocmin, "rtfallocmin=%u"}, /* Min fallocation required
+										  * for rt device
+										  */
 #endif
 	/* Deprecated mount options scheduled for removal */
 	{Opt_barrier,	"barrier"},	/* use writer barriers for log write and
@@ -374,6 +378,10 @@ xfs_parseargs(
 		case Opt_rtdisable:
 			mp->m_flags |= XFS_MOUNT_RTDISABLE;
 			break;
+		case Opt_rtfallocmin:
+			if (match_int(args, &mp->m_rtfallocmin))
+				return -EINVAL;
+			break;
 #endif
 #ifdef CONFIG_FS_DAX
 		case Opt_dax:
@@ -538,6 +546,11 @@ xfs_showargs(
 	if (mp->m_rtname)
 		seq_show_option(m, "rtdev", mp->m_rtname);
 
+#ifdef CONFIG_XFS_RT
+	if (mp->m_rtfallocmin > 0)
+		seq_printf(m, ",rtfallocmin=%d", mp->m_rtfallocmin);
+#endif
+
 	if (mp->m_dalign > 0)
 		seq_printf(m, ",sunit=%d",
 				(int)XFS_FSB_TO_BB(mp, mp->m_dalign));
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/3] fs/xfs: Add real-time device support to statfs
  2017-09-02 22:41 ` [PATCH v2 2/3] fs/xfs: Add real-time device support to statfs Richard Wareing
@ 2017-09-03  8:49   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2017-09-03  8:49 UTC (permalink / raw)
  To: Richard Wareing; +Cc: linux-xfs, david, darrick.wong

On Sat, Sep 02, 2017 at 03:41:44PM -0700, Richard Wareing wrote:
> - Reports real-time device free blocks in statfs calls if
> inheritance bit is set on the inode of directory.  This is a bit more
> intuitive, especially for use-cases which are using a much larger
> device for the real-time device.
> 
> Signed-off-by: Richard Wareing <rwareing@fb.com>
> ---
>  fs/xfs/xfs_super.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 4dbf95c..a1d6968 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1148,6 +1148,12 @@ xfs_fs_statfs(
>  	    ((mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))) ==
>  			      (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))
>  		xfs_qm_statvfs(ip, statp);
> +	if ((ip->i_d.di_flags & XFS_DIFLAG_RTINHERIT) &&
> +		(mp->m_rtdev_targp != NULL)) {

	if ((ip->i_d.di_flags & XFS_DIFLAG_RTINHERIT) && mp->m_rtdev_targp) {

> +		statp->f_blocks = sbp->sb_rblocks;
> +		statp->f_bfree = sbp->sb_frextents * sbp->sb_rextsize -
> +			mp->m_alloc_set_aside;
> +	}

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option
  2017-09-02 22:41 ` [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option Richard Wareing
@ 2017-09-03  8:50   ` Christoph Hellwig
  2017-09-03 22:04     ` Richard Wareing
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2017-09-03  8:50 UTC (permalink / raw)
  To: Richard Wareing; +Cc: linux-xfs, david, darrick.wong

On Sat, Sep 02, 2017 at 03:41:45PM -0700, Richard Wareing wrote:
> - Gates real-time block device fallocations to rtfallocmin bytes
> - Use case: Allows developers to send files to the SSD with ease simply
>   by fallocating them, if they are below rtfallocmin XFS will allocate the
>   blocks from the non-RT device (e.g. an SSD)
> - Useful to automagically store small files on the SSD vs. RT device
>   (HDD) for tiered XFS setups without having to rely on XFS specific
>   ioctl calls.  Userland tools such as rsync can also use fallocation
>   behavior to migrate files between SSD and RT (HDD) device without
>   modification (e.g. w/ --preallocate flag).

I'd be much happier if this was done inside the allocator, and in
effect for any initial allocation, not just fallocate, as that keeps
the layering and logic much cleaner.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-02 22:41 [PATCH v2 0/3] XFS real-time device tweaks Richard Wareing
                   ` (2 preceding siblings ...)
  2017-09-02 22:41 ` [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option Richard Wareing
@ 2017-09-03  8:56 ` Christoph Hellwig
  2017-09-03 22:02   ` Richard Wareing
  3 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2017-09-03  8:56 UTC (permalink / raw)
  To: Richard Wareing; +Cc: linux-xfs, david, darrick.wong

On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing wrote:
> - Replaced rtdefault with rtdisable, this yields similar operational
> benefits when combined with the existing mkfs time setting of the inheritance
> flag on the root directory.  Allows temporary disabling of real-time allocation
> without having to walk entire FS to remove flags (which could be time consuming).
> I still don't think it's super obvious to an admin the real-time flag was put
> there at mkfs time (vs. rtdefault being in mount flags), but this gets me
> half of what I'm after.

I still don't understand this option.  What is the use case of
dynamically switching on/off these default to the rt device?

> - rtfallocmin no changes, need to think more about this.  Still a pretty big
> fan of this option for reasons already stated; at least until a more elegant
> solution such as preferred AGs (we'd need a tunable size for the "preferred"
> AG, since our SSD partitions are a fraction of the size of a normal AG) can 
> be implemented.  The only other idea I have is to make a new ioctl e.g. 
> "norealtime", which causes the RT bits to stay cleared regardless of 
> inheritance bits on the containing directory.  This would allow the 
> "steering" of files to the data device (e.g. SSD); this is probably a safer 
> design than defaulting to SSD and steering to the HDD via the realtime ioctl.  

Jens just added a nice new fcntl to declare the life time of write
streams (and in theory can add other I/O hints).

How about a mount option that moves all I/O with a given hint
to the RT device?  E.g. rt=longlife would direct I/O on a file
with an rw hint of RWH_WRITE_LIFE_LONG or RWH_WRITE_LIFE_EXTREME to the
RT subvolume as long as there aren't any previous extents.
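
For context, a minimal sketch of the application side of this suggestion: the
write lifetime hint is set per file with the new fcntl, and the hypothetical
rt=longlife mount option would key off that hint at allocation time.  The file
name below is made up; F_SET_RW_HINT and the RWH_* constants come from
<linux/fcntl.h> as of Linux 4.13.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT           1036    /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_LONG
#define RWH_WRITE_LIFE_LONG     4
#endif

int main(void)
{
        uint64_t hint = RWH_WRITE_LIFE_LONG;
        int fd = open("archive.dat", O_CREAT | O_WRONLY, 0644);

        if (fd < 0)
                return 1;
        /*
         * Declare long-lived data; under a hypothetical rt=longlife mount
         * option, the first allocation for this file would be steered to
         * the RT subvolume.
         */
        if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
                perror("F_SET_RW_HINT");
        close(fd);
        return 0;
}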

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-03  8:56 ` [PATCH v2 0/3] XFS real-time device tweaks Christoph Hellwig
@ 2017-09-03 22:02   ` Richard Wareing
  2017-09-06  3:44     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Wareing @ 2017-09-03 22:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david, darrick.wong


> On Sep 3, 2017, at 1:56 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing wrote:
>> - Replaced rtdefault with rtdisable, this yields similar operational
>> benefits when combined with the existing mkfs time setting of the inheritance
>> flag on the root directory.  Allows temporary disabling of real-time allocation
>> without having to walk entire FS to remove flags (which could be time consuming).
>> I still don't think it's super obvious to an admin the real-time flag was put
>> there at mkfs time (vs. rtdefault being in mount flags), but this gets me
>> half of what I'm after.
> 
> I still don't understand this option.  What is the use case of
> dynamically switching on/off these default to the rt device?
> 

Say you are in a bit of an emergency, and you need IOPs *now* (incident recovery), w/ rtdisable you could funnel the IO to the SSD without having to strip the inheritance bits from all the directories (which would require two walks....one to remove and one to add them all back).    I think this is about having some options during incidents, and a "kill-switch" should the need arise.

>> - rtfallocmin no changes, need to think more about this.  Still a pretty big
>> fan of this option for reasons already stated; at least until a more elegant
>> solution such as preferred AGs (we'd need a tunable size for the "preferred"
>> AG, since our SSD partitions are a fraction of the size of a normal AG) can 
>> be implemented.  The only other idea I have is to make a new ioctl e.g. 
>> "norealtime", which causes the RT bits to stay cleared regardless of 
>> inheritance bits on the containing directory.  This would allow the 
>> "steering" of files to the data device (e.g. SSD); this is probably a safer 
>> design than defaulting to SSD and steering to the HDD via the realtime ioctl.  
> 
> Jens just added a nice new fcntl to declare the life time of write
> streams (and in theory can add other I/O hints).
> 
> How about a mount option that moves all I/O with a given hint
> to the RT device?  E.g. rt=longlife would direct I/O on a file
> with an rw hint of RWH_WRITE_LIFE_LONG or RWH_WRITE_LIFE_EXTREME to the
> RT subvolume as long as there aren't any previous extents.

You seem to trust application developers more than I do :).  The problem I see with using the lifetime or allocation size as a hint is that a user could later append to the file and fill up the SSD.  A "norealtime" or fallocation request is a bit more explicit and high signal about the intent vs. the lifetime or allocation size alone.  It's possible I might trickle writes into a file which ultimately becomes very very large (e.g. logging), or perhaps introduce a performance or buffering bug which triggers smaller writes (allocations) and altered write lifetimes.  With fallocmin, this won't happen, as the assumption/relationship here is clear: you are declaring your intent to write a file of N bytes, and based on that we promote or demote you to the appropriate tier of storage.

The other problem I see is accessibility and usability.  By burying these decisions in more generic XFS allocation mechanisms or fcntls, few developers are going to really understand how to safely use them (e.g. without blowing up their SSD's WAF or endurance).  Fallocation is a better understood notion, easier to use and has wider support amongst existing utilities.  Keep in mind, we need our SSDs to last >3 years (w/ only a mere 70-80 TBW; 300 TBW if we are lucky), so we want to design things such that application developers are less likely to step on land mines causing pre-mature SSD failure.

Whatever the ultimate solution here, it should be designed such that it's relatively difficult to accidentally write data to the non-RT device (e.g. the SSD in our case); intent must be clear and high signal.  Thus my similar "high-signal" bias in my first patchset w/ rtdefault; sure, the inheritance bits should be there if somebody set them at mkfs time, but if somehow they were removed, it could wind up costing tens of millions of dollars in reduced SSD write lifetime at our scale.  An explicit mount option makes me sleep better at night: things like chef/cfengine can enforce it through traditional policy mechanisms, and removing the behavior has a higher bar (remount + change chef/cfengine) than a trivial call to xfs_io.  From a production engineering/reliability standpoint the design decision is pretty clear; inheritance bits are nearly unenforceable with policy engines such as cfengine or chef (somebody/something could remove a bit buried in the FS and you'd find it only by walking the entire FS), and as a result they are bombs waiting to go off compared to the rtdefault flag.

I'd ideally like to take things even further with fallocmin, and revert to the RT device should the non-RT device fill up (subtracting some % of space for metadata); this brings the behavior more along the lines of a "preferred" device vs. a must-have.


> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option
  2017-09-03  8:50   ` Christoph Hellwig
@ 2017-09-03 22:04     ` Richard Wareing
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Wareing @ 2017-09-03 22:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david, darrick.wong


> On Sep 3, 2017, at 1:50 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Sat, Sep 02, 2017 at 03:41:45PM -0700, Richard Wareing wrote:
>> - Gates real-time block device fallocations to rtfallocmin bytes
>> - Use case: Allows developers to send files to the SSD with ease simply
>>  by fallocating them, if they are below rtfallocmin XFS will allocate the
>>  blocks from the non-RT device (e.g. an SSD)
>> - Useful to automagically store small files on the SSD vs. RT device
>>  (HDD) for tiered XFS setups without having to rely on XFS specific
>>  ioctl calls.  Userland tools such as rsync can also use fallocation
>>  behavior to migrate files between SSD and RT (HDD) device without
>>  modification (e.g. w/ --preallocate flag).
> 
> I'd be much happier if this was done inside the allocator, and in
> effect for any initial allocation, not just fallocate, as that keeps
> the layering and logic much cleaner.

See my comments on your reply to my cover letter.  Would love to hear your thoughts on my reply.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-03 22:02   ` Richard Wareing
@ 2017-09-06  3:44     ` Dave Chinner
  2017-09-06  6:54       ` Richard Wareing
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2017-09-06  3:44 UTC (permalink / raw)
  To: Richard Wareing; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Sun, Sep 03, 2017 at 10:02:41PM +0000, Richard Wareing wrote:
> 
> > On Sep 3, 2017, at 1:56 AM, Christoph Hellwig
> > <hch@infradead.org> wrote:
> > 
> > On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing
> > wrote:
> >> - Replaced rtdefault with rtdisable, this yields similar
> >> operational benefits when combined with the existing mkfs time
> >> setting of the inheritance flag on the root directory.  Allows
> >> temporary disabling of real-time allocation without having to
> >> walk entire FS to remove flags (which could be time consuming).
> >> I still don't think it's super obvious to an admin the
> >> real-time flag was put there at mkfs time (vs. rtdefault being
> >> in mount flags), but this gets me half of what I'm after.
> > 
> > I still don't understand this option.  What is the use case of
> > dynamically switching on/off these default to the rt device?
> > 
> 
> Say you are in a bit of an emergency, and you need IOPs *now*
> (incident recovery), w/ rtdisable you could funnel the IO to the
> SSD

But it /doesn't do that/. It only disables new files from writing to
the rt device. All reads for data in the RT device and writes to
existing files still go to the RT device.


> without having to strip the inheritance bits from all the
> directories (which would require two walks....one to remove and
> one to add them all back).    I think this is about having some
> options during incidents, and a "kill-switch" should the need
> arise.

And soon after the kill switch is triggered, your tiny data device
will go ENOSPC because changing that mount option effectively removed
TBs of free space from the filesystem. Then things will really start
going bad.

So maybe you didn't think this through properly - the last thing a
typical user would expect is a filesystem reporting TBs of free
space to go ENOSPC and not being able to recover, regardless of what
mount options are present. And they'll be especially confused when
they start looking at inodes and seeing RT bits set all over the
place...

It's just a recipe for confusion, unexpected behaviour and all I
see here is a support and triage nightmare. Not to mention FB will
move on to something else in a couple of years, and we get stuck
having to maintain it forever more (*cough* filestreams *cough*).

> The other problem I see is accessibility and usability.  By making
> these decisions buried in more generic XFS allocation mechanisms
> or fnctl's, few developers are going to really understand how to
> safely use them (e.g. without blowing up their SSD's WAF or
> endurance). 

The whole point of putting them into the XFS allocator as admin
policies is that *applications developers don't need to know they
exist*.

> Fallocation is a better understood notion, easier to
> use and has wider support amongst existing utilities.

Almost every application I've seen that uses fallocate does
something wrong and/or breaks a longevity or performance
optimisation that filesystems have been making for years. 

fallocate is "easy to understand" but *difficult to use optimally*
because it's behaviour is tightly bound to the filesystem allocator
algorithms. i.e. it's easy to defeat hidden filesystem optimisations
with fallocate, but it's difficult to understand a sub-optimal
corner case in the filesystem allocator that fallocate could be used
to avoid.

In reality, we don't want people using fallocate - the filesystem
algorithms should do the right thing so people don't need to modify
their applications. In cases like this, having the filesystem decide
automatically at first allocation what device to use is the right
way to integrate the functionality, not require users to use
fallocate to trigger such a decision and, as a side effect, prevent
the filesystem from making all the other optimisations they still
want it to make.

> Keep in
> mind, we need our SSDs to last >3 years (w/ only a mere 70-80 TBW;
> 300TBW if we are lucky), so we want to design things such that
> application developers are less likely to step on land mines
> causing pre-mature SSD failure.

Hmmm. I don't think the way you are using fallocate is doing what
you think it is doing.

That is, using fallocate to preallocate all files so you can direct
allocation to a different device means that delayed allocation is
turned off. Hence XFS cannot optimise allocation across multiple
files at writeback time. This means that writeback across multiple
files will be sprayed around disjointed preallocated regions. When
using delayed allocation, the filesystem will allocate the blocks
for all the files sequentially and so the block layer will merge
them all into one big contiguous IO.

IOWs, fallocate sprays write IO around because it decouples
allocation locality from temporal writeback locality, and this causes
non-contiguous write patterns which are a significant contributing
factor to write amplification in SSDs.  In comparison, delayed
allocation results in large sequential IOs that minimise write
amplification in the SSD...

Hence the method you describe that "maximises SSD life" won't help
- if anything it's going to actively harm the SSD life when
compared to just letting the filesystem use delayed allocation and
choose what device to write to at that time....

Hacking one-off high level controls into APIs like fallocate does
not work. Allocation policies need to be integrated into the
filesystem allocators for them to be effective and useful to
administrators and applications alike. fallocate is no substitute for
the black magic that filesystems do to optimise allocation and IO
patterns....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-06  3:44     ` Dave Chinner
@ 2017-09-06  6:54       ` Richard Wareing
  2017-09-06 11:19         ` Dave Chinner
  2017-09-06 11:43         ` Brian Foster
  0 siblings, 2 replies; 17+ messages in thread
From: Richard Wareing @ 2017-09-06  6:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, darrick.wong


On 9/5/17, 8:45 PM, "Dave Chinner" <david@fromorbit.com> wrote:

    On Sun, Sep 03, 2017 at 10:02:41PM +0000, Richard Wareing wrote:
    > 
    > > On Sep 3, 2017, at 1:56 AM, Christoph Hellwig
    > > <hch@infradead.org> wrote:
    > > 
    > > On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing
    > > wrote:
    > >> - Replaced rtdefault with rtdisable, this yields similar
    > >> operational benefits when combined with the existing mkfs time
    > >> setting of the inheritance flag on the root directory.  Allows
    > >> temporary disabling of real-time allocation without having to
    > >> walk entire FS to remove flags (which could be time consuming).
    > >> I still don't think it's super obvious to an admin the
    > >> real-time flag was put there at mkfs time (vs. rtdefault being
    > >> in mount flags), but this gets me half of what I'm after.
    > > 
    > > I still don't understand this option.  What is the use case of
    > > dynamically switching on/off these default to the rt device?
    > > 
    > 
    > Say you are in a bit of an emergency, and you need IOPs *now*
    > (incident recovery), w/ rtdisable you could funnel the IO to the
    > SSD
    
    But it /doesn't do that/. It only disables new files from writing to
    the rt device. All reads for data in the RT device and writes to
    existing files still go to the RT device.
    
    
    > without having to strip the inheritance bits from all the
    > directories (which would require two walks....one to remove and
    > one to add them all back).    I think this is about having some
    > options during incidents, and a "kill-switch" should the need
    > arise.
    
    And soon after the kill switch is triggered, your tiny data device
    will go ENOSPC because changing that mount option effectively removed
    TBs of free space from the filesystem. Then things will really start
    going bad.
    
    So maybe you didn't think this through properly - the last thing a
    typical user would expect is a filesystem reporting TBs of free
    space to go ENOSPC and not being able to recover, regardless of what
    mount options are present. And they'll be especially confused when
    they start looking at inodes and seeing RT bits set all over the
    place...
    
    It's just a recipe for confusion, unexpected behaviour and all I
    see here is a support and triage nightmare. Not to mention FB will
    move on to something else in a couple of years, and we get stuck
    having to maintain it forever more (*cough* filestreams *cough*).
    
Fair enough, what are your thoughts on rtdefault, if I changed it to *not* set the inheritance bits, but take over this responsibility in their place?  My thinking here is this integrates better than inheritance bits w/ policy management systems such as Chef/Puppet.  Inheritance bits, on the other hand don't really lend themselves to machine level policies; they can be sprinkled about all over the FS, and a walk would be required to enforce a machine wide policy.

Or instead of a mount option, would a sysfs option be acceptable?

My hope is we don't move on, but collaborate a bit more with the open-source world on these sorts of problems instead of re-inventing the proverbial FS wheel (and re-learning old lessons solved many moons ago by FS developers).  Trying to do my part now, show it can be done and should be done.

    > The other problem I see is accessibility and usability.  By making
    > these decisions buried in more generic XFS allocation mechanisms
    > or fnctl's, few developers are going to really understand how to
    > safely use them (e.g. without blowing up their SSD's WAF or
    > endurance). 
    
    The whole point of putting them into the XFS allocator as admin
    policies is that *applications developers don't need to know they
    exist*.
    
I get you now: *admins* need to know, but application developers not so much.

    > Fallocation is a better understood notion, easier to
    > use and has wider support amongst existing utilities.
    
    Almost every application I've seen that uses fallocate does
    something wrong and/or breaks a longevity or performance
    optimisation that filesystems have been making for years. 
    
    fallocate is "easy to understand" but *difficult to use optimally*
    because its behaviour is tightly bound to the filesystem allocator
    algorithms. i.e. it's easy to defeat hidden filesystem optimisations
    with fallocate, but it's difficult to understand a sub-optimal
    corner case in the filesystem allocator that fallocate could be used
    to avoid.
    
    In reality, we don't want people using fallocate - the filesystem
    algorithms should do the right thing so people don't need to modify
    their applications. In cases like this, having the filesystem decide
    automatically at first allocation what device to use is the right
    way to integrate the functionality, not require users to use
    fallocate to trigger such a decision and, as a side effect, prevent
    the filesystem from making all the other optimisations they still
    want it to make.

You make a good point here, on preventing the FS from making other optimizations.  I'm re-working this as you and others have suggested (new version tomorrow).

And xfs_fsr would be the home for code migrating the file to the real-time device once it grows beyond some tunable size.  

    > Keep in
    > mind, we need our SSDs to last >3 years (w/ only a mere 70-80 TBW;
    > 300TBW if we are lucky), so we want to design things such that
    > application developers are less likely to step on land mines
    > causing pre-mature SSD failure.
    
    Hmmm. I don't think the way you are using fallocate is doing what
    you think it is doing.
    
    That is, using fallocate to preallocate all files so you can direct
    allocation to a different device means that delayed allocation is
    turned off. Hence XFS cannot optimise allocation across multiple
    files at writeback time. This means that writeback across multiple
    files will be sprayed around disjointed preallocated regions. When
    using delayed allocation, the filesystem will allocate the blocks
    for all the files sequentially and so the block layer will merge
    them all into one big contiguous IO.
    
    IOWs, fallocate sprays write IO around because it decouples
    allocation locality from temporal writeback locality, and this causes
    non-contiguous write patterns which are a significant contributing
    factor to write amplification in SSDs.  In comparison, delayed
    allocation results in large sequential IOs that minimise write
    amplification in the SSD...
    
    Hence the method you describe that "maximises SSD life" won't help
    - if anything it's going to actively harm the SSD life when
    compared to just letting the filesystem use delayed allocation and
    choose what device to write to at that time....

Wrt SSDs you are completely correct on this; our fallocate calls were intended to pay up front on the write path for more favorable allocations which pay off during reads on HDDs.  For SSDs this clearly makes less sense, and it is an optimization we will need to make in our code for the reasons you point out.

    Hacking one-off high level controls into APIs like fallocate does
    not work. Allocation policies need to be integrated into the
    filesystem allocators for them to be effective and useful to
    administrators and applications alike. fallocate is no substitute for
    the black magic that filesystems do to optimise allocation and IO
    patterns....
    
    Cheers,
    
    Dave.
    -- 
    Dave Chinner
    david@fromorbit.com
    

Thanks for the great comments, suggestions & insights.  Learning a lot.

Richard



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-06  6:54       ` Richard Wareing
@ 2017-09-06 11:19         ` Dave Chinner
  2017-09-06 11:43         ` Brian Foster
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2017-09-06 11:19 UTC (permalink / raw)
  To: Richard Wareing; +Cc: Christoph Hellwig, linux-xfs, darrick.wong

On Wed, Sep 06, 2017 at 06:54:41AM +0000, Richard Wareing wrote:
> On 9/5/17, 8:45 PM, "Dave Chinner" <david@fromorbit.com> wrote: On
> Sun, Sep 03, 2017 at 10:02:41PM +0000, Richard Wareing wrote:
>     > without having to strip the inheritance bits from all the
>     > directories (which would require two walks....one to remove
>     > and one to add them all back).    I think this is about
>     > having some options during incidents, and a "kill-switch"
>     > should the need arise.
>     
>     And soon after the kill switch is triggered, your tiny data
>     device will go ENOSPC because changing that mount option
>     effectively removed TBs of free space from the filesystem. Then
>     things will really start going bad.
>     
>     So maybe you didn't think this through properly - the last
>     thing a typical user would expect is a filesystem reporting
>     TBs of free space to go ENOSPC and not being able to recover,
>     regardless of what mount options are present. And they'll be
>     especially confused when they start looking at inodes and
>     seeing RT bits set all over the place...
>     
>     It's just a recipe for confusion, unexpected behaviour and all
>     I see here is a support and triage nightmare. Not to mention
>     FB will move on to something else in a couple of years, and we
>     get stuck having to maintain it forever more (*cough*
>     filestreams *cough*).
>     
> Fair enough, what are your thoughts on rtdefault,

I don't think it's necessary. If you want automatic selection of
the target device based on the first allocation size, then the
first data allocation on a file will add the RT flag to the inode
before calling into the RT allocator....
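
Roughly, such an allocator-side policy might take the following shape; this is
only a sketch, not code from any posted patch: the helper name is invented, and
m_rtfallocmin is the (bytes) field added in patch 3/3 of this series.

/* Sketch: pick the target device at the first data allocation. */
static void
xfs_bmap_select_rtdev(
        struct xfs_bmalloca     *ap)
{
        struct xfs_inode        *ip = ap->ip;
        struct xfs_mount        *mp = ip->i_mount;

        if (!mp->m_rtdev_targp || !mp->m_rtfallocmin)
                return;
        if (ip->i_d.di_nextents > 0)    /* only the first allocation picks a device */
                return;

        if (XFS_FSB_TO_B(mp, ap->length) >= mp->m_rtfallocmin)
                ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;        /* large: RT device (HDD) */
        else
                ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;       /* small: data device (SSD) */
}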

> Or instead of a mount option, would a sysfs option be acceptable?

sysfs is preferable for options that are dynamically configurable.

> My hope is we don't move on, but collaborate a bit more with the
> open-source world on these sorts of problems instead of
> re-inventing the proverbial FS wheel (and re-learning old lessons
> solved many moons ago by FS developers).  Trying to do my part
> now, show it can be done and should be done.

Sure, nobody here has said what you are doing is conceptually
unsound. All of the comments have been about the implementation and
trying to understand what features from the implementation actually
provide you with the benefit. Then we can focus in on a solid,
maintainable solution...

>     > The other problem I see is accessibility and usability.  By making
>     > these decisions buried in more generic XFS allocation mechanisms
>     > or fnctl's, few developers are going to really understand how to
>     > safely use them (e.g. without blowing up their SSD's WAF or
>     > endurance). 
>     
>     The whole point of putting them into the XFS allocator as admin
>     policies is that *applications developers don't need to know they
>     exist*.
>     
> I get you now: *admins* need to know, but application developers not so much.

Yeah, exactly. Sorry for not making this clearer. In general, we
try to make the fs do the right thing by default and so tuning is
not necessary. But if tuning is necessary, the policy is set by the
admin and not the application as the admin knows a lot more about
their specific hardware and execution context than an application
developer.

>     In reality, we don't want people using fallocate - the
>     filesystem algorithms should do the right thing so people
>     don't need to modify their applications. In cases like this,
>     having the filesystem decide automatically at first allocation
>     what device to use is the right way to integrate the
>     functionality, not require users to use fallocate to trigger
>     such a decision and, as a side effect, prevent the filesystem
>     from making all the other optimisations they still want it to
>     make.
> 
> You make a good point here, on preventing the FS from making other
> optimizations.  I'm re-working this as you and others have
> suggested (new version tomorrow).

OK.

> And xfs_fsr would be the home for code migrating the file to the
> real-time device once it grows beyond some tunable size.  

Keep in mind that the allocation xfs_fsr does will follow whatever
policy is currently in force. e.g. if a large file is on the wrong
device, then just running the existing defrag operation on it should
relocate the data to the correct device. Sure, fsr might need some
help to recognise what "wrong device" means in its inode scan
routines, but the mechanism to move the data should be pretty much
unchanged...
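
As a rough userspace illustration of such a "wrong device" check (the size
threshold and the idea of wiring this into fsr's inode scan are assumptions,
not anything posted in this thread):

#include <stdio.h>
#include <stdbool.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR, FS_XFLAG_REALTIME */

#define RT_SIZE_THRESHOLD       (256 * 1024 * 1024)     /* made-up 256MB cutoff */

/* A file is on the "wrong device" if its size and realtime flag disagree. */
static bool on_wrong_device(int fd)
{
        struct stat st;
        struct fsxattr fsx;

        if (fstat(fd, &st) || ioctl(fd, FS_IOC_FSGETXATTR, &fsx))
                return false;   /* can't tell; leave the file alone */

        bool on_rt = fsx.fsx_xflags & FS_XFLAG_REALTIME;
        bool should_be_rt = st.st_size >= RT_SIZE_THRESHOLD;

        return on_rt != should_be_rt;
}

int main(int argc, char **argv)
{
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        printf("%s: %s\n", argv[1],
               on_wrong_device(fd) ? "candidate for relocation" : "ok");
        close(fd);
        return 0;
}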

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-06  6:54       ` Richard Wareing
  2017-09-06 11:19         ` Dave Chinner
@ 2017-09-06 11:43         ` Brian Foster
  2017-09-06 12:12           ` Dave Chinner
  1 sibling, 1 reply; 17+ messages in thread
From: Brian Foster @ 2017-09-06 11:43 UTC (permalink / raw)
  To: Richard Wareing; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, darrick.wong

On Wed, Sep 06, 2017 at 06:54:41AM +0000, Richard Wareing wrote:
> 
> On 9/5/17, 8:45 PM, "Dave Chinner" <david@fromorbit.com> wrote:
> 
>     On Sun, Sep 03, 2017 at 10:02:41PM +0000, Richard Wareing wrote:
>     > 
>     > > On Sep 3, 2017, at 1:56 AM, Christoph Hellwig
>     > > <hch@infradead.org> wrote:
>     > > 
>     > > On Sat, Sep 02, 2017 at 03:41:42PM -0700, Richard Wareing
>     > > wrote:
>     > >> - Replaced rtdefault with rtdisable, this yields similar
>     > >> operational benefits when combined with the existing mkfs time
>     > >> setting of the inheritance flag on the root directory.  Allows
>     > >> temporary disabling of real-time allocation without having to
>     > >> walk entire FS to remove flags (which could be time consuming).
>     > >> I still don't think it's super obvious to an admin the
>     > >> real-time flag was put there at mkfs time (vs. rtdefault being
>     > >> in mount flags), but this gets me half of what I'm after.
>     > > 
>     > > I still don't understand this option.  What is the use case of
>     > > dynamically switching on/off these default to the rt device?
>     > > 
>     > 
>     > Say you are in a bit of an emergency, and you need IOPs *now*
>     > (incident recovery), w/ rtdisable you could funnel the IO to the
>     > SSD
>     
>     But it /doesn't do that/. It only disables new files from writing to
>     the rt device. All reads for data in the RT device and writes to
>     existing files still go to the RT device.
>     
>     
>     > without having to strip the inheritance bits from all the
>     > directories (which would require two walks....one to remove and
>     > one to add them all back).    I think this is about having some
>     > options during incidents, and a "kill-switch" should the need
>     > arise.
>     
>     And soon after the kill switch is triggered, your tiny data device
>     will go ENOSPC because changing that mount option effectively removed
>     TBs of free space from the filesystem. Then things will really start
>     going bad.
>     
>     So maybe you didn't think this through properly - the last thing a
>     typical user would expect is a filesystem reporting TBs of free
>     space to go ENOSPC and not being able to recover, regardless of what
>     mount options are present. And they'll be especially confused when
>     they start looking at inodes and seeing RT bits set all over the
>     place...
>     
>     It's just a recipe for confusion, unexpected behaviour and all I
>     see here is a support and triage nightmare. Not to mention FB will
>     move on to something else in a couple of years, and we get stuck
>     having to maintain it forever more (*cough* filestreams *cough*).
>     
> Fair enough, what are your thoughts on rtdefault, if I changed it to *not* set the inheritance bits, but take over this responsibility in their place?  My thinking here is this integrates better than inheritance bits w/ policy management systems such as Chef/Puppet.  Inheritance bits, on the other hand don't really lend themselves to machine level policies; they can be sprinkled about all over the FS, and a walk would be required to enforce a machine wide policy.
> 
> Or instead of a mount option, would a sysfs option be acceptable?
> 
> My hope is we don't move on, but collaborate a bit more with the open-source world on these sorts of problems instead of re-inventing the proverbial FS wheel (and re-learning old lessons solved many moons ago by FS developers).  Trying to do my part now, show it can be done and should be done.
> 

FWIW, I'm still a little confused as to the need for this mechanism.
What exactly is the use case 1) in your specific environment and 2)
for a traditional realtime user?

Something like rtdefault (or an rtro option for realtime readonly
behavior) seems a bit more generic to me if one wanted broad control
over the feature, but your fallocate mount thingy seems to already
accomplish that. I.e., if you made that thing set/clear RT on individual
files based purely on file size and you had a need to quickly disable
setting RT on new files, why can't you just remount without that option?
It seems to me you wouldn't need to care about the RT inherit flag
either way..?

>     > The other problem I see is accessibility and usability.  By making
>     > these decisions buried in more generic XFS allocation mechanisms
>     > or fnctl's, few developers are going to really understand how to
>     > safely use them (e.g. without blowing up their SSD's WAF or
>     > endurance). 
>     
>     The whole point of putting them into the XFS allocator as admin
>     policies is that *applications developers don't need to know they
>     exist*.
>     
> I get you now: *admins* need to know, but application developers not so much.
> 
>     > Fallocation is a better understood notion, easier to
>     > use and has wider support amongst existing utilities.
>     
>     Almost every application I've seen that uses fallocate does
>     something wrong and/or breaks a longevity or performance
>     optimisation that filesystems have been making for years. 
>     
>     fallocate is "easy to understand" but *difficult to use optimally*
>     because its behaviour is tightly bound to the filesystem allocator
>     algorithms. i.e. it's easy to defeat hidden filesystem optimisations
>     with fallocate, but it's difficult to understand a sub-optimal
>     corner case in the filesystem allocator that fallocate could be used
>     to avoid.
>     
>     In reality, we don't want people using fallocate - the filesystem
>     algorithms should do the right thing so people don't need to modify
>     their applications. In cases like this, having the filesystem decide
>     automatically at first allocation what device to use is the right
>     way to integrate the functionality, not require users to use
>     fallocate to trigger such a decision and, as a side effect, prevent
>     the filesystem from making all the other optimisations they still
>     want it to make.
> 
> You make a good point here, on preventing the FS from making other optimizations.  I'm re-working this as you and others have suggested (new version tomorrow).
> 
> And xfs_fsr would be the home for code migrating the file to the real-time device once it grows beyond some tunable size.  
> 

I pretty much agree with everything Dave says here, along with
Christoph's previous suggestion that this is better off in the allocator
than in the fallocate path. In the end, I think your current environment
won't know the difference because you fallocate everything up front
anyways (notwithstanding Dave's explanation as to why that might not be
the greatest idea, however). In fact, I think this would be much more
interesting overall if we could tier per-extent allocation rather than
per-file, but that of course is one of the limitations of using RT.

That said, while the implementation improvement makes sense, I'm still
not necessarily convinced that this has a place in the upstream realtime
feature. I'll grant you that I'm not terribly familiar with the
historical realtime use case.. Dave, do you see value in such a
heuristic as it relates to the realtime feature (not this tiering
setup)? Is there necessarily a mapping between a large file size and a
file that should be tagged realtime? E.g., I suppose somebody who is
using traditional realtime (i.e., no SSD) and has a mix of legitimate
realtime (streaming media) files and large sparse virt disk images or
something of that nature would need to know to not use this feature
(i.e., this requires documentation)..?

Brian

>     > Keep in
>     > mind, we need our SSDs to last >3 years (w/ only a mere 70-80 TBW;
>     > 300TBW if we are lucky), so we want to design things such that
>     > application developers are less likely to step on land mines
>     > causing pre-mature SSD failure.
>     
>     Hmmm. I don't think the way you are using fallocate is doing what
>     you think it is doing.
>     
>     That is, using fallocate to preallocate all files so you can direct
>     allocation to a different device means that delayed allocation is
>     turned off. Hence XFS cannot optimise allocation across multiple
>     files at writeback time. This means that writeback across multiple
>     files will be sprayed around disjointed preallocated regions. When
>     using delayed allocation, the filesystem will allocate the blocks
>     for all the files sequentially and so the block layer will merge
>     them all into one big contiguous IO.
>     
>     IOWs, fallocate sprays write IO around because it decouples
>     allocation locality from temporal writeback locality, and this causes
>     non-contiguous write patterns which are a significant contributing
>     factor to write amplification in SSDs.  In comparison, delayed
>     allocation results in large sequential IOs that minimise write
>     amplification in the SSD...
>     
>     Hence the method you describe that "maximises SSD life" won't help
>     - if anything it's going to actively harm the SSD life when
>     compared to just letting the filesystem use delayed allocation and
>     choose what device to write to at that time....
> 
> Wrt to SSDs you are completely correct on this, our fallocate calls were intended to pay up front on the write path for more favorable allocations which pay off during reads on HDDs.  For SSDs this clearly makes less sense, and an optimization we will need to make in our code for the reasons you point out.
> 
>     Hacking one-off high level controls into APIs like fallocate does
>     not work. Allocation policies need to be integrated into the
>     filesystem allocators for them to be effective and useful to
>     administrators and applications alike. fallocate is no substitute for
>     the black magic that filesystems do to optimise allocation and IO
>     patterns....
>     
>     Cheers,
>     
>     Dave.
>     -- 
>     Dave Chinner
>     david@fromorbit.com
>     
> 
> Thanks for the great comments, suggestions & insights.  Learning a lot.
> 
> Richard
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-06 11:43         ` Brian Foster
@ 2017-09-06 12:12           ` Dave Chinner
  2017-09-06 12:49             ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2017-09-06 12:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: Richard Wareing, Christoph Hellwig, linux-xfs, darrick.wong

On Wed, Sep 06, 2017 at 07:43:05AM -0400, Brian Foster wrote:
> That said, while the implementation improvement makes sense, I'm still
> not necessarily convinced that this has a place in the upstream realtime
> feature. I'll grant you that I'm not terribly familiar with the
> historical realtime use case.. Dave, do you see value in such a
> heuristic as it relates to the realtime feature (not this tiering
> setup)? Is there necessarily a mapping between a large file size and a
> file that should be tagged realtime?

I don't see it much differently to the inode32 allocator policy.
That separates metadata from data based on the type of allocation
that is going to take place.  inode32 decides on the AG for the
inode data on the first data allocation (via the ag rotor), so
there's already precedence for this sort of "locality selection at
initial allocation" policy in the XFS allocation algorithms. 

Some workloads run really well on inode32 because the metadata ends
up tightly packed and you can keep lots of disks busy with a dm
concat because data IO is effectively distributed over all AGs.
We've never done that automatically with the rt device before, but
if it allows hybrid setups to be constructed easily then I can see
it being beneficial to those same sorts of workloads....

And, FWIW, auto rtdev selection might also work quite nicely with
write once large file workloads (i.e. archives) on SMR drives - data
device for the PMR region for metadata and small or temporary files,
rt device w/ appropriate extent size for large files in the SMR
region...

> E.g., I suppose somebody who is
> using traditional realtime (i.e., no SSD) and has a mix of legitimate
> realtime (streaming media) files and large sparse virt disk images or
> something of that nature would need to know to not use this feature
> (i.e., this requires documentation)..?

It wouldn't be enabled by default. We can't break existing rt device
setups, so I don't see any issue here. And, well, someone mixing
realtime and sparse virt in the same filesystem and storage isn't
going to get reliable realtime response. i.e. nobody in their right
mind mixes realtime streaming workloads with anything else - it's
always dedicated hardware for RT....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-06 12:12           ` Dave Chinner
@ 2017-09-06 12:49             ` Brian Foster
  2017-09-06 23:29               ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2017-09-06 12:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Richard Wareing, Christoph Hellwig, linux-xfs, darrick.wong

On Wed, Sep 06, 2017 at 10:12:01PM +1000, Dave Chinner wrote:
> On Wed, Sep 06, 2017 at 07:43:05AM -0400, Brian Foster wrote:
> > That said, while the implementation improvement makes sense, I'm still
> > not necessarily convinced that this has a place in the upstream realtime
> > feature. I'll grant you that I'm not terribly familiar with the
> > historical realtime use case.. Dave, do you see value in such a
> > heuristic as it relates to the realtime feature (not this tiering
> > setup)? Is there necessarily a mapping between a large file size and a
> > file that should be tagged realtime?
> 
> I don't see it much differently to the inode32 allocator policy.
> That separates metadata from data based on the type of allocation
> that is going to take place.  inode32 decides on the AG for the
> inode data on the first data allocation (via the ag rotor), so
> there's already precedence for this sort of "locality selection at
> initial allocation" policy in the XFS allocation algorithms. 
> 
> Some workloads run really well on inode32 because the metadata ends
> up tightly packed and you can keep lots of disks busy with a dm
> concat because data IO is effectively distributed over all AGs.
> We've never done that automatically with the rt device before, but
> if it allows hybrid setups to be constructed easily then I can see
> it being beneficial to those same sorts of workloads....
> 
> And, FWIW, auto rtdev selection might also work quite nicely with
> write once large file workloads (i.e. archives) on SMR drives - data
> device for the PMR region for metadata and small or temporary files,
> rt device w/ appropriate extent size for large files in the SMR
> region...
> 

Ok, that sounds reasonable enough to me. Thanks.

> > E.g., I suppose somebody who is
> > using traditional realtime (i.e., no SSD) and has a mix of legitimate
> > realtime (streaming media) files and large sparse virt disk images or
> > something of that nature would need to know to not use this feature
> > (i.e., this requires documentation)..?
> 
> It wouldn't be enabled by default. We can't break existing rt device
> setups, so I don't see any issue here. And, well, someone mixing
> realtime and sparse virt in the same filesystem and storage isn't
> going to get reliable realtime response. i.e. nobody in their right
> mind mixes realtime streaming workloads with anything else - it's
> always dedicated hardware for RT....
> 

Yes, that's just a dumb example. Let me rephrase...

Is there any legitimate realtime use case where a filesystem may not want to
tag all files of a particular size? E.g., this is more relevant for
subsequent read requirements than anything, right? (If not, then why do
we have the flag at all?) If so, then it seems to me this needs to be
clearly documented...

Note that this use case defines large as >256k. Realtime use cases may
have a much different definition, yes? I take it that means things like
amount of physical memory and write workload may also be a significant
factor in the effectiveness of this heuristic. For example, how much
pagecache can we dirty before writeback occurs and does an initial
allocation? How many large files are typically written in parallel?
Also, what about direct I/O or extent size hints?

All I'm really saying is that I think this at least needs to consider
the generic use case and have some documentation around any scenarios
where this might not make sense for traditional users, what values might
be sane, etc. As opposed to such users seeing an "automagic" knob,
turning it on thinking it replaces the need to think about how to
properly lay out the fs and then realizing later that this doesn't do
what they expect. Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-06 12:49             ` Brian Foster
@ 2017-09-06 23:29               ` Dave Chinner
  2017-09-07 11:58                 ` Brian Foster
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2017-09-06 23:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: Richard Wareing, Christoph Hellwig, linux-xfs, darrick.wong

On Wed, Sep 06, 2017 at 08:49:28AM -0400, Brian Foster wrote:
> On Wed, Sep 06, 2017 at 10:12:01PM +1000, Dave Chinner wrote:
> > On Wed, Sep 06, 2017 at 07:43:05AM -0400, Brian Foster wrote:
> > > That said, while the implementation improvement makes sense, I'm still
> > > not necessarily convinced that this has a place in the upstream realtime
> > > feature. I'll grant you that I'm not terribly familiar with the
> > > historical realtime use case.. Dave, do you see value in such a
> > > heuristic as it relates to the realtime feature (not this tiering
> > > setup)? Is there necessarily a mapping between a large file size and a
> > > file that should be tagged realtime?
> > 
> > I don't see it much differently to the inode32 allocator policy.
> > That separates metadata from data based on the type of allocation
> > that is going to take place.  inode32 decides on the AG for the
> > inode data on the first data allocation (via the ag rotor), so
> > there's already precedence for this sort of "locality selection at
> > initial allocation" policy in the XFS allocation algorithms. 
> > 
> > Some workloads run really well on inode32 because the metadata ends
> > up tightly packed and you can keep lots of disks busy with a dm
> > concat because data IO is effectively distributed over all AGs.
> > We've never done that automatically with the rt device before, but
> > if it allows hybrid setups to be constructed easily then I can see
> > it being beneficial to those same sorts of workloads....
> > 
> > And, FWIW, auto rtdev selection might also work quite nicely with
> > write once large file workloads (i.e. archives) on SMR drives - data
> > device for the PMR region for metadata and small or temporary files,
> > rt device w/ appropriate extent size for large files in the SMR
> > region...
> > 
> 
> Ok, that sounds reasonable enough to me. Thanks.
> 
> > > E.g., I suppose somebody who is
> > > using traditional realtime (i.e., no SSD) and has a mix of legitimate
> > > realtime (streaming media) files and large sparse virt disk images or
> > > something of that nature would need to know to not use this feature
> > > (i.e., this requires documentation)..?
> > 
> > It wouldn't be enabled by default. We can't break existing rt device
> > setups, so I don't see any issue here. And, well, someone mixing
> > realtime and sparse virt in the same filesystem and storage isn't
> > going to get reliable realtime response. i.e. nobody in their right
> > mind mixes realtime streaming workloads with anything else - it's
> > always dedicated hardware for RT....
> > 
> 
> Yes, that's just a dumb example. Let me rephrase...
> 
> Is there any legitimate realtime use case where a filesystem may not want to
> tag all files of a particular size?  E.g., this is more relevant for
> subsequent read requirements than anything, right? (If not, then why do
> we have the flag at all?) If so, then it seems to me this needs to be
> clearly documented...

Hmmm. I'm not following you here, Brian. RT has a deterministic
allocator to prevent arbitrary IO delays on write, not read.  The
read side on RT is no different to the data device (i.e. extent
lookup, read data) and as long as both allocators have given the
file large contiguous extents there's no difference in the size and
shape of read IOs being issued, either.  So I'm not sure what you
are saying needs documenting?

Also, keep in mind the RT device is not suited to small files at
all. It's optimised for allocating large contiguous extents, it
doesn't handle freespace fragmentation at all well so having small
files come and go regularly really screws it up, and its single
threaded allocator means it can't handle the allocation demand that
comes along with small file workloads, either.....

> Note that this use case defines large as >256k. Realtime use cases may
> have a much different definition, yes?

Again, if the workload is "realtime"(*) then it is not going to be
using this functionality - everything needs to be tightly controlled
and leave nothing to unpredictable algorithmic heuristics.
Regardless, for different *data sets* the size threshold might be
different, but that is for the admin who understands the environment
and applications to separate workloads and set appropriate policy
for each.

If you're only worried about it being a fs global setting, then
start thinking about how to do it per inode/directory.  Personally,
though, I think we need to start moving all the allocation policy
stuff (extsize hints, flags, etc) into a generic alloc policy xattr
space otherwise we're going to run out of space in the inode core
for all this alloc policy stuff...

> I take it that means things like amount of physical memory and
> write workload may also be a significant factor in the
> effectiveness of this heuristic.  For example, how much pagecache
> can we dirty before writeback occurs and does an initial
> allocation?  How many large files are typically written in parallel?

Delayed allocation on large files works just fine regardless of these
parameter variations - that's the whole point of all the heuristics
in the delalloc code to prevent fragmentation. IOWs, machine loading
and workload should not significantly impact on what device large
files are written to because it's rare that large files get
allocated in tiny chunks by XFS.

Where mistakes are made, xfs_fsr can relocate the files
appropriately. And the good part about having the metadata on SSD is
that the xfs_fsr scan to find such files (i.e. bulkstat) won't impact
on the running workload significantly.

> Also, what about direct I/O or extent size hints?

If you are doing direct IO, then it's up to the admin and
application to make sure it's not doing something silly. The
usual raft of alloc policy controls like extent
size hints, preallocation and/or actually setting the rt inherit
bits manually on data set directories can deal with issues here...
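
For reference, those knobs can all be driven from userspace; the
following is a sketch (the directory path and the 1MiB hint are example
values, not anything from these patches) that marks a directory so new
files created under it get the realtime bit and inherit an extent size
hint:

/*
 * Sketch: set FS_XFLAG_RTINHERIT and FS_XFLAG_EXTSZINHERIT on a
 * directory via FS_IOC_FSSETXATTR so files created below it are
 * allocated on the rt device with the given extent size hint.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int main(void)
{
	struct fsxattr fsx;
	int fd;

	fd = open("/mnt/scratch/rt-data", O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		close(fd);
		return 1;
	}
	fsx.fsx_xflags |= FS_XFLAG_RTINHERIT | FS_XFLAG_EXTSZINHERIT;
	fsx.fsx_extsize = 1024 * 1024;	/* example extent size hint, bytes */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}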

> All I'm really saying is that I think this at least needs to consider
> the generic use case and have some documentation around any scenarios
> where this might not make sense for traditional users, what values might
> be sane, etc.

I think you're conflating "integrating new functionality in a
generic manner" with "this is new generic functionality everyone
should use".  CRCs and reflink fall into the latter category, while
allocation policies for rtdevs fall into the former....

> As opposed to such users seeing an "automagic" knob,
> turning it on thinking it replaces the need to think about how to
> properly lay out the fs and then realizing later that this doesn't do
> what they expect. Thoughts?

ISTM that you are over-thinking the problem. :/

We should document how something can/should be used, not iterate all
the cases where it should not be used because they vastly outnumber
the valid use cases. I can see how useful a simple setup like
Richard has described is for efficient long term storage in large
scale storage environments. I think we should aim to support that
cleanly and efficiently first, not try to make it into something
that nobody is asking for....

Cheers,

Dave.

(*) <rant warning>

The "realtime" device isn't real time at all. It's a shit name and I
hate it because it makes people think it's something that it isn't.
It's just an alternative IO address space with a bound overhead
(i.e. deterministic) allocator that is optimised for large
contiguous data allocations.  It's used for workloads that are
latency sensitive, not "real time". The filesystem is not real time
capable and the IO subsystem is most definitely not real time
capable. It's a crap name.

<end rant>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/3] XFS real-time device tweaks
  2017-09-06 23:29               ` Dave Chinner
@ 2017-09-07 11:58                 ` Brian Foster
  0 siblings, 0 replies; 17+ messages in thread
From: Brian Foster @ 2017-09-07 11:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Richard Wareing, Christoph Hellwig, linux-xfs, darrick.wong

On Thu, Sep 07, 2017 at 09:29:54AM +1000, Dave Chinner wrote:
> On Wed, Sep 06, 2017 at 08:49:28AM -0400, Brian Foster wrote:
> > On Wed, Sep 06, 2017 at 10:12:01PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 06, 2017 at 07:43:05AM -0400, Brian Foster wrote:
...
> > 
> > Yes, that's just a dumb example. Let me rephrase...
> > 
> > Is there any legitimate realtime use case where a filesystem may not want to
> > tag all files of a particular size?  E.g., this is more relevant for
> > subsequent read requirements than anything, right? (If not, then why do
> > we have the flag at all?) If so, then it seems to me this needs to be
> > clearly documented...
> 
> Hmmm. I'm not following you here, Brian. RT has a deterministic
> allocator to prevent arbitrary IO delays on write, not read.  The
> read side on RT is no different to the data device (i.e. extent
> lookup, read data) and as long as both allocators have given the
> file large contiguous extents there's no difference in the size and
> shape of read IOs being issued, either.  So I'm not sure what you
> are saying needs documenting?
> 

Er, Ok. I may be conflating the use cases between traditional rt and
this one. Sorry, I'm also not explaining myself clearly wrt my
questions, but I think you manage to close in on them anyways...

...
> > Note that this use case defines large as >256k. Realtime use cases may
> > have a much different definition, yes?
> 
> Again, if the workload is "realtime"(*) then it is not going to be
> using this functionality - everything needs to be tightly controlled
> and leave nothing to unpredictable algorithmic heuristics.
> Regardless, for different *data sets* the size threshold might be
> different, but that is for the admin who understands the environment
> and applications to separate workloads and set appropriate policy
> for each.
> 

Ok, so the above says that basically if somebody is using traditional
RT, they shouldn't be using this mount option at all. That's the part
that I think needs to be called out. :) If we add/document an
rt-oriented mount option, we should probably explain that there are very
special conditions where this should be used ("tiering" via SSD,
archives to SMR, etc.). Either your workload closely matches these
conditions or you shouldn't use this option.

That pretty much answers my question wrt traditional realtime. It
also seems like a red flag for a one-off hack, but I digress (for now,
more on this later). ;P Moving on from the traditional RT use case, this
raises a similar question for those who might want to legitimately use
this feature for the SSD use case: what are those conditions their
workload needs to meet?

> If you're only worried about it being a fs global setting, then
> start thinking about how to do it per inode/directory.  Personally,
> though, I think we need to start moving all the allocation policy
> stuff (extsize hints, flags, etc) into a generic alloc policy xattr
> space otherwise we're going to run out of space in the inode core
> for all this alloc policy stuff...
> 
> > I take it that means things like amount of physical memory and
> > write workload may also be a significant factor in the
> > effectiveness of this heuristic.  For example, how much pagecache
> > can we dirty before writeback occurs and does an initial
> > allocation?  How many large files are typically written in parallel?
> 
> Delayed allocation on large files works just fine regardless of these
> parameter variations - that's the whole point of all the heuristics
> in the delalloc code to prevent fragmentation. IOWs, machine loading
> and workload should not significantly impact on what device large
> files are written to because it's rare that large files get
> allocated in tiny chunks by XFS.
> 

So we create a mount option that automatically assigns a file to the
appropriate device based on the inode size at the time of the first
physical allocation. This works fine for fb because they 1.) define a
relatively small threshold of 256k and 2.) fallocate every file up
front.

But a tunable is a tunable, so suppose another user comes along, thinks
they otherwise match the conditions to use this feature on a DVR or
something of that nature. The device has a smaller SSD, bigger HDD (the
rtdev) and 512GB RAM. Files are either pretty small (KB-MB) and should
remain on the root SSD or multi-GB and should go to the HDD, so the user
sets a threshold of 1GB (another dumb example, just assume it's valid
with respect to the dataset). This probably won't work and it's not
obvious why to somebody who doesn't understand the implementation of
this hack (because "file size at first alloc" is really a
non-deterministic transient when it comes down to it). So is this
feature simply not suitable for this environment? Does the user need to
set a smaller threshold that's some percentage of physical RAM? This is
the type of stuff I think needs to be described somewhere.

Repeat that scenario for another user who has a similar workload to fb,
wants to ship off everything larger than a few MB to a spinning rust
rtdev, but otherwise have many concurrent writers of such files. This
isn't a problem for fb because of their generally single-threaded
workload, but all this user knows is we've advertised a mechanism that
can be used to do big/small file tiering between an SSD and HDD. This
user otherwise has no reason to know or care about the RT allocator.
This is, of course, also not likely to perform as the user expects.
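
In both of the hypothetical scenarios above, one thing the user can
check cheaply after the fact is where a given file's data actually
landed. A small sketch (the fallback path is a placeholder) that
queries the realtime flag via FS_IOC_FSGETXATTR:

/*
 * Sketch: report whether a file carries FS_XFLAG_REALTIME, i.e.
 * whether its data blocks come from the realtime device.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/scratch/somefile";
	struct fsxattr fsx;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror(path);
		return 1;
	}
	printf("%s: %s device\n", path,
	       (fsx.fsx_xflags & FS_XFLAG_REALTIME) ? "realtime" : "data");
	close(fd);
	return 0;
}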

...
> 
> ISTM that you are over-thinking the problem. :/
> 
> We should document how something can/should be used, not iterate all
> the cases where it should not be used because they vastly outnumber
> the valid use cases. I can see how useful a simple setup like
> Richard has described is for efficient long term storage in large
> scale storage environments. I think we should aim to support that
> cleanly and efficiently first, not try to make it into something
> that nobody is asking for....
> 

Yes, I understand. I'm not concerned about this feature being generic or
literally enumerating all of the reasons not to use it. ;)

For one, I'm concerned that this may not be as useful for many users
outside of fb, if any (based on the current XFS RT oriented design) [1],
precisely because of the highly controlled/constrained workload
requirements. Second, I think that highly constrained workload needs to
be documented.

I understand that the realtime allocator has all these constraints and
limitations as to where it should and should not be used. My point is
that if we're adding a mount option on top that traditional RT users
should never use and we call it the "file size tiering between SSD/HDD
option," then I think we're opening the door for significant confusion
for users to think they can accomplish what fb has without actually
running into the limitations of the RT allocator.

IOW, users will come along with no care at all for RT and just want to
do this cool SSD/HDD tiering thing. Hence, I think this non-rt, rt,
tiering mount option needs to very specifically describe that those rt
limitations still exist and that performance might not be as expected
unless those constraints are met. Make sense?

Brian

[1] First, I'm not against merging this if you and others think there is
a real use case (moreso because I don't care much about RT and will
likely keep it disabled :). But as noted a couple times above, the more
I think about this the more I think the current implementation of this
is really not for anybody but fb. I'm not convinced the majority of
users who would want to use this kind of tiering mechanism could do so
in a way that navigates around the limitations of RT. I could have too
insular a view of the potential use cases or be overestimating how
limiting RT really is, of course. That's just my .02.

> Cheers,
> 
> Dave.
> 
> (*) <rant warning>
> 
> The "realtime" device isn't real time at all. It's a shit name and I
> hate it because it makes people think it's something that it isn't.
> It's just an alternative IO address space with a bound overhead
> (i.e. deterministic) allocator that is optimised for large
> contiguous data allocations.  It's used for workloads that are
> latency sensitive, not "real time". The filesystem is not real time
> capable and the IO subsystem is most definitely not real time
> capable. It's a crap name.
> 
> <end rant>
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-09-07 11:58 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-02 22:41 [PATCH v2 0/3] XFS real-time device tweaks Richard Wareing
2017-09-02 22:41 ` [PATCH v2 1/3] fs/xfs: Add rtdisable option Richard Wareing
2017-09-02 22:41 ` [PATCH v2 2/3] fs/xfs: Add real-time device support to statfs Richard Wareing
2017-09-03  8:49   ` Christoph Hellwig
2017-09-02 22:41 ` [PATCH v2 3/3] fs/xfs: Add rtfallocmin mount option Richard Wareing
2017-09-03  8:50   ` Christoph Hellwig
2017-09-03 22:04     ` Richard Wareing
2017-09-03  8:56 ` [PATCH v2 0/3] XFS real-time device tweaks Christoph Hellwig
2017-09-03 22:02   ` Richard Wareing
2017-09-06  3:44     ` Dave Chinner
2017-09-06  6:54       ` Richard Wareing
2017-09-06 11:19         ` Dave Chinner
2017-09-06 11:43         ` Brian Foster
2017-09-06 12:12           ` Dave Chinner
2017-09-06 12:49             ` Brian Foster
2017-09-06 23:29               ` Dave Chinner
2017-09-07 11:58                 ` Brian Foster
