All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/3] XFS realtime device tweaks
@ 2017-09-25 19:44 Richard Wareing
  2017-09-25 19:44 ` [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set Richard Wareing
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Richard Wareing @ 2017-09-25 19:44 UTC (permalink / raw)
  To: linux-xfs; +Cc: david, darrick.wong, hch

The patch set has been updated per reviewer suggestions, all changes are
noted in the change log for each patch.  The most notable change is the
addition of xfs_inode_select_target and the refactoring which took place
as a result of this.

This cleaned things up as the policy code could be contained in this function
nicely.  I still kept the two functions for the alloc_min and fallback_pct
tunables in their own functions, as I didn't want to clutter up this function
with the policy logic (however little).

Patch set based off Linux 4.12 (commit
6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c) located @
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git; I also have a clean
patch set for Linus' tree (@ 4.14-rc2) if that's preferred.

Richard Wareing (3):
  xfs: Show realtime device stats on statfs calls if inherit flag set
  xfs: Set realtime flag based on initial allocation size
  xfs: Add realtime fallback if data device full

 fs/xfs/xfs_bmap_util.c |  2 ++
 fs/xfs/xfs_fsops.c     |  2 ++
 fs/xfs/xfs_inode.c     | 18 +++++++----
 fs/xfs/xfs_iomap.c     |  5 +++
 fs/xfs/xfs_linux.h     |  2 ++
 fs/xfs/xfs_mount.c     | 24 +++++++++++++++
 fs/xfs/xfs_mount.h     |  9 ++++++
 fs/xfs/xfs_rtalloc.c   | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_rtalloc.h   |  2 ++
 fs/xfs/xfs_super.c     |  8 +++++
 fs/xfs/xfs_sysfs.c     | 76 ++++++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 224 insertions(+), 6 deletions(-)

-- 
2.9.5


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set
  2017-09-25 19:44 [PATCH v5 0/3] XFS realtime device tweaks Richard Wareing
@ 2017-09-25 19:44 ` Richard Wareing
  2017-09-25 22:53   ` Eric Sandeen
  2017-09-25 22:55   ` Darrick J. Wong
  2017-09-25 19:44 ` [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size Richard Wareing
  2017-09-25 19:44 ` [PATCH v5 3/3] xfs: Add realtime fallback if data device full Richard Wareing
  2 siblings, 2 replies; 13+ messages in thread
From: Richard Wareing @ 2017-09-25 19:44 UTC (permalink / raw)
  To: linux-xfs; +Cc: david, darrick.wong, hch

- Reports realtime device free blocks in statfs calls if inheritance
  bit is set on the inode of directory.  This is a bit more intuitive,
  especially for use-cases which are using a much larger device for
  the realtime device.
- Add XFS_IS_REALTIME_MOUNT option to gate based on the existence of a
  realtime device on the mount, similar to the XFS_IS_REALTIME_INODE
  option.

Signed-off-by: Richard Wareing <rwareing@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
Changes since v4:
* None

Changes since v3:
* Fixed accounting bug, we are not required to substract m_alloc_set_aside
  as this is a data device only requirement.
* Added XFS_IS_REALTIME_MOUNT macro based on learnings from CVE-2017-14340,
  now provides similar gating on the mount as XFS_IS_REALTIME_INODE does
  for the inode.

Changes since v2:
* Style updated per Christoph Hellwig's comment
* Fixed bug: statp->f_bavail = statp->f_bfree

 fs/xfs/xfs_linux.h | 2 ++
 fs/xfs/xfs_super.c | 8 ++++++++
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 044fb0e..fe46e71 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -280,8 +280,10 @@ static inline __uint64_t howmany_64(__uint64_t x, __uint32_t y)
 
 #ifdef CONFIG_XFS_RT
 #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & XFS_DIFLAG_REALTIME)
+#define XFS_IS_REALTIME_MOUNT(mp) ((mp)->m_rtdev_targp ? 1 : 0)
 #else
 #define XFS_IS_REALTIME_INODE(ip) (0)
+#define XFS_IS_REALTIME_MOUNT(mp) (0)
 #endif
 
 #endif /* __XFS_LINUX__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575..6d33a5e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1136,6 +1136,14 @@ xfs_fs_statfs(
 	    ((mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))) ==
 			      (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))
 		xfs_qm_statvfs(ip, statp);
+
+	if (XFS_IS_REALTIME_MOUNT(mp) &&
+	    (ip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)) {
+		statp->f_blocks = sbp->sb_rblocks;
+		statp->f_bavail = statp->f_bfree =
+			sbp->sb_frextents * sbp->sb_rextsize;
+	}
+
 	return 0;
 }
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size
  2017-09-25 19:44 [PATCH v5 0/3] XFS realtime device tweaks Richard Wareing
  2017-09-25 19:44 ` [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set Richard Wareing
@ 2017-09-25 19:44 ` Richard Wareing
  2017-09-25 22:47   ` Darrick J. Wong
  2017-09-26  0:13   ` Eric Sandeen
  2017-09-25 19:44 ` [PATCH v5 3/3] xfs: Add realtime fallback if data device full Richard Wareing
  2 siblings, 2 replies; 13+ messages in thread
From: Richard Wareing @ 2017-09-25 19:44 UTC (permalink / raw)
  To: linux-xfs; +Cc: david, darrick.wong, hch

- The rt_alloc_min sysfs option automatically selects the device (data
  device, or realtime) based on the size of the initial allocation of the
  file.
- This option can be used to route the storage of small files (and the
  inefficient workloads associated with them) to a suitable storage
  device such a SSD, while larger allocations are sent to a traditional
  HDD.
- Supports writes via O_DIRECT, buffered (i.e. page cache), and
  pre-allocations (i.e. fallocate)
- Available only when kernel is compiled w/ CONFIG_XFS_RT option.

Signed-off-by: Richard Wareing <rwareing@fb.com>
---
Changes since v4:
* Added xfs_inode_select_target function to hold target selection
  code
* XFS_IS_REALTIME_MOUNT check now moved inside xfs_inode_select_target
  function for better gating
* Improved consistency in the sysfs set behavior
* Style fixes

Changes since v3:
* Now functions via initial allocation regardless of O_DIRECT, buffered or
  pre-allocation code paths.  Provides a consistent user-experience.
* I Did do some experiments putting this in the xfs_bmapi_write code path
  however pre-allocation accounting unfortunately prevents this cleaner
  approach.  As such, this proved to be the cleanest and functional approach.
* No longer a mount option, now a sysfs tunable

 fs/xfs/xfs_bmap_util.c |  2 ++
 fs/xfs/xfs_inode.c     | 18 ++++++++++++------
 fs/xfs/xfs_iomap.c     |  5 +++++
 fs/xfs/xfs_mount.h     |  1 +
 fs/xfs/xfs_rtalloc.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_rtalloc.h   |  2 ++
 fs/xfs/xfs_sysfs.c     | 38 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 9e3cc21..8205669d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1026,6 +1026,8 @@ xfs_alloc_file_space(
 	if (len <= 0)
 		return -EINVAL;
 
+	xfs_inode_select_target(ip, len);
+
 	rt = XFS_IS_REALTIME_INODE(ip);
 	extsz = xfs_get_extsz_hint(ip);
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ec9826c..f9e2deb 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1620,12 +1620,18 @@ xfs_itruncate_extents(
 	if (error)
 		goto out;
 
-	/*
-	 * Clear the reflink flag if we truncated everything.
-	 */
-	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
-		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		xfs_inode_clear_cowblocks_tag(ip);
+	if (ip->i_d.di_nblocks == 0) {
+		/*
+		 * Clear the reflink flag if we truncated everything.
+		 */
+		if (xfs_is_reflink_inode(ip)) {
+			ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+			xfs_inode_clear_cowblocks_tag(ip);
+		}
+		/* Clear realtime flag if m_rt_alloc_min policy is in place */
+		if (XFS_IS_REALTIME_MOUNT(mp) && mp->m_rt_alloc_min) {
+			ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
+		}
 	}
 
 	/*
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 94e5bdf..b3c3b9b 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -40,6 +40,7 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "xfs_rtalloc.h"
 
 
 #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
@@ -174,6 +175,8 @@ xfs_iomap_write_direct(
 	int		bmapi_flags = XFS_BMAPI_PREALLOC;
 	uint		tflags = 0;
 
+	xfs_inode_select_target(ip, count);
+
 	rt = XFS_IS_REALTIME_INODE(ip);
 	extsz = xfs_get_extsz_hint(ip);
 	lockmode = XFS_ILOCK_SHARED;	/* locked by caller */
@@ -981,6 +984,8 @@ xfs_file_iomap_begin(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	xfs_inode_select_target(ip, length);
+
 	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
 			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
 		/* Reserve delalloc blocks for regular writeback. */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9fa312a..2adc701 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -197,6 +197,7 @@ typedef struct xfs_mount {
 	__uint32_t		m_generation;
 
 	bool			m_fail_unmount;
+	uint			m_rt_alloc_min; /* Min RT allocation */
 #ifdef DEBUG
 	/*
 	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index c57aa7f..421f860 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1284,3 +1284,53 @@ xfs_rtpick_extent(
 	*pick = b;
 	return 0;
 }
+
+/*
+ * If allocation length is less than rt_alloc_min threshold select the
+ * data device.   Otherwise, select the realtime device.
+ */
+void xfs_rt_alloc_min(
+	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
+	xfs_off_t			len)
+{
+	if (!mp->m_rt_alloc_min)
+		return;
+
+	if (len < mp->m_rt_alloc_min) {
+		ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
+	} else {
+		ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
+	}
+}
+
+/*
+* Select the target device for the inode based on either the size of the
+* initial allocation, or the amount of space available on the data device.
+*
+*/
+void xfs_inode_select_target(
+	struct xfs_inode	*ip,
+	xfs_off_t			len)
+{
+	struct xfs_mount    *mp = ip->i_mount;
+
+	/* If the mount does not have a realtime device configured, there's
+	 * nothing to do here.
+	 */
+	if (!XFS_IS_REALTIME_MOUNT(mp))
+		return;
+
+	/* You cannot select a new device target once blocks have been allocated
+	 * (e.g. fallocate() beyond EOF), or if data has been written already.
+	 */
+	if (ip->i_d.di_nextents)
+		return;
+	if (ip->i_d.di_size)
+		return;
+
+	/* m_rt_alloc_min controls target selection.  Target selection code is
+	 * not valid if not set.
+	 */
+	xfs_rt_alloc_min(mp, ip, len);
+}
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index f13133e..eaf7ed3 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -136,6 +136,7 @@ int xfs_rtalloc_query_range(struct xfs_trans *tp,
 int xfs_rtalloc_query_all(struct xfs_trans *tp,
 			  xfs_rtalloc_query_range_fn fn,
 			  void *priv);
+void xfs_inode_select_target(struct xfs_inode *ip, xfs_off_t len);
 #else
 # define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb)    (ENOSYS)
 # define xfs_rtfree_extent(t,b,l)                       (ENOSYS)
@@ -155,6 +156,7 @@ xfs_rtmount_init(
 }
 # define xfs_rtmount_inodes(m)  (((mp)->m_sb.sb_rblocks == 0)? 0 : (ENOSYS))
 # define xfs_rtunmount_inodes(m)
+# define xfs_inode_select_target(i,l)
 #endif	/* CONFIG_XFS_RT */
 
 #endif	/* __XFS_RTALLOC_H__ */
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index 80ac15f..1e202a1 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -129,10 +129,48 @@ XFS_SYSFS_ATTR_RW(drop_writes);
 
 #endif /* DEBUG */
 
+STATIC ssize_t
+rt_alloc_min_store(
+	struct kobject			*kobject,
+	const char				*buf,
+	size_t					count)
+{
+	struct xfs_mount		*mp = to_mp(kobject);
+	int						ret;
+	int						val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	/* Only valid if using a real-time device */
+	if(!XFS_IS_REALTIME_MOUNT(mp))
+		return -EINVAL;
+
+	if (val >= 0)
+		mp->m_rt_alloc_min = val;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+STATIC ssize_t
+rt_alloc_min_show(
+	struct kobject          *kobject,
+	char                    *buf)
+{
+	struct xfs_mount        *mp = to_mp(kobject);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", mp->m_rt_alloc_min);
+}
+XFS_SYSFS_ATTR_RW(rt_alloc_min);
+
 static struct attribute *xfs_mp_attrs[] = {
 #ifdef DEBUG
 	ATTR_LIST(drop_writes),
 #endif
+	ATTR_LIST(rt_alloc_min),
 	NULL,
 };
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v5 3/3] xfs: Add realtime fallback if data device full
  2017-09-25 19:44 [PATCH v5 0/3] XFS realtime device tweaks Richard Wareing
  2017-09-25 19:44 ` [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set Richard Wareing
  2017-09-25 19:44 ` [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size Richard Wareing
@ 2017-09-25 19:44 ` Richard Wareing
  2017-09-25 22:52   ` Darrick J. Wong
  2 siblings, 1 reply; 13+ messages in thread
From: Richard Wareing @ 2017-09-25 19:44 UTC (permalink / raw)
  To: linux-xfs; +Cc: david, darrick.wong, hch

- For FSes which have a realtime device configured, rt_fallback_pct forces
  allocations to the realtime device after data device usage reaches
  rt_fallback_pct.
- Useful for realtime device users to help prevent ENOSPC errors when
  selectively storing some files (e.g. small files) on data device, while
  others are stored on realtime block device.
- Set via the "rt_fallback_pct" sysfs value which is available if
  the kernel is compiled with CONFIG_XFS_RT.

Signed-off-by: Richard Wareing <rwareing@fb.com>
---
Changes since v4:
* Refactored to align with xfs_inode_select_target change
* Fallback percentage reworked to trigger on % space used on data device.
  I find this a bit more intuitive as it aligns well with "df" output.
* mp->m_rt_min_fdblocks now assigned via function call
* Better consistency on sysfs options

Changes since v3:
* None, new patch to patch set

 fs/xfs/xfs_fsops.c   |  2 ++
 fs/xfs/xfs_mount.c   | 24 ++++++++++++++++++++++++
 fs/xfs/xfs_mount.h   |  8 ++++++++
 fs/xfs/xfs_rtalloc.c | 32 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_sysfs.c   | 38 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 104 insertions(+)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 6ccaae9..80ccb14 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -610,6 +610,8 @@ xfs_growfs_data_private(
 	xfs_set_low_space_thresholds(mp);
 	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 
+	mp->m_rt_min_free_dblocks = xfs_rt_calc_min_free_dblocks(mp);
+
 	/*
 	 * If we expanded the last AG, free the per-AG reservation
 	 * so we can reinitialize it with the new size.
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 2eaf818..e8bae5e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1396,3 +1396,27 @@ xfs_dev_is_read_only(
 	}
 	return 0;
 }
+
+/*
+ * precalculate minimum of data blocks required, if we fall
+ * below this value, we will fallback to the real-time device.
+ *
+ * m_rt_fallback_pct can only be non-zero if a real-time device
+ * is configured.
+ */
+uint64_t
+xfs_rt_calc_min_free_dblocks(
+	struct xfs_mount	*mp)
+{
+	uint64_t	min_free_dblocks = 0;
+
+	if (!XFS_IS_REALTIME_MOUNT(mp))
+		return 0;
+
+	/* Pre-compute minimum data blocks required before
+	 * falling back to RT device for allocations
+	 */
+	min_free_dblocks = mp->m_sb.sb_dblocks * (100 - mp->m_rt_fallback_pct);
+	do_div(min_free_dblocks, 100);
+	return min_free_dblocks;
+}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 2adc701..74dcdc3 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -198,6 +198,13 @@ typedef struct xfs_mount {
 
 	bool			m_fail_unmount;
 	uint			m_rt_alloc_min; /* Min RT allocation */
+	uint32_t		m_rt_fallback_pct; /* Fallback to realtime device if data
+										* device usage above rt_fallback_pct
+										*/
+	uint64_t		m_rt_min_free_dblocks; /* Use realtime device if free data
+											* device blocks falls below this;
+											* computed from m_rt_fallback_pct.
+											*/
 #ifdef DEBUG
 	/*
 	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
@@ -463,4 +470,5 @@ int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
 struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
 		int error_class, int error);
 
+uint64_t	xfs_rt_calc_min_free_dblocks(struct xfs_mount *mp);
 #endif	/* __XFS_MOUNT_H__ */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 421f860..c197b95 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1305,6 +1305,35 @@ void xfs_rt_alloc_min(
 }
 
 /*
+ * m_rt_min_free_dblocks is a pre-computed threshold, which controls target
+ * selection based on how many free blocks are available on the data device.
+ *
+ * If the number of free data device blocks falls below
+ * mp->m_rt_min_free_dblocks, the realtime device is selected as the target
+ * device.  If this value is not set, this target policy is in-active.
+ *
+ */
+void xfs_rt_min_free_dblocks(
+	struct xfs_mount	*mp,
+	struct xfs_inode	*ip,
+	xfs_off_t			len)
+{
+	/* Disabled */
+	if (!mp->m_rt_fallback_pct)
+		return;
+
+	/* If inode target is already realtime device, nothing to do here */
+	if(!XFS_IS_REALTIME_INODE(ip)) {
+		uint64_t	free_dblocks;
+		free_dblocks = percpu_counter_sum(&mp->m_fdblocks) -
+			mp->m_alloc_set_aside;
+		if (free_dblocks < mp->m_rt_min_free_dblocks) {
+			ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
+		}
+	}
+}
+
+/*
 * Select the target device for the inode based on either the size of the
 * initial allocation, or the amount of space available on the data device.
 *
@@ -1333,4 +1362,7 @@ void xfs_inode_select_target(
 	 * not valid if not set.
 	 */
 	xfs_rt_alloc_min(mp, ip, len);
+
+	/* Select target based on remaining free blocks on data device */
+	xfs_rt_min_free_dblocks(mp, ip, len);
 }
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index 1e202a1..889b006 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -166,11 +166,49 @@ rt_alloc_min_show(
 }
 XFS_SYSFS_ATTR_RW(rt_alloc_min);
 
+STATIC ssize_t
+rt_fallback_pct_store(
+	struct kobject			*kobject,
+	const char				*buf,
+	size_t					count)
+{
+	struct xfs_mount		*mp = to_mp(kobject);
+	int						ret;
+	int						val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (!XFS_IS_REALTIME_MOUNT(mp))
+		return -EINVAL;
+
+	if (val < 0)
+		return -EINVAL;
+
+	/* Only valid if using a real-time device */
+	mp->m_rt_fallback_pct = val;
+	mp->m_rt_min_free_dblocks = xfs_rt_calc_min_free_dblocks(mp);
+	return count;
+}
+
+STATIC ssize_t
+rt_fallback_pct_show(
+	struct kobject          *kobject,
+	char                    *buf)
+{
+	struct xfs_mount        *mp = to_mp(kobject);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", mp->m_rt_fallback_pct);
+}
+XFS_SYSFS_ATTR_RW(rt_fallback_pct);
+
 static struct attribute *xfs_mp_attrs[] = {
 #ifdef DEBUG
 	ATTR_LIST(drop_writes),
 #endif
 	ATTR_LIST(rt_alloc_min),
+	ATTR_LIST(rt_fallback_pct),
 	NULL,
 };
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size
  2017-09-25 19:44 ` [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size Richard Wareing
@ 2017-09-25 22:47   ` Darrick J. Wong
  2017-09-26  5:25     ` Dave Chinner
  2017-09-26  6:11     ` Richard Wareing
  2017-09-26  0:13   ` Eric Sandeen
  1 sibling, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-09-25 22:47 UTC (permalink / raw)
  To: Richard Wareing; +Cc: linux-xfs, david, hch

On Mon, Sep 25, 2017 at 12:44:17PM -0700, Richard Wareing wrote:
> - The rt_alloc_min sysfs option automatically selects the device (data
>   device, or realtime) based on the size of the initial allocation of the
>   file.
> - This option can be used to route the storage of small files (and the
>   inefficient workloads associated with them) to a suitable storage
>   device such a SSD, while larger allocations are sent to a traditional
>   HDD.
> - Supports writes via O_DIRECT, buffered (i.e. page cache), and
>   pre-allocations (i.e. fallocate)
> - Available only when kernel is compiled w/ CONFIG_XFS_RT option.
> 
> Signed-off-by: Richard Wareing <rwareing@fb.com>
> ---
> Changes since v4:
> * Added xfs_inode_select_target function to hold target selection
>   code
> * XFS_IS_REALTIME_MOUNT check now moved inside xfs_inode_select_target
>   function for better gating
> * Improved consistency in the sysfs set behavior
> * Style fixes
> 
> Changes since v3:
> * Now functions via initial allocation regardless of O_DIRECT, buffered or
>   pre-allocation code paths.  Provides a consistent user-experience.
> * I Did do some experiments putting this in the xfs_bmapi_write code path
>   however pre-allocation accounting unfortunately prevents this cleaner
>   approach.  As such, this proved to be the cleanest and functional approach.
> * No longer a mount option, now a sysfs tunable

I'm still struggling with the last two patches in this series.
Currently the decision to set or clear a file's RT flag rests with
whoever has write access to the file; and whatever they set stays that
way throughout the life of the file.

These patches add new behaviors to the filesystem, namely that initial
allocations and certain truncate operations can set or clear the
realtime flag, and that this happens without any particular notification
for whatever might've set that flag.  I imagine that programs that set
the realtime flag might be a little surprised when it doesn't stay set,
since we never used to do that.

At least the new behaviors are hidden behind a sysfs knob, but a big
thing missing from this series is a document for admins explaining that
we've added these various knobs to facilitate automatic tiering of files
on an XFS filesystem, what the knobs do, and which operations can change
a file's tier.  I also worry that we have no means to distinguish a file
that has had the rt flag set by the user from a file with automatic rt
flag management.

I think I may have mislead you a bit on the v1 patches when I objected
to the new fallocate behaviors -- my biggest concern back then (and now)
is the behavioral changes and how to help our existing users fit that
into their mental models of how XFS works.  Is the size of the first
non-fallocate write to a file sufficient to decide the rtflag?

TLDR: Is this how we (XFS community) want to present tiered XFS to users?

(Off in the crazy space that is my head we could just upload BPF
programs into XFS that would set and manage all the hints that can be
attached to an inode, along with either an "automanaged inode" flag that
locks out manual control or some kind of note that a particular setting
overrides the automated control.)

There are several mechanical problems too, which I'll cover below.

> 
>  fs/xfs/xfs_bmap_util.c |  2 ++
>  fs/xfs/xfs_inode.c     | 18 ++++++++++++------
>  fs/xfs/xfs_iomap.c     |  5 +++++
>  fs/xfs/xfs_mount.h     |  1 +
>  fs/xfs/xfs_rtalloc.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_rtalloc.h   |  2 ++
>  fs/xfs/xfs_sysfs.c     | 38 ++++++++++++++++++++++++++++++++++++++
>  7 files changed, 110 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 9e3cc21..8205669d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1026,6 +1026,8 @@ xfs_alloc_file_space(
>  	if (len <= 0)
>  		return -EINVAL;
>  
> +	xfs_inode_select_target(ip, len);
> +
>  	rt = XFS_IS_REALTIME_INODE(ip);
>  	extsz = xfs_get_extsz_hint(ip);
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ec9826c..f9e2deb 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1620,12 +1620,18 @@ xfs_itruncate_extents(
>  	if (error)
>  		goto out;
>  
> -	/*
> -	 * Clear the reflink flag if we truncated everything.
> -	 */
> -	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
> -		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> -		xfs_inode_clear_cowblocks_tag(ip);

FYI this chunk is going to change when some of the patches I have queued
up for -rc3 go upstream.

> +	if (ip->i_d.di_nblocks == 0) {
> +		/*
> +		 * Clear the reflink flag if we truncated everything.
> +		 */
> +		if (xfs_is_reflink_inode(ip)) {
> +			ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> +			xfs_inode_clear_cowblocks_tag(ip);
> +		}
> +		/* Clear realtime flag if m_rt_alloc_min policy is in place */
> +		if (XFS_IS_REALTIME_MOUNT(mp) && mp->m_rt_alloc_min) {
> +			ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;

Why clear the rt flag from the file if we truncate the file and no
blocks are left, but not if (say) we punch out every block in the file?

Say we set rt_alloc_min = 64k and then pwrite 1mb at 10mb so that the rt
flag is automatically set.  Then we truncate the file to 9mb.  The file
size is 9MB, it has no blocks, and we auto-clear the rt flag?

Yet below I see that the auto-set function bails out if the /size/ is
greater than zero.

This seems inconsistent to me, why don't we also require the size to be
zero to clear the rt flag?  What if I set the rt flag myself and later
it gets cleared out automatically?  That's a new behavior that quietly
happens behind my back.

> +		}
>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 94e5bdf..b3c3b9b 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -40,6 +40,7 @@
>  #include "xfs_dquot_item.h"
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
> +#include "xfs_rtalloc.h"
>  
>  
>  #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
> @@ -174,6 +175,8 @@ xfs_iomap_write_direct(
>  	int		bmapi_flags = XFS_BMAPI_PREALLOC;
>  	uint		tflags = 0;
>  
> +	xfs_inode_select_target(ip, count);
> +
>  	rt = XFS_IS_REALTIME_INODE(ip);
>  	extsz = xfs_get_extsz_hint(ip);
>  	lockmode = XFS_ILOCK_SHARED;	/* locked by caller */
> @@ -981,6 +984,8 @@ xfs_file_iomap_begin(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> +	xfs_inode_select_target(ip, length);

iomap_begin can get called for more than just file writes.
Don't change the rt flag because someone FIEMAP'd a zero-length file.

> +
>  	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
>  			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>  		/* Reserve delalloc blocks for regular writeback. */
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 9fa312a..2adc701 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -197,6 +197,7 @@ typedef struct xfs_mount {
>  	__uint32_t		m_generation;
>  
>  	bool			m_fail_unmount;
> +	uint			m_rt_alloc_min; /* Min RT allocation */

Bytes?  Do we need xfs_off_t here?

>  #ifdef DEBUG
>  	/*
>  	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index c57aa7f..421f860 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -1284,3 +1284,53 @@ xfs_rtpick_extent(
>  	*pick = b;
>  	return 0;
>  }
> +
> +/*
> + * If allocation length is less than rt_alloc_min threshold select the
> + * data device.   Otherwise, select the realtime device.
> + */
> +void xfs_rt_alloc_min(
> +	struct xfs_mount	*mp,
> +	struct xfs_inode	*ip,
> +	xfs_off_t			len)
> +{

Indenting...

void
xfs_rt_alloc_min(
	struct xfs_mount	*mp,
	struct xfs_inode	*ip,
	xfs_off_t		len)
{
	...
}

> +	if (!mp->m_rt_alloc_min)
> +		return;
> +
> +	if (len < mp->m_rt_alloc_min) {
> +		ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
> +	} else {
> +		ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
> +	}

Wait, xfs_rt_alloc_min updates inode flags??  Don't we need a
transaction and a xfs_trans_log_inode for this?  Since this is a
helper for the next function, the same question applies to it.

> +}
> +
> +/*
> +* Select the target device for the inode based on either the size of the
> +* initial allocation, or the amount of space available on the data device.
> +*
> +*/
> +void xfs_inode_select_target(
> +	struct xfs_inode	*ip,
> +	xfs_off_t			len)

Indenting again, and ... this is an inode function, shouldn't it go in
xfs_inode.c ?

And no, you can't just change the incore inode flags without a
transaction and without the ilock.

> +{
> +	struct xfs_mount    *mp = ip->i_mount;
> +
> +	/* If the mount does not have a realtime device configured, there's
> +	 * nothing to do here.
> +	 */
> +	if (!XFS_IS_REALTIME_MOUNT(mp))
> +		return;
> +
> +	/* You cannot select a new device target once blocks have been allocated
> +	 * (e.g. fallocate() beyond EOF), or if data has been written already.
> +	 */
> +	if (ip->i_d.di_nextents)
> +		return;
> +	if (ip->i_d.di_size)
> +		return;
> +
> +	/* m_rt_alloc_min controls target selection.  Target selection code is
> +	 * not valid if not set.
> +	 */
> +	xfs_rt_alloc_min(mp, ip, len);

If the helper functions are only going to be a few lines long and used
once, just put them directly in the function here.

> +}
> diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
> index f13133e..eaf7ed3 100644
> --- a/fs/xfs/xfs_rtalloc.h
> +++ b/fs/xfs/xfs_rtalloc.h
> @@ -136,6 +136,7 @@ int xfs_rtalloc_query_range(struct xfs_trans *tp,
>  int xfs_rtalloc_query_all(struct xfs_trans *tp,
>  			  xfs_rtalloc_query_range_fn fn,
>  			  void *priv);
> +void xfs_inode_select_target(struct xfs_inode *ip, xfs_off_t len);
>  #else
>  # define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb)    (ENOSYS)
>  # define xfs_rtfree_extent(t,b,l)                       (ENOSYS)
> @@ -155,6 +156,7 @@ xfs_rtmount_init(
>  }
>  # define xfs_rtmount_inodes(m)  (((mp)->m_sb.sb_rblocks == 0)? 0 : (ENOSYS))
>  # define xfs_rtunmount_inodes(m)
> +# define xfs_inode_select_target(i,l)
>  #endif	/* CONFIG_XFS_RT */
>  
>  #endif	/* __XFS_RTALLOC_H__ */
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index 80ac15f..1e202a1 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -129,10 +129,48 @@ XFS_SYSFS_ATTR_RW(drop_writes);
>  
>  #endif /* DEBUG */
>  
> +STATIC ssize_t
> +rt_alloc_min_store(
> +	struct kobject			*kobject,
> +	const char				*buf,
> +	size_t					count)
> +{
> +	struct xfs_mount		*mp = to_mp(kobject);
> +	int						ret;
> +	int						val;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	/* Only valid if using a real-time device */
> +	if(!XFS_IS_REALTIME_MOUNT(mp))
> +		return -EINVAL;
> +
> +	if (val >= 0)
> +		mp->m_rt_alloc_min = val;
> +	else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +STATIC ssize_t
> +rt_alloc_min_show(
> +	struct kobject          *kobject,
> +	char                    *buf)
> +{
> +	struct xfs_mount        *mp = to_mp(kobject);
> +
> +	return snprintf(buf, PAGE_SIZE, "%d\n", mp->m_rt_alloc_min);
> +}
> +XFS_SYSFS_ATTR_RW(rt_alloc_min);
> +
>  static struct attribute *xfs_mp_attrs[] = {
>  #ifdef DEBUG
>  	ATTR_LIST(drop_writes),
>  #endif
> +	ATTR_LIST(rt_alloc_min),
>  	NULL,
>  };
>  
> -- 
> 2.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 3/3] xfs: Add realtime fallback if data device full
  2017-09-25 19:44 ` [PATCH v5 3/3] xfs: Add realtime fallback if data device full Richard Wareing
@ 2017-09-25 22:52   ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-09-25 22:52 UTC (permalink / raw)
  To: Richard Wareing; +Cc: linux-xfs, david, hch

On Mon, Sep 25, 2017 at 12:44:18PM -0700, Richard Wareing wrote:
> - For FSes which have a realtime device configured, rt_fallback_pct forces
>   allocations to the realtime device after data device usage reaches
>   rt_fallback_pct.
> - Useful for realtime device users to help prevent ENOSPC errors when
>   selectively storing some files (e.g. small files) on data device, while
>   others are stored on realtime block device.
> - Set via the "rt_fallback_pct" sysfs value which is available if
>   the kernel is compiled with CONFIG_XFS_RT.
> 
> Signed-off-by: Richard Wareing <rwareing@fb.com>
> ---
> Changes since v4:
> * Refactored to align with xfs_inode_select_target change
> * Fallback percentage reworked to trigger on % space used on data device.
>   I find this a bit more intuitive as it aligns well with "df" output.
> * mp->m_rt_min_fdblocks now assigned via function call
> * Better consistency on sysfs options
> 
> Changes since v3:
> * None, new patch to patch set
> 
>  fs/xfs/xfs_fsops.c   |  2 ++
>  fs/xfs/xfs_mount.c   | 24 ++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h   |  8 ++++++++
>  fs/xfs/xfs_rtalloc.c | 32 ++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_sysfs.c   | 38 ++++++++++++++++++++++++++++++++++++++
>  5 files changed, 104 insertions(+)
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 6ccaae9..80ccb14 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -610,6 +610,8 @@ xfs_growfs_data_private(
>  	xfs_set_low_space_thresholds(mp);
>  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>  
> +	mp->m_rt_min_free_dblocks = xfs_rt_calc_min_free_dblocks(mp);
> +
>  	/*
>  	 * If we expanded the last AG, free the per-AG reservation
>  	 * so we can reinitialize it with the new size.
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 2eaf818..e8bae5e 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1396,3 +1396,27 @@ xfs_dev_is_read_only(
>  	}
>  	return 0;
>  }
> +
> +/*
> + * precalculate minimum of data blocks required, if we fall
> + * below this value, we will fallback to the real-time device.
> + *
> + * m_rt_fallback_pct can only be non-zero if a real-time device
> + * is configured.
> + */
> +uint64_t
> +xfs_rt_calc_min_free_dblocks(
> +	struct xfs_mount	*mp)
> +{
> +	uint64_t	min_free_dblocks = 0;
> +
> +	if (!XFS_IS_REALTIME_MOUNT(mp))
> +		return 0;
> +
> +	/* Pre-compute minimum data blocks required before
> +	 * falling back to RT device for allocations
> +	 */
> +	min_free_dblocks = mp->m_sb.sb_dblocks * (100 - mp->m_rt_fallback_pct);
> +	do_div(min_free_dblocks, 100);
> +	return min_free_dblocks;
> +}
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 2adc701..74dcdc3 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -198,6 +198,13 @@ typedef struct xfs_mount {
>  
>  	bool			m_fail_unmount;
>  	uint			m_rt_alloc_min; /* Min RT allocation */
> +	uint32_t		m_rt_fallback_pct; /* Fallback to realtime device if data
> +										* device usage above rt_fallback_pct
> +										*/

uint?  Surely we'll never see a fallback_pct > 100...

> +	uint64_t		m_rt_min_free_dblocks; /* Use realtime device if free data

xfs_rfsblock_t, since that's what we use for blocks used?

> +											* device blocks falls below this;
> +											* computed from m_rt_fallback_pct.
> +											*/

These comments could go above the field, rather than being jammed
together at the right edge of the screen.

>  #ifdef DEBUG
>  	/*
>  	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
> @@ -463,4 +470,5 @@ int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
>  struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
>  		int error_class, int error);
>  
> +uint64_t	xfs_rt_calc_min_free_dblocks(struct xfs_mount *mp);
>  #endif	/* __XFS_MOUNT_H__ */
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 421f860..c197b95 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -1305,6 +1305,35 @@ void xfs_rt_alloc_min(
>  }
>  
>  /*
> + * m_rt_min_free_dblocks is a pre-computed threshold, which controls target
> + * selection based on how many free blocks are available on the data device.
> + *
> + * If the number of free data device blocks falls below
> + * mp->m_rt_min_free_dblocks, the realtime device is selected as the target
> + * device.  If this value is not set, this target policy is in-active.
> + *
> + */
> +void xfs_rt_min_free_dblocks(
> +	struct xfs_mount	*mp,
> +	struct xfs_inode	*ip,
> +	xfs_off_t			len)
> +{
> +	/* Disabled */
> +	if (!mp->m_rt_fallback_pct)
> +		return;
> +
> +	/* If inode target is already realtime device, nothing to do here */
> +	if(!XFS_IS_REALTIME_INODE(ip)) {
> +		uint64_t	free_dblocks;

Spacing after the variable, indentation on the if test, indentation on
the overflow line below...

> +		free_dblocks = percpu_counter_sum(&mp->m_fdblocks) -
> +			mp->m_alloc_set_aside;
> +		if (free_dblocks < mp->m_rt_min_free_dblocks) {
> +			ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
> +		}

More of the thing where we set flags but we don't log the inode core?

> +	}
> +}
> +
> +/*
>  * Select the target device for the inode based on either the size of the
>  * initial allocation, or the amount of space available on the data device.
>  *
> @@ -1333,4 +1362,7 @@ void xfs_inode_select_target(
>  	 * not valid if not set.
>  	 */
>  	xfs_rt_alloc_min(mp, ip, len);
> +
> +	/* Select target based on remaining free blocks on data device */
> +	xfs_rt_min_free_dblocks(mp, ip, len);

Ok, /me retracts his comment from the previous reply; these helpers
decide if we set the flag, so why don't they return boolean?  That's
much easier to figure out than a bunch of helpers that may or may not
tweak the realtime flag:

if (xfs_first_alloc_short_enough(...) && !xfs_data_dev_has_free(...))
	ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
else
	ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
xfs_log_inode_core(tp, ip, ...);

>  }
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index 1e202a1..889b006 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -166,11 +166,49 @@ rt_alloc_min_show(
>  }
>  XFS_SYSFS_ATTR_RW(rt_alloc_min);
>  
> +STATIC ssize_t
> +rt_fallback_pct_store(
> +	struct kobject			*kobject,
> +	const char				*buf,
> +	size_t					count)
> +{
> +	struct xfs_mount		*mp = to_mp(kobject);
> +	int						ret;
> +	int						val;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	if (!XFS_IS_REALTIME_MOUNT(mp))
> +		return -EINVAL;
> +
> +	if (val < 0)
> +		return -EINVAL;
> +
> +	/* Only valid if using a real-time device */
> +	mp->m_rt_fallback_pct = val;

echo 30000000 > rt_fallback_pct is ok?  Why?

--D
> +	mp->m_rt_min_free_dblocks = xfs_rt_calc_min_free_dblocks(mp);
> +	return count;
> +}
> +
> +STATIC ssize_t
> +rt_fallback_pct_show(
> +	struct kobject          *kobject,
> +	char                    *buf)
> +{
> +	struct xfs_mount        *mp = to_mp(kobject);
> +
> +	return snprintf(buf, PAGE_SIZE, "%d\n", mp->m_rt_fallback_pct);
> +}
> +XFS_SYSFS_ATTR_RW(rt_fallback_pct);
> +
>  static struct attribute *xfs_mp_attrs[] = {
>  #ifdef DEBUG
>  	ATTR_LIST(drop_writes),
>  #endif
>  	ATTR_LIST(rt_alloc_min),
> +	ATTR_LIST(rt_fallback_pct),
>  	NULL,
>  };
>  
> -- 
> 2.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set
  2017-09-25 19:44 ` [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set Richard Wareing
@ 2017-09-25 22:53   ` Eric Sandeen
  2017-09-26  3:32     ` Richard Wareing
  2017-09-25 22:55   ` Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2017-09-25 22:53 UTC (permalink / raw)
  To: Richard Wareing, linux-xfs; +Cc: david, darrick.wong, hch

On 9/25/17 2:44 PM, Richard Wareing wrote:
> - Reports realtime device free blocks in statfs calls if inheritance
>   bit is set on the inode of directory.  This is a bit more intuitive,
>   especially for use-cases which are using a much larger device for
>   the realtime device.
> - Add XFS_IS_REALTIME_MOUNT option to gate based on the existence of a
>   realtime device on the mount, similar to the XFS_IS_REALTIME_INODE
>   option.

Sorry for chiming in late.

So, this is a major change in behavior of statfs on xfs.  How will the
user know this, where is it documented?
 
Also, if I read this right, does that mean that you get the magical
new statfs results if you stat a /directory/ with XFS_DIFLAG_RTINHERIT
set, but not if you stat a /file/ with XFS_DIFLAG_REALTIME set?

That seems quite confusing, but maybe I'm missing something obvious.

As for which filesystem is getting reported, would it be totally 
bonkers to have an xfs-realtime f_type to make it 100% clear what's
being reported?

-Eric

> Signed-off-by: Richard Wareing <rwareing@fb.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> ---
> Changes since v4:
> * None
> 
> Changes since v3:
> * Fixed accounting bug, we are not required to substract m_alloc_set_aside
>   as this is a data device only requirement.
> * Added XFS_IS_REALTIME_MOUNT macro based on learnings from CVE-2017-14340,
>   now provides similar gating on the mount as XFS_IS_REALTIME_INODE does
>   for the inode.
> 
> Changes since v2:
> * Style updated per Christoph Hellwig's comment
> * Fixed bug: statp->f_bavail = statp->f_bfree
> 
>  fs/xfs/xfs_linux.h | 2 ++
>  fs/xfs/xfs_super.c | 8 ++++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 044fb0e..fe46e71 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -280,8 +280,10 @@ static inline __uint64_t howmany_64(__uint64_t x, __uint32_t y)
>  
>  #ifdef CONFIG_XFS_RT
>  #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & XFS_DIFLAG_REALTIME)
> +#define XFS_IS_REALTIME_MOUNT(mp) ((mp)->m_rtdev_targp ? 1 : 0)
>  #else
>  #define XFS_IS_REALTIME_INODE(ip) (0)
> +#define XFS_IS_REALTIME_MOUNT(mp) (0)
>  #endif
>  
>  #endif /* __XFS_LINUX__ */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 455a575..6d33a5e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1136,6 +1136,14 @@ xfs_fs_statfs(
>  	    ((mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))) ==
>  			      (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))
>  		xfs_qm_statvfs(ip, statp);
> +
> +	if (XFS_IS_REALTIME_MOUNT(mp) &&
> +	    (ip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)) {
> +		statp->f_blocks = sbp->sb_rblocks;
> +		statp->f_bavail = statp->f_bfree =
> +			sbp->sb_frextents * sbp->sb_rextsize;
> +	}
> +
>  	return 0;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set
  2017-09-25 19:44 ` [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set Richard Wareing
  2017-09-25 22:53   ` Eric Sandeen
@ 2017-09-25 22:55   ` Darrick J. Wong
  1 sibling, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2017-09-25 22:55 UTC (permalink / raw)
  To: Richard Wareing; +Cc: linux-xfs, david, hch

On Mon, Sep 25, 2017 at 12:44:16PM -0700, Richard Wareing wrote:
> - Reports realtime device free blocks in statfs calls if inheritance
>   bit is set on the inode of directory.  This is a bit more intuitive,
>   especially for use-cases which are using a much larger device for
>   the realtime device.
> - Add XFS_IS_REALTIME_MOUNT option to gate based on the existence of a
>   realtime device on the mount, similar to the XFS_IS_REALTIME_INODE
>   option.
> 
> Signed-off-by: Richard Wareing <rwareing@fb.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> ---
> Changes since v4:
> * None
> 
> Changes since v3:
> * Fixed accounting bug, we are not required to substract m_alloc_set_aside
>   as this is a data device only requirement.
> * Added XFS_IS_REALTIME_MOUNT macro based on learnings from CVE-2017-14340,
>   now provides similar gating on the mount as XFS_IS_REALTIME_INODE does
>   for the inode.
> 
> Changes since v2:
> * Style updated per Christoph Hellwig's comment
> * Fixed bug: statp->f_bavail = statp->f_bfree
> 
>  fs/xfs/xfs_linux.h | 2 ++
>  fs/xfs/xfs_super.c | 8 ++++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 044fb0e..fe46e71 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -280,8 +280,10 @@ static inline __uint64_t howmany_64(__uint64_t x, __uint32_t y)
>  
>  #ifdef CONFIG_XFS_RT
>  #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & XFS_DIFLAG_REALTIME)
> +#define XFS_IS_REALTIME_MOUNT(mp) ((mp)->m_rtdev_targp ? 1 : 0)

(I kind of dislike these macros but I tried turning them into static inline
functions once and it turned into a nightmare.)

>  #else
>  #define XFS_IS_REALTIME_INODE(ip) (0)
> +#define XFS_IS_REALTIME_MOUNT(mp) (0)
>  #endif
>  
>  #endif /* __XFS_LINUX__ */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 455a575..6d33a5e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1136,6 +1136,14 @@ xfs_fs_statfs(
>  	    ((mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))) ==
>  			      (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))
>  		xfs_qm_statvfs(ip, statp);
> +
> +	if (XFS_IS_REALTIME_MOUNT(mp) &&
> +	    (ip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)) {

Why don't we do this for any inode with REALTIME or RTINHERIT set?

--D

> +		statp->f_blocks = sbp->sb_rblocks;
> +		statp->f_bavail = statp->f_bfree =
> +			sbp->sb_frextents * sbp->sb_rextsize;
> +	}
> +
>  	return 0;
>  }
>  
> -- 
> 2.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size
  2017-09-25 19:44 ` [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size Richard Wareing
  2017-09-25 22:47   ` Darrick J. Wong
@ 2017-09-26  0:13   ` Eric Sandeen
  2017-09-26  5:17     ` Richard Wareing
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Sandeen @ 2017-09-26  0:13 UTC (permalink / raw)
  To: Richard Wareing, linux-xfs; +Cc: david, darrick.wong, hch

On 9/25/17 2:44 PM, Richard Wareing wrote:
> - The rt_alloc_min sysfs option automatically selects the device (data
>   device, or realtime) based on the size of the initial allocation of the
>   file.
> - This option can be used to route the storage of small files (and the
>   inefficient workloads associated with them) to a suitable storage
>   device such a SSD, while larger allocations are sent to a traditional
>   HDD.
> - Supports writes via O_DIRECT, buffered (i.e. page cache), and
>   pre-allocations (i.e. fallocate)
> - Available only when kernel is compiled w/ CONFIG_XFS_RT option.

This needs documentation in Documentation/filesystems/xfs.txt 
along with all the other sysctls.  We can't just add magical tunables
with no explanation of behavior.  ;)

> Signed-off-by: Richard Wareing <rwareing@fb.com>
> ---
> Changes since v4:
> * Added xfs_inode_select_target function to hold target selection
>   code
> * XFS_IS_REALTIME_MOUNT check now moved inside xfs_inode_select_target
>   function for better gating
> * Improved consistency in the sysfs set behavior
> * Style fixes
> 
> Changes since v3:
> * Now functions via initial allocation regardless of O_DIRECT, buffered or
>   pre-allocation code paths.  Provides a consistent user-experience.
> * I Did do some experiments putting this in the xfs_bmapi_write code path
>   however pre-allocation accounting unfortunately prevents this cleaner
>   approach.  As such, this proved to be the cleanest and functional approach.
> * No longer a mount option, now a sysfs tunable
> 
>  fs/xfs/xfs_bmap_util.c |  2 ++
>  fs/xfs/xfs_inode.c     | 18 ++++++++++++------
>  fs/xfs/xfs_iomap.c     |  5 +++++
>  fs/xfs/xfs_mount.h     |  1 +
>  fs/xfs/xfs_rtalloc.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_rtalloc.h   |  2 ++
>  fs/xfs/xfs_sysfs.c     | 38 ++++++++++++++++++++++++++++++++++++++
>  7 files changed, 110 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 9e3cc21..8205669d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1026,6 +1026,8 @@ xfs_alloc_file_space(
>  	if (len <= 0)
>  		return -EINVAL;
>  
> +	xfs_inode_select_target(ip, len);
> +
>  	rt = XFS_IS_REALTIME_INODE(ip);
>  	extsz = xfs_get_extsz_hint(ip);
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ec9826c..f9e2deb 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1620,12 +1620,18 @@ xfs_itruncate_extents(
>  	if (error)
>  		goto out;
>  
> -	/*
> -	 * Clear the reflink flag if we truncated everything.
> -	 */
> -	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
> -		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> -		xfs_inode_clear_cowblocks_tag(ip);
> +	if (ip->i_d.di_nblocks == 0) {
> +		/*
> +		 * Clear the reflink flag if we truncated everything.
> +		 */
> +		if (xfs_is_reflink_inode(ip)) {
> +			ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
> +			xfs_inode_clear_cowblocks_tag(ip);
> +		}
> +		/* Clear realtime flag if m_rt_alloc_min policy is in place */
> +		if (XFS_IS_REALTIME_MOUNT(mp) && mp->m_rt_alloc_min) {
> +			ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
> +		}

This behavior needs documentation as well.  So now truncation will
clear the realtime flag but only if some /other/ tunable is set?

it used to be that if you set the realtime flag it would stay until
specifically cleared.  now there's magic to clear it if your policy
is in place.  I wonder what the reasoning behind this is?  Apologies
if I missed it in other discussions ...

>  	}
>  
>  	/*
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 94e5bdf..b3c3b9b 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -40,6 +40,7 @@
>  #include "xfs_dquot_item.h"
>  #include "xfs_dquot.h"
>  #include "xfs_reflink.h"
> +#include "xfs_rtalloc.h"
>  
>  
>  #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
> @@ -174,6 +175,8 @@ xfs_iomap_write_direct(
>  	int		bmapi_flags = XFS_BMAPI_PREALLOC;
>  	uint		tflags = 0;
>  
> +	xfs_inode_select_target(ip, count);

I think you're changing on-disk inodes here with no locking
and no transaction ... ?

> +
>  	rt = XFS_IS_REALTIME_INODE(ip);
>  	extsz = xfs_get_extsz_hint(ip);
>  	lockmode = XFS_ILOCK_SHARED;	/* locked by caller */
> @@ -981,6 +984,8 @@ xfs_file_iomap_begin(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> +	xfs_inode_select_target(ip, length);
> +
>  	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
>  			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>  		/* Reserve delalloc blocks for regular writeback. */
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 9fa312a..2adc701 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -197,6 +197,7 @@ typedef struct xfs_mount {
>  	__uint32_t		m_generation;
>  
>  	bool			m_fail_unmount;
> +	uint			m_rt_alloc_min; /* Min RT allocation */

>  #ifdef DEBUG
>  	/*
>  	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index c57aa7f..421f860 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -1284,3 +1284,53 @@ xfs_rtpick_extent(
>  	*pick = b;
>  	return 0;
>  }
> +
> +/*
> + * If allocation length is less than rt_alloc_min threshold select the
> + * data device.   Otherwise, select the realtime device.
> + */
> +void xfs_rt_alloc_min(
> +	struct xfs_mount	*mp,
> +	struct xfs_inode	*ip,
> +	xfs_off_t			len)
> +{
> +	if (!mp->m_rt_alloc_min)
> +		return;
> +
> +	if (len < mp->m_rt_alloc_min) {
> +		ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
> +	} else {
> +		ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
> +	}
> +}

This has only one caller, over in another file.  Why not put it inline in
xfs_inode_select_target?

> +
> +/*
> +* Select the target device for the inode based on either the size of the
> +* initial allocation, or the amount of space available on the data device.
> +*
> +*/
> +void xfs_inode_select_target(
> +	struct xfs_inode	*ip,
> +	xfs_off_t			len)
> +{
> +	struct xfs_mount    *mp = ip->i_mount;

Pls line up these variables on a consistent minimum tab stop, unless it's some
email artifact that has thrown me off :)

> +
> +	/* If the mount does not have a realtime device configured, there's
> +	 * nothing to do here.
> +	 */
> +	if (!XFS_IS_REALTIME_MOUNT(mp))
> +		return;
> +
> +	/* You cannot select a new device target once blocks have been allocated
> +	 * (e.g. fallocate() beyond EOF), or if data has been written already.
> +	 */
> +	if (ip->i_d.di_nextents)
> +		return;
> +	if (ip->i_d.di_size)
> +		return;
> +
> +	/* m_rt_alloc_min controls target selection.  Target selection code is
> +	 * not valid if not set.
> +	 */
> +	xfs_rt_alloc_min(mp, ip, len);
> +}
> diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
> index f13133e..eaf7ed3 100644
> --- a/fs/xfs/xfs_rtalloc.h
> +++ b/fs/xfs/xfs_rtalloc.h
> @@ -136,6 +136,7 @@ int xfs_rtalloc_query_range(struct xfs_trans *tp,
>  int xfs_rtalloc_query_all(struct xfs_trans *tp,
>  			  xfs_rtalloc_query_range_fn fn,
>  			  void *priv);
> +void xfs_inode_select_target(struct xfs_inode *ip, xfs_off_t len);
>  #else
>  # define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb)    (ENOSYS)
>  # define xfs_rtfree_extent(t,b,l)                       (ENOSYS)
> @@ -155,6 +156,7 @@ xfs_rtmount_init(
>  }
>  # define xfs_rtmount_inodes(m)  (((mp)->m_sb.sb_rblocks == 0)? 0 : (ENOSYS))
>  # define xfs_rtunmount_inodes(m)
> +# define xfs_inode_select_target(i,l)
>  #endif	/* CONFIG_XFS_RT */
>  
>  #endif	/* __XFS_RTALLOC_H__ */
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index 80ac15f..1e202a1 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -129,10 +129,48 @@ XFS_SYSFS_ATTR_RW(drop_writes);
>  
>  #endif /* DEBUG */
>  
> +STATIC ssize_t
> +rt_alloc_min_store(> +	struct kobject			*kobject,
> +	const char				*buf,
> +	size_t					count)
> +{
> +	struct xfs_mount		*mp = to_mp(kobject);
> +	int						ret;
> +	int						val;

ditto on the lining up sanely:

+rt_alloc_min_store(
+	struct kobject		*kobject,
+	const char		*buf,
+	size_t			count)
+{
+	struct xfs_mount	*mp = to_mp(kobject);
+	int			ret;
+	int			val;

> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	/* Only valid if using a real-time device */
> +	if(!XFS_IS_REALTIME_MOUNT(mp))
> +		return -EINVAL;
> +
> +	if (val >= 0)
> +		mp->m_rt_alloc_min = val;
> +	else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +STATIC ssize_t
> +rt_alloc_min_show(
> +	struct kobject          *kobject,
> +	char                    *buf)
> +{
> +	struct xfs_mount        *mp = to_mp(kobject);
> +
> +	return snprintf(buf, PAGE_SIZE, "%d\n", mp->m_rt_alloc_min);
> +}
> +XFS_SYSFS_ATTR_RW(rt_alloc_min);

Is this sysctl presented even if CONFIG_XFS_RT is turned off?
Seems like that shouldn't show up if it doesn't do anything...

> +
>  static struct attribute *xfs_mp_attrs[] = {
>  #ifdef DEBUG
>  	ATTR_LIST(drop_writes),
>  #endif
> +	ATTR_LIST(rt_alloc_min),
>  	NULL,
>  };
>  
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set
  2017-09-25 22:53   ` Eric Sandeen
@ 2017-09-26  3:32     ` Richard Wareing
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Wareing @ 2017-09-26  3:32 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs, david, darrick.wong, hch

Eric Sandeen <sandeen@sandeen.net> wrote:

> On 9/25/17 2:44 PM, Richard Wareing wrote:
>> - Reports realtime device free blocks in statfs calls if inheritance
>>   bit is set on the inode of directory.  This is a bit more intuitive,
>>   especially for use-cases which are using a much larger device for
>>   the realtime device.
>> - Add XFS_IS_REALTIME_MOUNT option to gate based on the existence of a
>>   realtime device on the mount, similar to the XFS_IS_REALTIME_INODE
>>   option.
>
> Sorry for chiming in late.
>
> So, this is a major change in behavior of statfs on xfs.  How will the
> user know this, where is it documented?
>
> Also, if I read this right, does that mean that you get the magical
> new statfs results if you stat a /directory/ with XFS_DIFLAG_RTINHERIT
> set, but not if you stat a /file/ with XFS_DIFLAG_REALTIME set?
>
> That seems quite confusing, but maybe I'm missing something obvious.
>
> As for which filesystem is getting reported, would it be totally
> bonkers to have an xfs-realtime f_type to make it 100% clear what's
> being reported?
>
> -Eric
>

Wrt docs, I can write something up.  Any suggestions where the docs should
live?  Man page?

I'm also open to making this a sysfs option, it did cross my mind that
this change in behavior might be jarring (though in a good way ;) ).  For
those folks who wish to use realtime devices to place metadata on an SSD
this behavior is far more intuitive IMHO (as it more closely resembles the
presentation on a traditional FS setup).

You are also correct on the directory behavior.  For files you just
get their actual size on whatever device they happen to be stored on.
There is no change to that behavior.

Wrt new f_type, I'd be expecting a bit of push back on this as it may be
considered an abuse of this interface.

Richard


>> Signed-off-by: Richard Wareing <rwareing@fb.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Reviewed-by: Dave Chinner <dchinner@redhat.com>
>> ---
>> Changes since v4:
>> * None
>>
>> Changes since v3:
>> * Fixed accounting bug, we are not required to substract m_alloc_set_aside
>>   as this is a data device only requirement.
>> * Added XFS_IS_REALTIME_MOUNT macro based on learnings from  
>> CVE-2017-14340,
>>   now provides similar gating on the mount as XFS_IS_REALTIME_INODE does
>>   for the inode.
>>
>> Changes since v2:
>> * Style updated per Christoph Hellwig's comment
>> * Fixed bug: statp->f_bavail = statp->f_bfree
>>
>>  fs/xfs/xfs_linux.h | 2 ++
>>  fs/xfs/xfs_super.c | 8 ++++++++
>>  2 files changed, 10 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
>> index 044fb0e..fe46e71 100644
>> --- a/fs/xfs/xfs_linux.h
>> +++ b/fs/xfs/xfs_linux.h
>> @@ -280,8 +280,10 @@ static inline __uint64_t howmany_64(__uint64_t x,  
>> __uint32_t y)
>>
>>  #ifdef CONFIG_XFS_RT
>>  #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & XFS_DIFLAG_REALTIME)
>> +#define XFS_IS_REALTIME_MOUNT(mp) ((mp)->m_rtdev_targp ? 1 : 0)
>>  #else
>>  #define XFS_IS_REALTIME_INODE(ip) (0)
>> +#define XFS_IS_REALTIME_MOUNT(mp) (0)
>>  #endif
>>
>>  #endif /* __XFS_LINUX__ */
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 455a575..6d33a5e 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -1136,6 +1136,14 @@ xfs_fs_statfs(
>>  	    ((mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))) ==
>>  			      (XFS_PQUOTA_ACCT|XFS_PQUOTA_ENFD))
>>  		xfs_qm_statvfs(ip, statp);
>> +
>> +	if (XFS_IS_REALTIME_MOUNT(mp) &&
>> +	    (ip->i_d.di_flags & XFS_DIFLAG_RTINHERIT)) {
>> +		statp->f_blocks = sbp->sb_rblocks;
>> +		statp->f_bavail = statp->f_bfree =
>> +			sbp->sb_frextents * sbp->sb_rextsize;
>> +	}
>> +
>>  	return 0;
>>  }



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size
  2017-09-26  0:13   ` Eric Sandeen
@ 2017-09-26  5:17     ` Richard Wareing
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Wareing @ 2017-09-26  5:17 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs, david, darrick.wong, hch

Eric Sandeen <sandeen@sandeen.net> wrote:

> On 9/25/17 2:44 PM, Richard Wareing wrote:
>> - The rt_alloc_min sysfs option automatically selects the device (data
>>   device, or realtime) based on the size of the initial allocation of the
>>   file.
>> - This option can be used to route the storage of small files (and the
>>   inefficient workloads associated with them) to a suitable storage
>>   device such a SSD, while larger allocations are sent to a traditional
>>   HDD.
>> - Supports writes via O_DIRECT, buffered (i.e. page cache), and
>>   pre-allocations (i.e. fallocate)
>> - Available only when kernel is compiled w/ CONFIG_XFS_RT option.
>
> This needs documentation in Documentation/filesystems/xfs.txt
> along with all the other sysctls.  We can't just add magical tunables
> with no explanation of behavior.  ;)
>

I'll write it up tomorrow, not a problem.

>> Signed-off-by: Richard Wareing <rwareing@fb.com>
>> ---
>> Changes since v4:
>> * Added xfs_inode_select_target function to hold target selection
>>   code
>> * XFS_IS_REALTIME_MOUNT check now moved inside xfs_inode_select_target
>>   function for better gating
>> * Improved consistency in the sysfs set behavior
>> * Style fixes
>>
>> Changes since v3:
>> * Now functions via initial allocation regardless of O_DIRECT, buffered or
>>   pre-allocation code paths.  Provides a consistent user-experience.
>> * I Did do some experiments putting this in the xfs_bmapi_write code path
>>   however pre-allocation accounting unfortunately prevents this cleaner
>>   approach.  As such, this proved to be the cleanest and functional approach.
>> * No longer a mount option, now a sysfs tunable
>>
>>  fs/xfs/xfs_bmap_util.c |  2 ++
>>  fs/xfs/xfs_inode.c     | 18 ++++++++++++------
>>  fs/xfs/xfs_iomap.c     |  5 +++++
>>  fs/xfs/xfs_mount.h     |  1 +
>>  fs/xfs/xfs_rtalloc.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/xfs/xfs_rtalloc.h   |  2 ++
>>  fs/xfs/xfs_sysfs.c     | 38 ++++++++++++++++++++++++++++++++++++++
>>  7 files changed, 110 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 9e3cc21..8205669d 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1026,6 +1026,8 @@ xfs_alloc_file_space(
>>  	if (len <= 0)
>>  		return -EINVAL;
>>
>> +	xfs_inode_select_target(ip, len);
>> +
>>  	rt = XFS_IS_REALTIME_INODE(ip);
>>  	extsz = xfs_get_extsz_hint(ip);
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index ec9826c..f9e2deb 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1620,12 +1620,18 @@ xfs_itruncate_extents(
>>  	if (error)
>>  		goto out;
>>
>> -	/*
>> -	 * Clear the reflink flag if we truncated everything.
>> -	 */
>> -	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
>> -		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
>> -		xfs_inode_clear_cowblocks_tag(ip);
>> +	if (ip->i_d.di_nblocks == 0) {
>> +		/*
>> +		 * Clear the reflink flag if we truncated everything.
>> +		 */
>> +		if (xfs_is_reflink_inode(ip)) {
>> +			ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
>> +			xfs_inode_clear_cowblocks_tag(ip);
>> +		}
>> +		/* Clear realtime flag if m_rt_alloc_min policy is in place */
>> +		if (XFS_IS_REALTIME_MOUNT(mp) && mp->m_rt_alloc_min) {
>> +			ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
>> +		}
>
> This behavior needs documentation as well.  So now truncation will
> clear the realtime flag but only if some /other/ tunable is set?
>
> it used to be that if you set the realtime flag it would stay until
> specifically cleared.  now there's magic to clear it if your policy
> is in place.  I wonder what the reasoning behind this is?  Apologies
> if I missed it in other discussions ...
>


The reasoning here is that if you are using policy driven realtime
allocation, then you really want these policies deciding where data
should live, and for the policies to work consistently we need to
clear the flag upon truncation.  For those who want to manually decide
where data lives, the behavior will be what they know and love if they
do not enable these features.


>> }
>>
>>  	/*
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 94e5bdf..b3c3b9b 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -40,6 +40,7 @@
>>  #include "xfs_dquot_item.h"
>>  #include "xfs_dquot.h"
>>  #include "xfs_reflink.h"
>> +#include "xfs_rtalloc.h"
>>
>>
>>  #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
>> @@ -174,6 +175,8 @@ xfs_iomap_write_direct(
>>  	int		bmapi_flags = XFS_BMAPI_PREALLOC;
>>  	uint		tflags = 0;
>>
>> +	xfs_inode_select_target(ip, count);
>
> I think you're changing on-disk inodes here with no locking
> and no transaction ... ?
>

You are correct, after chatting with Dave tonight I should be able
to fix this up.  You'll have to bear with me a bit as I'm still learning
this code base :).

The plan is to create/set the XFS_BMAPI_ and then update the inode flag
within a transaction later on (in xfs_bmapi* functions).  The
xfs_bmapi_reserve_delalloc function has to be special cased as there's no
transaction context in there.

I'll update/re-work things tomorrow.


>> +
>>  	rt = XFS_IS_REALTIME_INODE(ip);
>>  	extsz = xfs_get_extsz_hint(ip);
>>  	lockmode = XFS_ILOCK_SHARED;	/* locked by caller */
>> @@ -981,6 +984,8 @@ xfs_file_iomap_begin(
>>  	if (XFS_FORCED_SHUTDOWN(mp))
>>  		return -EIO;
>>
>> +	xfs_inode_select_target(ip, length);
>> +
>>  	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
>>  			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>  		/* Reserve delalloc blocks for regular writeback. */
>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>> index 9fa312a..2adc701 100644
>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -197,6 +197,7 @@ typedef struct xfs_mount {
>>  	__uint32_t		m_generation;
>>
>>  	bool			m_fail_unmount;
>> +	uint			m_rt_alloc_min; /* Min RT allocation */
>
>> #ifdef DEBUG
>>  	/*
>>  	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
>> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
>> index c57aa7f..421f860 100644
>> --- a/fs/xfs/xfs_rtalloc.c
>> +++ b/fs/xfs/xfs_rtalloc.c
>> @@ -1284,3 +1284,53 @@ xfs_rtpick_extent(
>>  	*pick = b;
>>  	return 0;
>>  }
>> +
>> +/*
>> + * If allocation length is less than rt_alloc_min threshold select the
>> + * data device.   Otherwise, select the realtime device.
>> + */
>> +void xfs_rt_alloc_min(
>> +	struct xfs_mount	*mp,
>> +	struct xfs_inode	*ip,
>> +	xfs_off_t			len)
>> +{
>> +	if (!mp->m_rt_alloc_min)
>> +		return;
>> +
>> +	if (len < mp->m_rt_alloc_min) {
>> +		ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
>> +	} else {
>> +		ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
>> +	}
>> +}
>
> This has only one caller, over in another file.  Why not put it inline in
> xfs_inode_select_target?
>

Well rational here was to try to keep as much of the realtime code living
close together.  If this is already a lost cause I can stick it in the same
file.

>> +
>> +/*
>> +* Select the target device for the inode based on either the size of the
>> +* initial allocation, or the amount of space available on the data  
>> device.
>> +*
>> +*/
>> +void xfs_inode_select_target(
>> +	struct xfs_inode	*ip,
>> +	xfs_off_t			len)
>> +{
>> +	struct xfs_mount    *mp = ip->i_mount;
>
> Pls line up these variables on a consistent minimum tab stop, unless it's  
> some
> email artifact that has thrown me off :)

Hmmm, looks good on my vim editor, though I just grabbed the vim plugin from
https://github.com/vivien/vim-linux-coding-style and now I see it.  Fixing...

>
>> +
>> +	/* If the mount does not have a realtime device configured, there's
>> +	 * nothing to do here.
>> +	 */
>> +	if (!XFS_IS_REALTIME_MOUNT(mp))
>> +		return;
>> +
>> +	/* You cannot select a new device target once blocks have been allocated
>> +	 * (e.g. fallocate() beyond EOF), or if data has been written already.
>> +	 */
>> +	if (ip->i_d.di_nextents)
>> +		return;
>> +	if (ip->i_d.di_size)
>> +		return;
>> +
>> +	/* m_rt_alloc_min controls target selection.  Target selection code is
>> +	 * not valid if not set.
>> +	 */
>> +	xfs_rt_alloc_min(mp, ip, len);
>> +}
>> diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
>> index f13133e..eaf7ed3 100644
>> --- a/fs/xfs/xfs_rtalloc.h
>> +++ b/fs/xfs/xfs_rtalloc.h
>> @@ -136,6 +136,7 @@ int xfs_rtalloc_query_range(struct xfs_trans *tp,
>>  int xfs_rtalloc_query_all(struct xfs_trans *tp,
>>  			  xfs_rtalloc_query_range_fn fn,
>>  			  void *priv);
>> +void xfs_inode_select_target(struct xfs_inode *ip, xfs_off_t len);
>>  #else
>>  # define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb)    (ENOSYS)
>>  # define xfs_rtfree_extent(t,b,l)                       (ENOSYS)
>> @@ -155,6 +156,7 @@ xfs_rtmount_init(
>>  }
>>  # define xfs_rtmount_inodes(m)  (((mp)->m_sb.sb_rblocks == 0)? 0 : (ENOSYS))
>>  # define xfs_rtunmount_inodes(m)
>> +# define xfs_inode_select_target(i,l)
>>  #endif	/* CONFIG_XFS_RT */
>>
>>  #endif	/* __XFS_RTALLOC_H__ */
>> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
>> index 80ac15f..1e202a1 100644
>> --- a/fs/xfs/xfs_sysfs.c
>> +++ b/fs/xfs/xfs_sysfs.c
>> @@ -129,10 +129,48 @@ XFS_SYSFS_ATTR_RW(drop_writes);
>>
>>  #endif /* DEBUG */
>>
>> +STATIC ssize_t
>> +rt_alloc_min_store(> +	struct kobject			*kobject,
>> +	const char				*buf,
>> +	size_t					count)
>> +{
>> +	struct xfs_mount		*mp = to_mp(kobject);
>> +	int						ret;
>> +	int						val;
>
> ditto on the lining up sanely:
>
> +rt_alloc_min_store(
> +	struct kobject		*kobject,
> +	const char		*buf,
> +	size_t			count)
> +{
> +	struct xfs_mount	*mp = to_mp(kobject);
> +	int			ret;
> +	int			val;
>

Fixed, hoping with the Linux kernel style vim plugin I won't have these
problems any longer.


>> +
>> +	ret = kstrtoint(buf, 0, &val);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* Only valid if using a real-time device */
>> +	if(!XFS_IS_REALTIME_MOUNT(mp))
>> +		return -EINVAL;
>> +
>> +	if (val >= 0)
>> +		mp->m_rt_alloc_min = val;
>> +	else
>> +		return -EINVAL;
>> +
>> +	return count;
>> +}
>> +
>> +STATIC ssize_t
>> +rt_alloc_min_show(
>> +	struct kobject          *kobject,
>> +	char                    *buf)
>> +{
>> +	struct xfs_mount        *mp = to_mp(kobject);
>> +
>> +	return snprintf(buf, PAGE_SIZE, "%d\n", mp->m_rt_alloc_min);
>> +}
>> +XFS_SYSFS_ATTR_RW(rt_alloc_min);
>
> Is this sysctl presented even if CONFIG_XFS_RT is turned off?
> Seems like that shouldn't show up if it doesn't do anything...
>

Agreed, I originally had it as such, but Dave suggested given it's
user facing it should remain visible.  I just pinged him on IRC and
he's fine changing it back.  I don't have strong feelings either way,
though I lean towards your thinking (hence my original choice).

>> +
>>  static struct attribute *xfs_mp_attrs[] = {
>>  #ifdef DEBUG
>>  	ATTR_LIST(drop_writes),
>>  #endif
>> +	ATTR_LIST(rt_alloc_min),
>>  	NULL,
>>  };


Richard...


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size
  2017-09-25 22:47   ` Darrick J. Wong
@ 2017-09-26  5:25     ` Dave Chinner
  2017-09-26  6:11     ` Richard Wareing
  1 sibling, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2017-09-26  5:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Richard Wareing, linux-xfs, hch

On Mon, Sep 25, 2017 at 03:47:38PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 25, 2017 at 12:44:17PM -0700, Richard Wareing wrote:
> > - The rt_alloc_min sysfs option automatically selects the device (data
> >   device, or realtime) based on the size of the initial allocation of the
> >   file.
> > - This option can be used to route the storage of small files (and the
> >   inefficient workloads associated with them) to a suitable storage
> >   device such a SSD, while larger allocations are sent to a traditional
> >   HDD.
> > - Supports writes via O_DIRECT, buffered (i.e. page cache), and
> >   pre-allocations (i.e. fallocate)
> > - Available only when kernel is compiled w/ CONFIG_XFS_RT option.
> > 
> > Signed-off-by: Richard Wareing <rwareing@fb.com>
> > ---
> > Changes since v4:
> > * Added xfs_inode_select_target function to hold target selection
> >   code
> > * XFS_IS_REALTIME_MOUNT check now moved inside xfs_inode_select_target
> >   function for better gating
> > * Improved consistency in the sysfs set behavior
> > * Style fixes
> > 
> > Changes since v3:
> > * Now functions via initial allocation regardless of O_DIRECT, buffered or
> >   pre-allocation code paths.  Provides a consistent user-experience.
> > * I Did do some experiments putting this in the xfs_bmapi_write code path
> >   however pre-allocation accounting unfortunately prevents this cleaner
> >   approach.  As such, this proved to be the cleanest and functional approach.
> > * No longer a mount option, now a sysfs tunable
> 
> I'm still struggling with the last two patches in this series.
> Currently the decision to set or clear a file's RT flag rests with
> whoever has write access to the file; and whatever they set stays that
> way throughout the life of the file.
> 
> These patches add new behaviors to the filesystem, namely that initial
> allocations and certain truncate operations can set or clear the
> realtime flag, and that this happens without any particular notification
> for whatever might've set that flag.  I imagine that programs that set
> the realtime flag might be a little surprised when it doesn't stay set,
> since we never used to do that.

This only happens if the filesystem has been configured to use the
"auto rtdev selection" function. It will *not* happen to existing RT
device users using default behaviour.

i.e. if you turn on auto-placement, you get the new behaviour. If
you don't turn it on, nothing at all should change.

> At least the new behaviors are hidden behind a sysfs knob, but a big
> thing missing from this series is a document for admins explaining that
> we've added these various knobs to facilitate automatic tiering of files

Please stop calling this "teiring". It's nothing like the industry
definition of storage teiring - bcache and dm-cache get closer, but
without something like a HSM we do not have teiring in XFS because
there's no such thing as completely transparent automatic data
movement....

In reality, this is simply a filesystem allocation policy that is
very similar to inode32. Remember that inode32 selects the target AG
of the inode when XFS_ALLOC_INITIAL_USER_DATA is set. i.e. on the
first allocation to the file. That's exactly what we are doing here
- deciding on the physical location of the data on the first
write/allocation to the file.

If you have a concatenated set of storage volumes underneath the
filesystem, the using inode32 is actually selecting the physical
device the data will reside on when we use inode32. It's not trying
to keep data local to the directory, it's trying to spread data over
the entire block device address space. Align AG sizes to the
underlying physical devices, and you have a direct relationship
between the selected AG and the physical device the data is written
to. If you have multiple AGs to a physical device (e.g. multi-TB
lun) then you end up with stuff like ag_skip to provide better
per-device selection:

http://oss.sgi.com/archives/xfs/2013-01/msg00611.html

That wasn't data teiring, it was an allocation policy that was
tailored to the underlying storage layout. Auto-selecting the rtdev
is the same sort of allocation policy - it's not tiering data, it's
just a policy that selects the best physical location for the
data being written....

> on an XFS filesystem, what the knobs do, and which operations can change
> a file's tier.  I also worry that we have no means to distinguish a file
> that has had the rt flag set by the user from a file with automatic rt
> flag management.

Setting the flag should override the auto placement and directly
selects the rt device. Clearing the flag (which shouldn't be set on
a zero length file) means auto placement.

> I think I may have mislead you a bit on the v1 patches when I objected
> to the new fallocate behaviors -- my biggest concern back then (and now)
> is the behavioral changes and how to help our existing users fit that
> into their mental models of how XFS works.  Is the size of the first
> non-fallocate write to a file sufficient to decide the rtflag?

> TLDR: Is this how we (XFS community) want to present tiered XFS to users?

TL;DR: It's not teiring, and conceptually no different to how we've
used inode32 to direct physical placement of data for many, many
years.

> (Off in the crazy space that is my head we could just upload BPF
> programs into XFS that would set and manage all the hints that can be
> attached to an inode, along with either an "automanaged inode" flag that
> locks out manual control or some kind of note that a particular setting
> overrides the automated control.)

I think I mentioned that a couple of years ago when I saw people
using eBPF for inserting arbitrary debug code into the kernel... :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size
  2017-09-25 22:47   ` Darrick J. Wong
  2017-09-26  5:25     ` Dave Chinner
@ 2017-09-26  6:11     ` Richard Wareing
  1 sibling, 0 replies; 13+ messages in thread
From: Richard Wareing @ 2017-09-26  6:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, hch

Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Mon, Sep 25, 2017 at 12:44:17PM -0700, Richard Wareing wrote:
>> - The rt_alloc_min sysfs option automatically selects the device (data
>>   device, or realtime) based on the size of the initial allocation of the
>>   file.
>> - This option can be used to route the storage of small files (and the
>>   inefficient workloads associated with them) to a suitable storage
>>   device such a SSD, while larger allocations are sent to a traditional
>>   HDD.
>> - Supports writes via O_DIRECT, buffered (i.e. page cache), and
>>   pre-allocations (i.e. fallocate)
>> - Available only when kernel is compiled w/ CONFIG_XFS_RT option.
>>
>> Signed-off-by: Richard Wareing <rwareing@fb.com>
>> ---
>> Changes since v4:
>> * Added xfs_inode_select_target function to hold target selection
>>   code
>> * XFS_IS_REALTIME_MOUNT check now moved inside xfs_inode_select_target
>>   function for better gating
>> * Improved consistency in the sysfs set behavior
>> * Style fixes
>>
>> Changes since v3:
>> * Now functions via initial allocation regardless of O_DIRECT, buffered or
>>   pre-allocation code paths.  Provides a consistent user-experience.
>> * I Did do some experiments putting this in the xfs_bmapi_write code path
>>   however pre-allocation accounting unfortunately prevents this cleaner
>>   approach.  As such, this proved to be the cleanest and functional approach.
>> * No longer a mount option, now a sysfs tunable
>
> I'm still struggling with the last two patches in this series.
> Currently the decision to set or clear a file's RT flag rests with
> whoever has write access to the file; and whatever they set stays that
> way throughout the life of the file.
>
> These patches add new behaviors to the filesystem, namely that initial
> allocations and certain truncate operations can set or clear the
> realtime flag, and that this happens without any particular notification
> for whatever might've set that flag.  I imagine that programs that set
> the realtime flag might be a little surprised when it doesn't stay set,
> since we never used to do that.
>

I'll add documentation to explain the knobs and how they should be used.
The whole idea here is that programs should should not have to be written
to set these knobs as you can use a FS level policy mechanism  instead.
If you would rather more granular control, that's fine turn the policies
off and set the flags manually.

Nothing should come as a surprise as the knobs are all disabled by default.

> At least the new behaviors are hidden behind a sysfs knob, but a big
> thing missing from this series is a document for admins explaining that
> we've added these various knobs to facilitate automatic tiering of files
> on an XFS filesystem, what the knobs do, and which operations can change
> a file's tier.  I also worry that we have no means to distinguish a file
> that has had the rt flag set by the user from a file with automatic rt
> flag management.
>

I don't worry as much, I think folks will choose to either use system-wide
policies or set the flags manually.  Are there specific use-cases you have
in mind where this might be a concern?  I only know of our use-case and
perhaps some PVR usage out there.

> I think I may have mislead you a bit on the v1 patches when I objected
> to the new fallocate behaviors -- my biggest concern back then (and now)
> is the behavioral changes and how to help our existing users fit that
> into their mental models of how XFS works.  Is the size of the first
> non-fallocate write to a file sufficient to decide the rtflag?
>

Without knowing the future write pattern of the file, it's I think the
best I can do.  In a perfect world, I'd love to have XFS automagically
"promote" a file to the realtime device should it grow in size.  For
now, my plan is to update xfs_fsr to migrate files which grow beyond the
policy limit.


> TLDR: Is this how we (XFS community) want to present tiered XFS to users?
>

IMHO, I think one can't answer this question until there are some more
use-cases around which can help inform how things should look.  My hope is
we can spark some interest and additional use-cases by making some modest
improvements (making realtime more accessible/intuitive) and then iterate.

I'll also be sure to share our experience/learnings running this at scale.


> (Off in the crazy space that is my head we could just upload BPF
> programs into XFS that would set and manage all the hints that can be
> attached to an inode, along with either an "automanaged inode" flag that
> locks out manual control or some kind of note that a particular setting
> overrides the automated control.)
>
> There are several mechanical problems too, which I'll cover below.
>
>>  fs/xfs/xfs_bmap_util.c |  2 ++
>>  fs/xfs/xfs_inode.c     | 18 ++++++++++++------
>>  fs/xfs/xfs_iomap.c     |  5 +++++
>>  fs/xfs/xfs_mount.h     |  1 +
>>  fs/xfs/xfs_rtalloc.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/xfs/xfs_rtalloc.h   |  2 ++
>>  fs/xfs/xfs_sysfs.c     | 38 ++++++++++++++++++++++++++++++++++++++
>>  7 files changed, 110 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 9e3cc21..8205669d 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1026,6 +1026,8 @@ xfs_alloc_file_space(
>>  	if (len <= 0)
>>  		return -EINVAL;
>>
>> +	xfs_inode_select_target(ip, len);
>> +
>>  	rt = XFS_IS_REALTIME_INODE(ip);
>>  	extsz = xfs_get_extsz_hint(ip);
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index ec9826c..f9e2deb 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1620,12 +1620,18 @@ xfs_itruncate_extents(
>>  	if (error)
>>  		goto out;
>>
>> -	/*
>> -	 * Clear the reflink flag if we truncated everything.
>> -	 */
>> -	if (ip->i_d.di_nblocks == 0 && xfs_is_reflink_inode(ip)) {
>> -		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
>> -		xfs_inode_clear_cowblocks_tag(ip);
>
> FYI this chunk is going to change when some of the patches I have queued
> up for -rc3 go upstream.
>
>> +	if (ip->i_d.di_nblocks == 0) {
>> +		/*
>> +		 * Clear the reflink flag if we truncated everything.
>> +		 */
>> +		if (xfs_is_reflink_inode(ip)) {
>> +			ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
>> +			xfs_inode_clear_cowblocks_tag(ip);
>> +		}
>> +		/* Clear realtime flag if m_rt_alloc_min policy is in place */
>> +		if (XFS_IS_REALTIME_MOUNT(mp) && mp->m_rt_alloc_min) {
>> +			ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
>
> Why clear the rt flag from the file if we truncate the file and no
> blocks are left, but not if (say) we punch out every block in the file?
>
> Say we set rt_alloc_min = 64k and then pwrite 1mb at 10mb so that the rt
> flag is automatically set.  Then we truncate the file to 9mb.  The file
> size is 9MB, it has no blocks, and we auto-clear the rt flag?
>
> Yet below I see that the auto-set function bails out if the /size/ is
> greater than zero.
>
> This seems inconsistent to me, why don't we also require the size to be
> zero to clear the rt flag?  What if I set the rt flag myself and later
> it gets cleared out automatically?  That's a new behavior that quietly
> happens behind my back.
>

A fair point.  I think the original rationale here was that because the
placement is based on the size of the first allocation to the file or
first after truncation.  I don't really consider it happening behind
anyone's back, as the behavior can only occur if an administrator set
the option in the first place :).

I'm fine with requiring the size of the file be 0 as well on this, this
is probably a bit more intuitive.


>> +		}
>>  	}
>>
>>  	/*
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 94e5bdf..b3c3b9b 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -40,6 +40,7 @@
>>  #include "xfs_dquot_item.h"
>>  #include "xfs_dquot.h"
>>  #include "xfs_reflink.h"
>> +#include "xfs_rtalloc.h"
>>
>>
>>  #define XFS_WRITEIO_ALIGN(mp,off)	(((off) >> mp->m_writeio_log) \
>> @@ -174,6 +175,8 @@ xfs_iomap_write_direct(
>>  	int		bmapi_flags = XFS_BMAPI_PREALLOC;
>>  	uint		tflags = 0;
>>
>> +	xfs_inode_select_target(ip, count);
>> +
>>  	rt = XFS_IS_REALTIME_INODE(ip);
>>  	extsz = xfs_get_extsz_hint(ip);
>>  	lockmode = XFS_ILOCK_SHARED;	/* locked by caller */
>> @@ -981,6 +984,8 @@ xfs_file_iomap_begin(
>>  	if (XFS_FORCED_SHUTDOWN(mp))
>>  		return -EIO;
>>
>> +	xfs_inode_select_target(ip, length);
>
> iomap_begin can get called for more than just file writes.
> Don't change the rt flag because someone FIEMAP'd a zero-length file.
>

You'll have to help me out here, what is appropriate signal/flag to check
that the call is for a file write here?

>> +
>>  	if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
>>  			!IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>>  		/* Reserve delalloc blocks for regular writeback. */
>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>> index 9fa312a..2adc701 100644
>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -197,6 +197,7 @@ typedef struct xfs_mount {
>>  	__uint32_t		m_generation;
>>
>>  	bool			m_fail_unmount;
>> +	uint			m_rt_alloc_min; /* Min RT allocation */
>
> Bytes?  Do we need xfs_off_t here?
>

I can do this, but other fields in this structs don't appear to be using
xfs_off_t; though perhaps this is a legacy thing?

>> #ifdef DEBUG
>>  	/*
>>  	 * DEBUG mode instrumentation to test and/or trigger delayed allocation
>> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
>> index c57aa7f..421f860 100644
>> --- a/fs/xfs/xfs_rtalloc.c
>> +++ b/fs/xfs/xfs_rtalloc.c
>> @@ -1284,3 +1284,53 @@ xfs_rtpick_extent(
>>  	*pick = b;
>>  	return 0;
>>  }
>> +
>> +/*
>> + * If allocation length is less than rt_alloc_min threshold select the
>> + * data device.   Otherwise, select the realtime device.
>> + */
>> +void xfs_rt_alloc_min(
>> +	struct xfs_mount	*mp,
>> +	struct xfs_inode	*ip,
>> +	xfs_off_t			len)
>> +{
>
> Indenting...
>

Fixed...hopefully for good now that I have the Linux kernel style vim
plugin.

> void
> xfs_rt_alloc_min(
> 	struct xfs_mount	*mp,
> 	struct xfs_inode	*ip,
> 	xfs_off_t		len)
> {
> 	...
> }
>
>> +	if (!mp->m_rt_alloc_min)
>> +		return;
>> +
>> +	if (len < mp->m_rt_alloc_min) {
>> +		ip->i_d.di_flags &= ~XFS_DIFLAG_REALTIME;
>> +	} else {
>> +		ip->i_d.di_flags |= XFS_DIFLAG_REALTIME;
>> +	}
>
> Wait, xfs_rt_alloc_min updates inode flags??  Don't we need a
> transaction and a xfs_trans_log_inode for this?  Since this is a
> helper for the next function, the same question applies to it.
>

You totally correct, fixing per Eric as well.

>> +}
>> +
>> +/*
>> +* Select the target device for the inode based on either the size of the
>> +* initial allocation, or the amount of space available on the data  
>> device.
>> +*
>> +*/
>> +void xfs_inode_select_target(
>> +	struct xfs_inode	*ip,
>> +	xfs_off_t			len)
>
> Indenting again, and ... this is an inode function, shouldn't it go in
> xfs_inode.c ?
>
> And no, you can't just change the incore inode flags without a
> transaction and without the ilock.
>

Yup, bear with me as I learn :).

>> +{
>> +	struct xfs_mount    *mp = ip->i_mount;
>> +
>> +	/* If the mount does not have a realtime device configured, there's
>> +	 * nothing to do here.
>> +	 */
>> +	if (!XFS_IS_REALTIME_MOUNT(mp))
>> +		return;
>> +
>> +	/* You cannot select a new device target once blocks have been allocated
>> +	 * (e.g. fallocate() beyond EOF), or if data has been written already.
>> +	 */
>> +	if (ip->i_d.di_nextents)
>> +		return;
>> +	if (ip->i_d.di_size)
>> +		return;
>> +
>> +	/* m_rt_alloc_min controls target selection.  Target selection code is
>> +	 * not valid if not set.
>> +	 */
>> +	xfs_rt_alloc_min(mp, ip, len);
>
> If the helper functions are only going to be a few lines long and used
> once, just put them directly in the function here.
>

<retracted> ... ya the reason for this becomes clear in the next commit
as you found out.

>> +}
>> diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
>> index f13133e..eaf7ed3 100644
>> --- a/fs/xfs/xfs_rtalloc.h
>> +++ b/fs/xfs/xfs_rtalloc.h
>> @@ -136,6 +136,7 @@ int xfs_rtalloc_query_range(struct xfs_trans *tp,
>>  int xfs_rtalloc_query_all(struct xfs_trans *tp,
>>  			  xfs_rtalloc_query_range_fn fn,
>>  			  void *priv);
>> +void xfs_inode_select_target(struct xfs_inode *ip, xfs_off_t len);
>>  #else
>>  # define xfs_rtallocate_extent(t,b,min,max,l,f,p,rb)    (ENOSYS)
>>  # define xfs_rtfree_extent(t,b,l)                       (ENOSYS)
>> @@ -155,6 +156,7 @@ xfs_rtmount_init(
>>  }
>>  # define xfs_rtmount_inodes(m)  (((mp)->m_sb.sb_rblocks == 0)? 0 : (ENOSYS))
>>  # define xfs_rtunmount_inodes(m)
>> +# define xfs_inode_select_target(i,l)
>>  #endif	/* CONFIG_XFS_RT */
>>
>>  #endif	/* __XFS_RTALLOC_H__ */
>> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
>> index 80ac15f..1e202a1 100644
>> --- a/fs/xfs/xfs_sysfs.c
>> +++ b/fs/xfs/xfs_sysfs.c
>> @@ -129,10 +129,48 @@ XFS_SYSFS_ATTR_RW(drop_writes);
>>
>>  #endif /* DEBUG */
>>
>> +STATIC ssize_t
>> +rt_alloc_min_store(
>> +	struct kobject			*kobject,
>> +	const char				*buf,
>> +	size_t					count)
>> +{
>> +	struct xfs_mount		*mp = to_mp(kobject);
>> +	int						ret;
>> +	int						val;
>> +
>> +	ret = kstrtoint(buf, 0, &val);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/* Only valid if using a real-time device */
>> +	if(!XFS_IS_REALTIME_MOUNT(mp))
>> +		return -EINVAL;
>> +
>> +	if (val >= 0)
>> +		mp->m_rt_alloc_min = val;
>> +	else
>> +		return -EINVAL;
>> +
>> +	return count;
>> +}
>> +
>> +STATIC ssize_t
>> +rt_alloc_min_show(
>> +	struct kobject          *kobject,
>> +	char                    *buf)
>> +{
>> +	struct xfs_mount        *mp = to_mp(kobject);
>> +
>> +	return snprintf(buf, PAGE_SIZE, "%d\n", mp->m_rt_alloc_min);
>> +}
>> +XFS_SYSFS_ATTR_RW(rt_alloc_min);
>> +
>>  static struct attribute *xfs_mp_attrs[] = {
>>  #ifdef DEBUG
>>  	ATTR_LIST(drop_writes),
>>  #endif
>> +	ATTR_LIST(rt_alloc_min),
>>  	NULL,
>>  };
>>
>> -- 
>> 2.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


As always, thanks for the feedback!

Richard





^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-09-26  6:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-25 19:44 [PATCH v5 0/3] XFS realtime device tweaks Richard Wareing
2017-09-25 19:44 ` [PATCH v5 1/3] xfs: Show realtime device stats on statfs calls if inherit flag set Richard Wareing
2017-09-25 22:53   ` Eric Sandeen
2017-09-26  3:32     ` Richard Wareing
2017-09-25 22:55   ` Darrick J. Wong
2017-09-25 19:44 ` [PATCH v5 2/3] xfs: Set realtime flag based on initial allocation size Richard Wareing
2017-09-25 22:47   ` Darrick J. Wong
2017-09-26  5:25     ` Dave Chinner
2017-09-26  6:11     ` Richard Wareing
2017-09-26  0:13   ` Eric Sandeen
2017-09-26  5:17     ` Richard Wareing
2017-09-25 19:44 ` [PATCH v5 3/3] xfs: Add realtime fallback if data device full Richard Wareing
2017-09-25 22:52   ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.