All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET 00/17] xfsprogs: various 5.15 fixes
@ 2022-01-20  0:21 Darrick J. Wong
  2022-01-20  0:21 ` [PATCH 01/17] libxcmd: use emacs mode for command history editing Darrick J. Wong
                   ` (18 more replies)
  0 siblings, 19 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:21 UTC (permalink / raw)
  To: sandeen, djwong
  Cc: Christoph Hellwig, Dave Chinner, Theodore Ts'o, linux-xfs,
	allison.henderson

Hi all,

Here are a number of tooling changes for xfsprogs 5.15 that are not a
direct consequence of the 5.15 kernel.  This is a bit of a grab bag,
folks; bug fixes include:

 - editline should use emacs mode instead of vim mode
 - fix some UAF bugs in libxfs
 - stop duplicating kernel headers for GETFSMAP
 - write the secondary sbs after upgrading v5 features
 - fix memory corruption bugs when parsing mkfs config files
 - whitespace and formatting fixes

Straight-up enhancements include:

 - adding mkfs config files
 - enabling the inobtcount and bigtime features by default
 - having scrub report whether or not it supports unicode name checks

Enjoy!

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=xfsprogs-5.15-fixes
---
 db/faddr.c               |    6 ++
 db/input.c               |    1 
 include/builddefs.in     |    2 +
 include/linux.h          |  105 -----------------------------------------
 io/crc32cselftest.c      |    2 -
 io/fsmap.c               |    1 
 io/io.h                  |    5 --
 libfrog/Makefile         |    1 
 libfrog/crc32.c          |    2 -
 libfrog/crc32cselftest.h |   21 ++++++--
 libfrog/fsmap.h          |  117 ++++++++++++++++++++++++++++++++++++++++++++++
 libxcmd/input.c          |    1 
 libxfs/libxfs_api_defs.h |    2 +
 libxfs/rdwr.c            |   23 +++++----
 libxfs/trans.c           |   19 +++++++
 man/man8/Makefile        |    7 +++
 man/man8/mkfs.xfs.8.in   |    8 +++
 man/man8/xfs_repair.8    |    4 ++
 mkfs/Makefile            |   11 ++++
 mkfs/dax_x86_64.conf     |   19 +++++++
 mkfs/lts_4.19.conf       |   13 +++++
 mkfs/lts_5.10.conf       |   13 +++++
 mkfs/lts_5.15.conf       |   13 +++++
 mkfs/lts_5.4.conf        |   13 +++++
 mkfs/xfs_mkfs.c          |   27 +++++++++--
 repair/dir2.c            |    2 -
 repair/globals.c         |    1 
 repair/globals.h         |    1 
 repair/init.c            |    5 ++
 repair/phase2.c          |   38 +++++++--------
 repair/quotacheck.c      |    9 ++--
 repair/xfs_repair.c      |   15 ++++++
 scrub/phase6.c           |    1 
 scrub/phase7.c           |    1 
 scrub/spacemap.c         |    1 
 scrub/xfs_scrub.c        |   12 ++++-
 spaceman/freesp.c        |    1 
 spaceman/space.h         |    4 --
 tools/libxfs-apply       |   42 +++++++----------
 39 files changed, 379 insertions(+), 190 deletions(-)
 create mode 100644 libfrog/fsmap.h
 rename man/man8/{mkfs.xfs.8 => mkfs.xfs.8.in} (99%)
 create mode 100644 mkfs/dax_x86_64.conf
 create mode 100644 mkfs/lts_4.19.conf
 create mode 100644 mkfs/lts_5.10.conf
 create mode 100644 mkfs/lts_5.15.conf
 create mode 100644 mkfs/lts_5.4.conf


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 01/17] libxcmd: use emacs mode for command history editing
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
@ 2022-01-20  0:21 ` Darrick J. Wong
  2022-01-20  0:21 ` [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:21 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: Christoph Hellwig, linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Prior to xfsprogs 5.7.0, we built xfsprogs with libreadline support by
default.  In its default configuration, that library interpreted various
keystrokes in a direct manner (e.g. backspace deletes the character to
the left of the cursor), which seems consistent with how emacs behaves.

However, libeditline's default keybindings are consistent with vim,
which means that suddenly users are presented with not the same line
editing interface that they had before.  Since libeditline is
configurable (put "bind -v" in editrc if you really want vim mode),
let's put things back the way they were.  At least as much as we can.

Fixes: bbe12eb9 ("xfsprogs: remove libreadline support")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 db/input.c      |    1 +
 libxcmd/input.c |    1 +
 2 files changed, 2 insertions(+)


diff --git a/db/input.c b/db/input.c
index 448e84b0..d8113599 100644
--- a/db/input.c
+++ b/db/input.c
@@ -227,6 +227,7 @@ fetchline(void)
 		el_set(el, EL_SIGNAL, 1);
 		el_set(el, EL_PROMPT, el_get_prompt);
 		el_set(el, EL_HIST, history, (const char *)hist);
+		el_set(el, EL_EDITOR, "emacs");
 	}
 
 	if (inputstacksize == 1) {
diff --git a/libxcmd/input.c b/libxcmd/input.c
index e3fa626a..fa80e5ab 100644
--- a/libxcmd/input.c
+++ b/libxcmd/input.c
@@ -45,6 +45,7 @@ fetchline(void)
 		el_set(el, EL_SIGNAL, 1);
 		el_set(el, EL_PROMPT, el_get_prompt);
 		el_set(el, EL_HIST, history, (const char *)hist);
+		el_set(el, EL_EDITOR, "emacs");
 	}
 	cmd = el_gets(el, &count);
 	if (!cmd)


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
  2022-01-20  0:21 ` [PATCH 01/17] libxcmd: use emacs mode for command history editing Darrick J. Wong
@ 2022-01-20  0:21 ` Darrick J. Wong
  2022-02-04 21:36   ` Eric Sandeen
  2022-01-20  0:21 ` [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf Darrick J. Wong
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:21 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

While debugging some very strange rmap corruption reports in connection
with the online directory repair code.  I root-caused the error to the
following incorrect sequence:

<start repair transaction>
<expand directory, causing a deferred rmap to be queued>
<roll transaction>
<cancel transaction>

Obviously, we should have committed the transaction instead of
cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
have warned us that we were throwing away work item that we already
committed to performing.  This is not correct, and we need to shut down
the filesystem.

Change xfs_trans_cancel to complain in the loudest manner if we're
cancelling any transaction with deferred work items attached.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/trans.c |   19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)


diff --git a/libxfs/trans.c b/libxfs/trans.c
index fd2e6f9d..8c16cb8d 100644
--- a/libxfs/trans.c
+++ b/libxfs/trans.c
@@ -318,13 +318,30 @@ void
 libxfs_trans_cancel(
 	struct xfs_trans	*tp)
 {
+	bool			dirty;
+
 	trace_xfs_trans_cancel(tp, _RET_IP_);
 
 	if (tp == NULL)
 		return;
+	dirty = (tp->t_flags & XFS_TRANS_DIRTY);
 
-	if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
+	/*
+	 * It's never valid to cancel a transaction with deferred ops attached,
+	 * because the transaction is effectively dirty.  Complain about this
+	 * loudly before freeing the in-memory defer items.
+	 */
+	if (!list_empty(&tp->t_dfops)) {
+		ASSERT(list_empty(&tp->t_dfops));
+		ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+		dirty = true;
 		xfs_defer_cancel(tp);
+	}
+
+	if (dirty) {
+		fprintf(stderr, _("Cancelling dirty transaction!\n"));
+		abort();
+	}
 
 	xfs_trans_free_items(tp);
 	xfs_trans_free(tp);


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
  2022-01-20  0:21 ` [PATCH 01/17] libxcmd: use emacs mode for command history editing Darrick J. Wong
  2022-01-20  0:21 ` [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
@ 2022-01-20  0:21 ` Darrick J. Wong
  2022-02-04 22:05   ` Eric Sandeen
  2022-01-20  0:21 ` [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog Darrick J. Wong
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:21 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

When we're preparing to move a list of xfs_buf(fers) to the freelist, be
sure to detach the perag reference so that we don't leak the reference
or leave dangling pointers.  Currently this has no negative effects
since we only call libxfs_bulkrelse while exiting programs, but let's
not be sloppy.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/rdwr.c |   23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)


diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 2a9e8c98..b43527e4 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -887,11 +887,19 @@ libxfs_buf_mark_dirty(
 	bp->b_flags |= LIBXFS_B_DIRTY;
 }
 
-/* Complain about (and remember) dropping dirty buffers. */
-static void
-libxfs_whine_dirty_buf(
+/* Prepare a buffer to be sent to the MRU list. */
+static inline void
+libxfs_buf_prepare_mru(
 	struct xfs_buf		*bp)
 {
+	if (bp->b_pag)
+		xfs_perag_put(bp->b_pag);
+	bp->b_pag = NULL;
+
+	if (!(bp->b_flags & LIBXFS_B_DIRTY))
+		return;
+
+	/* Complain about (and remember) dropping dirty buffers. */
 	fprintf(stderr, _("%s: Releasing dirty buffer to free list!\n"),
 			progname);
 
@@ -909,11 +917,7 @@ libxfs_brelse(
 
 	if (!bp)
 		return;
-	if (bp->b_flags & LIBXFS_B_DIRTY)
-		libxfs_whine_dirty_buf(bp);
-	if (bp->b_pag)
-		xfs_perag_put(bp->b_pag);
-	bp->b_pag = NULL;
+	libxfs_buf_prepare_mru(bp);
 
 	pthread_mutex_lock(&xfs_buf_freelist.cm_mutex);
 	list_add(&bp->b_node.cn_mru, &xfs_buf_freelist.cm_list);
@@ -932,8 +936,7 @@ libxfs_bulkrelse(
 		return 0 ;
 
 	list_for_each_entry(bp, list, b_node.cn_mru) {
-		if (bp->b_flags & LIBXFS_B_DIRTY)
-			libxfs_whine_dirty_buf(bp);
+		libxfs_buf_prepare_mru(bp);
 		count++;
 	}
 


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-01-20  0:21 ` [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf Darrick J. Wong
@ 2022-01-20  0:21 ` Darrick J. Wong
  2022-02-04 23:18   ` Eric Sandeen
  2022-02-08 16:46   ` [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions Darrick J. Wong
  2022-01-20  0:22 ` [PATCH 05/17] misc: add a crc32c self test to mkfs and repair Darrick J. Wong
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:21 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Move our private copy of the GETFSMAP definition into a libfrog header
file that (preferentially) uses the system header files.  We have no
business shipping kernel headers in the xfslibs package, but this shim
is still needed to build fully functional xfsprogs on old userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/linux.h   |  105 ------------------------------------------------
 io/fsmap.c        |    1 
 io/io.h           |    5 --
 libfrog/Makefile  |    1 
 libfrog/fsmap.h   |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/phase6.c    |    1 
 scrub/phase7.c    |    1 
 scrub/spacemap.c  |    1 
 spaceman/freesp.c |    1 
 spaceman/space.h  |    4 --
 10 files changed, 123 insertions(+), 114 deletions(-)
 create mode 100644 libfrog/fsmap.h


diff --git a/include/linux.h b/include/linux.h
index de8a7122..3d9f4e3d 100644
--- a/include/linux.h
+++ b/include/linux.h
@@ -251,111 +251,6 @@ struct fsxattr {
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
 #endif
 
-#ifdef HAVE_GETFSMAP
-# include <linux/fsmap.h>
-#else
-/*
- *	Structure for FS_IOC_GETFSMAP.
- *
- *	The memory layout for this call are the scalar values defined in
- *	struct fsmap_head, followed by two struct fsmap that describe
- *	the lower and upper bound of mappings to return, followed by an
- *	array of struct fsmap mappings.
- *
- *	fmh_iflags control the output of the call, whereas fmh_oflags report
- *	on the overall record output.  fmh_count should be set to the
- *	length of the fmh_recs array, and fmh_entries will be set to the
- *	number of entries filled out during each call.  If fmh_count is
- *	zero, the number of reverse mappings will be returned in
- *	fmh_entries, though no mappings will be returned.  fmh_reserved
- *	must be set to zero.
- *
- *	The two elements in the fmh_keys array are used to constrain the
- *	output.  The first element in the array should represent the
- *	lowest disk mapping ("low key") that the user wants to learn
- *	about.  If this value is all zeroes, the filesystem will return
- *	the first entry it knows about.  For a subsequent call, the
- *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
- *	copied into fmh_keys[0] to have the kernel start where it left off.
- *
- *	The second element in the fmh_keys array should represent the
- *	highest disk mapping ("high key") that the user wants to learn
- *	about.  If this value is all ones, the filesystem will not stop
- *	until it runs out of mapping to return or runs out of space in
- *	fmh_recs.
- *
- *	fmr_device can be either a 32-bit cookie representing a device, or
- *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
- *	fmr_offset, and fmr_length are expressed in units of bytes.
- *	fmr_owner is either an inode number, or a special value if
- *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
- */
-struct fsmap {
-	__u32		fmr_device;	/* device id */
-	__u32		fmr_flags;	/* mapping flags */
-	__u64		fmr_physical;	/* device offset of segment */
-	__u64		fmr_owner;	/* owner id */
-	__u64		fmr_offset;	/* file offset of segment */
-	__u64		fmr_length;	/* length of segment */
-	__u64		fmr_reserved[3];	/* must be zero */
-};
-
-struct fsmap_head {
-	__u32		fmh_iflags;	/* control flags */
-	__u32		fmh_oflags;	/* output flags */
-	__u32		fmh_count;	/* # of entries in array incl. input */
-	__u32		fmh_entries;	/* # of entries filled in (output). */
-	__u64		fmh_reserved[6];	/* must be zero */
-
-	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
-	struct fsmap	fmh_recs[];	/* returned records */
-};
-
-/* Size of an fsmap_head with room for nr records. */
-static inline size_t
-fsmap_sizeof(
-	unsigned int	nr)
-{
-	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
-}
-
-/* Start the next fsmap query at the end of the current query results. */
-static inline void
-fsmap_advance(
-	struct fsmap_head	*head)
-{
-	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
-}
-
-/*	fmh_iflags values - set by XFS_IOC_GETFSMAP caller in the header. */
-/* no flags defined yet */
-#define FMH_IF_VALID		0
-
-/*	fmh_oflags values - returned in the header segment only. */
-#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
-
-/*	fmr_flags values - returned for each non-header segment */
-#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
-#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
-#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
-#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
-#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
-#define FMR_OF_LAST		0x20	/* segment is the last in the FS */
-
-/* Each FS gets to define its own special owner codes. */
-#define FMR_OWNER(type, code)	(((__u64)type << 32) | \
-				 ((__u64)code & 0xFFFFFFFFULL))
-#define FMR_OWNER_TYPE(owner)	((__u32)((__u64)owner >> 32))
-#define FMR_OWNER_CODE(owner)	((__u32)(((__u64)owner & 0xFFFFFFFFULL)))
-#define FMR_OWN_FREE		FMR_OWNER(0, 1) /* free space */
-#define FMR_OWN_UNKNOWN		FMR_OWNER(0, 2) /* unknown owner */
-#define FMR_OWN_METADATA	FMR_OWNER(0, 3) /* metadata */
-
-#define FS_IOC_GETFSMAP		_IOWR('X', 59, struct fsmap_head)
-
-#define HAVE_GETFSMAP
-#endif /* HAVE_GETFSMAP */
-
 #ifndef HAVE_MAP_SYNC
 #define MAP_SYNC 0
 #define MAP_SHARED_VALIDATE 0
diff --git a/io/fsmap.c b/io/fsmap.c
index f540a7c0..9ff36bf4 100644
--- a/io/fsmap.c
+++ b/io/fsmap.c
@@ -10,6 +10,7 @@
 #include "io.h"
 #include "input.h"
 #include "libfrog/fsgeom.h"
+#include "libfrog/fsmap.h"
 
 static cmdinfo_t	fsmap_cmd;
 static dev_t		xfs_data_dev;
diff --git a/io/io.h b/io/io.h
index 49db902f..39fb5878 100644
--- a/io/io.h
+++ b/io/io.h
@@ -167,12 +167,7 @@ extern void		readdir_init(void);
 extern void		reflink_init(void);
 
 extern void		cowextsize_init(void);
-
-#ifdef HAVE_GETFSMAP
 extern void		fsmap_init(void);
-#else
-# define fsmap_init()	do { } while (0)
-#endif
 
 #ifdef HAVE_DEVMAPPER
 extern void		log_writes_init(void);
diff --git a/libfrog/Makefile b/libfrog/Makefile
index 01107082..d6044455 100644
--- a/libfrog/Makefile
+++ b/libfrog/Makefile
@@ -40,6 +40,7 @@ crc32cselftest.h \
 crc32defs.h \
 crc32table.h \
 fsgeom.h \
+fsmap.h \
 logging.h \
 paths.h \
 projects.h \
diff --git a/libfrog/fsmap.h b/libfrog/fsmap.h
new file mode 100644
index 00000000..dc290962
--- /dev/null
+++ b/libfrog/fsmap.h
@@ -0,0 +1,117 @@
+#ifdef HAVE_GETFSMAP
+# include <linux/fsmap.h>
+#endif
+
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * FS_IOC_GETFSMAP ioctl infrastructure.
+ *
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _LINUX_FSMAP_H
+#define _LINUX_FSMAP_H
+
+#include <linux/types.h>
+
+/*
+ *	Structure for FS_IOC_GETFSMAP.
+ *
+ *	The memory layout for this call are the scalar values defined in
+ *	struct fsmap_head, followed by two struct fsmap that describe
+ *	the lower and upper bound of mappings to return, followed by an
+ *	array of struct fsmap mappings.
+ *
+ *	fmh_iflags control the output of the call, whereas fmh_oflags report
+ *	on the overall record output.  fmh_count should be set to the
+ *	length of the fmh_recs array, and fmh_entries will be set to the
+ *	number of entries filled out during each call.  If fmh_count is
+ *	zero, the number of reverse mappings will be returned in
+ *	fmh_entries, though no mappings will be returned.  fmh_reserved
+ *	must be set to zero.
+ *
+ *	The two elements in the fmh_keys array are used to constrain the
+ *	output.  The first element in the array should represent the
+ *	lowest disk mapping ("low key") that the user wants to learn
+ *	about.  If this value is all zeroes, the filesystem will return
+ *	the first entry it knows about.  For a subsequent call, the
+ *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
+ *	copied into fmh_keys[0] to have the kernel start where it left off.
+ *
+ *	The second element in the fmh_keys array should represent the
+ *	highest disk mapping ("high key") that the user wants to learn
+ *	about.  If this value is all ones, the filesystem will not stop
+ *	until it runs out of mapping to return or runs out of space in
+ *	fmh_recs.
+ *
+ *	fmr_device can be either a 32-bit cookie representing a device, or
+ *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
+ *	fmr_offset, and fmr_length are expressed in units of bytes.
+ *	fmr_owner is either an inode number, or a special value if
+ *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
+ */
+struct fsmap {
+	__u32		fmr_device;	/* device id */
+	__u32		fmr_flags;	/* mapping flags */
+	__u64		fmr_physical;	/* device offset of segment */
+	__u64		fmr_owner;	/* owner id */
+	__u64		fmr_offset;	/* file offset of segment */
+	__u64		fmr_length;	/* length of segment */
+	__u64		fmr_reserved[3];	/* must be zero */
+};
+
+struct fsmap_head {
+	__u32		fmh_iflags;	/* control flags */
+	__u32		fmh_oflags;	/* output flags */
+	__u32		fmh_count;	/* # of entries in array incl. input */
+	__u32		fmh_entries;	/* # of entries filled in (output). */
+	__u64		fmh_reserved[6];	/* must be zero */
+
+	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
+	struct fsmap	fmh_recs[];	/* returned records */
+};
+
+/* Size of an fsmap_head with room for nr records. */
+static __inline__ size_t
+fsmap_sizeof(
+	unsigned int	nr)
+{
+	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
+}
+
+/* Start the next fsmap query at the end of the current query results. */
+static __inline__ void
+fsmap_advance(
+	struct fsmap_head	*head)
+{
+	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
+}
+
+/*	fmh_iflags values - set by FS_IOC_GETFSMAP caller in the header. */
+/* no flags defined yet */
+#define FMH_IF_VALID		0
+
+/*	fmh_oflags values - returned in the header segment only. */
+#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
+
+/*	fmr_flags values - returned for each non-header segment */
+#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
+#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
+#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
+#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
+#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
+#define FMR_OF_LAST		0x20	/* segment is the last in the dataset */
+
+/* Each FS gets to define its own special owner codes. */
+#define FMR_OWNER(type, code)	(((__u64)type << 32) | \
+				 ((__u64)code & 0xFFFFFFFFULL))
+#define FMR_OWNER_TYPE(owner)	((__u32)((__u64)owner >> 32))
+#define FMR_OWNER_CODE(owner)	((__u32)(((__u64)owner & 0xFFFFFFFFULL)))
+#define FMR_OWN_FREE		FMR_OWNER(0, 1) /* free space */
+#define FMR_OWN_UNKNOWN		FMR_OWNER(0, 2) /* unknown owner */
+#define FMR_OWN_METADATA	FMR_OWNER(0, 3) /* metadata */
+
+#define FS_IOC_GETFSMAP		_IOWR('X', 59, struct fsmap_head)
+
+#endif /* _LINUX_FSMAP_H */
diff --git a/scrub/phase6.c b/scrub/phase6.c
index 87828b60..dd42b66c 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -10,6 +10,7 @@
 #include "handle.h"
 #include "libfrog/paths.h"
 #include "libfrog/workqueue.h"
+#include "libfrog/fsmap.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "libfrog/bitmap.h"
diff --git a/scrub/phase7.c b/scrub/phase7.c
index bc652ab6..e24906d1 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -9,6 +9,7 @@
 #include <sys/statvfs.h>
 #include "libfrog/paths.h"
 #include "libfrog/ptvar.h"
+#include "libfrog/fsmap.h"
 #include "list.h"
 #include "xfs_scrub.h"
 #include "common.h"
diff --git a/scrub/spacemap.c b/scrub/spacemap.c
index a5508d56..b7f17e57 100644
--- a/scrub/spacemap.c
+++ b/scrub/spacemap.c
@@ -10,6 +10,7 @@
 #include <sys/statvfs.h>
 #include "libfrog/workqueue.h"
 #include "libfrog/paths.h"
+#include "libfrog/fsmap.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "spacemap.h"
diff --git a/spaceman/freesp.c b/spaceman/freesp.c
index de301c19..4e46ab26 100644
--- a/spaceman/freesp.c
+++ b/spaceman/freesp.c
@@ -9,6 +9,7 @@
 #include "libxfs.h"
 #include <linux/fiemap.h>
 #include "libfrog/fsgeom.h"
+#include "libfrog/fsmap.h"
 #include "command.h"
 #include "init.h"
 #include "libfrog/paths.h"
diff --git a/spaceman/space.h b/spaceman/space.h
index 723209ed..a8055535 100644
--- a/spaceman/space.h
+++ b/spaceman/space.h
@@ -26,11 +26,7 @@ extern void	help_init(void);
 extern void	prealloc_init(void);
 extern void	quit_init(void);
 extern void	trim_init(void);
-#ifdef HAVE_GETFSMAP
 extern void	freesp_init(void);
-#else
-# define freesp_init()	do { } while (0)
-#endif
 extern void	info_init(void);
 extern void	health_init(void);
 


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 05/17] misc: add a crc32c self test to mkfs and repair
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-01-20  0:21 ` [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-02-04 23:23   ` Eric Sandeen
  2022-01-20  0:22 ` [PATCH 06/17] libxfs-apply: support filterdiff >= 0.4.2 only Darrick J. Wong
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Enhance mkfs and xfs_repair to run the crc32c self test when they start
up, and refuse to continue if the self test fails.   We don't want to
format a filesystem if the checksum algorithm produces incorrect
results, and we especially don't want repair to tear a filesystem apart
because it thinks the checksum is wrong.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/crc32cselftest.c      |    2 +-
 libfrog/crc32.c          |    2 +-
 libfrog/crc32cselftest.h |   21 ++++++++++++++-------
 man/man8/mkfs.xfs.8      |    4 ++++
 man/man8/xfs_repair.8    |    4 ++++
 mkfs/xfs_mkfs.c          |    8 ++++++++
 repair/init.c            |    5 +++++
 7 files changed, 37 insertions(+), 9 deletions(-)


diff --git a/io/crc32cselftest.c b/io/crc32cselftest.c
index f8f757f6..49eb5b6d 100644
--- a/io/crc32cselftest.c
+++ b/io/crc32cselftest.c
@@ -16,7 +16,7 @@ crc32cselftest_f(
 	int		argc,
 	char		**argv)
 {
-	return crc32c_test() != 0;
+	return crc32c_test(0) != 0;
 }
 
 static const cmdinfo_t	crc32cselftest_cmd = {
diff --git a/libfrog/crc32.c b/libfrog/crc32.c
index 6a273b71..2499615d 100644
--- a/libfrog/crc32.c
+++ b/libfrog/crc32.c
@@ -202,7 +202,7 @@ int main(int argc, char **argv)
 
 	printf("CRC_LE_BITS = %d\n", CRC_LE_BITS);
 
-	errors = crc32c_test();
+	errors = crc32c_test(0);
 
 	return errors != 0;
 }
diff --git a/libfrog/crc32cselftest.h b/libfrog/crc32cselftest.h
index 08284153..447a7f7d 100644
--- a/libfrog/crc32cselftest.h
+++ b/libfrog/crc32cselftest.h
@@ -661,18 +661,22 @@ static struct crc_test {
 	{0xb18a0319, 0x00000026, 0x000007db, 0x9dc0bb48},
 };
 
+/* Don't print anything to stdout. */
+#define CRC32CTEST_QUIET	(1U << 0)
+
 static int
-crc32c_test(void)
+crc32c_test(
+	unsigned int	flags)
 {
-	int i;
-	int errors = 0;
-	int bytes = 0;
-	struct timeval start, stop;
-	uint64_t usec;
+	int		i;
+	int		errors = 0;
+	int		bytes = 0;
+	struct timeval	start, stop;
+	uint64_t	usec;
 
 	/* keep static to prevent cache warming code from
 	 * getting eliminated by the compiler */
-	static uint32_t crc;
+	static uint32_t	crc;
 
 	/* pre-warm the cache */
 	for (i = 0; i < 100; i++) {
@@ -693,6 +697,9 @@ crc32c_test(void)
 	usec = stop.tv_usec - start.tv_usec +
 		1000000 * (stop.tv_sec - start.tv_sec);
 
+	if (flags & CRC32CTEST_QUIET)
+		return errors;
+
 	if (errors)
 		printf("crc32c: %d self tests failed\n", errors);
 	else {
diff --git a/man/man8/mkfs.xfs.8 b/man/man8/mkfs.xfs.8
index a7f70285..880e949b 100644
--- a/man/man8/mkfs.xfs.8
+++ b/man/man8/mkfs.xfs.8
@@ -121,6 +121,10 @@ If the size of the block or sector is not specified, the default sizes
 .PP
 Many feature options allow an optional argument of 0 or 1, to explicitly
 disable or enable the functionality.
+
+The correctness of the crc32c checksum implementation will be tested
+before formatting the filesystem.
+If the test fails, the format will abort.
 .SH OPTIONS
 Options may be specified either on the command line or in a configuration file.
 Not all command line options can be specified in configuration files; only the
diff --git a/man/man8/xfs_repair.8 b/man/man8/xfs_repair.8
index cc6a2be8..6625b47a 100644
--- a/man/man8/xfs_repair.8
+++ b/man/man8/xfs_repair.8
@@ -184,6 +184,10 @@ usual 0. This option cannot be used together with
 .B \-V
 Prints the version number and exits.
 .SS Checks Performed
+The correctness of the crc32c checksum implementation will be tested
+before examining the filesystem.
+If the test fails, the program will abort.
+
 Inconsistencies corrected include the following:
 .IP 1.
 Inode and inode blockmap (addressing) checks:
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 057b3b09..3a41e17f 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -10,6 +10,7 @@
 #include "libxcmd.h"
 #include "libfrog/fsgeom.h"
 #include "libfrog/convert.h"
+#include "libfrog/crc32cselftest.h"
 #include "proto.h"
 #include <ini.h>
 
@@ -4044,6 +4045,13 @@ main(
 			exit(0);
 	}
 
+	/* Make sure our checksum algorithm really works. */
+	if (crc32c_test(CRC32CTEST_QUIET) != 0) {
+		fprintf(stderr,
+ _("crc32c self-test failed, will not create a filesystem here.\n"));
+		return 1;
+	}
+
 	/*
 	 * All values have been validated, discard the old device layout.
 	 */
diff --git a/repair/init.c b/repair/init.c
index 55f226e9..3a320b4f 100644
--- a/repair/init.c
+++ b/repair/init.c
@@ -14,6 +14,7 @@
 #include "bmap.h"
 #include "incore.h"
 #include "prefetch.h"
+#include "libfrog/crc32cselftest.h"
 #include <sys/resource.h>
 
 static void
@@ -100,4 +101,8 @@ _("Unmount or use the dangerous (-d) option to repair a read-only mounted filesy
 	ts_create();
 	increase_rlimit();
 	pftrace_init();
+
+	if (crc32c_test(CRC32CTEST_QUIET) != 0)
+		do_error(
+ _("crc32c self-test failed, will not examine filesystem.\n"));
 }


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 06/17] libxfs-apply: support filterdiff >= 0.4.2 only
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 05/17] misc: add a crc32c self test to mkfs and repair Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-01-20  0:22 ` [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions Darrick J. Wong
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: Dave Chinner, linux-xfs, allison.henderson

From: Dave Chinner <dchinner@redhat.com>

We currently require filterdiff v0.3.4 as a minimum for handling git
based patches. This was the first version to handle git diff
metadata well enough to do patch reformatting. It was, however, very
buggy and required several workarounds to get it to do what we
needed.

However, these bugs have been fixed and on a machine with v0.4.2,
the workarounds result in libxfs-apply breaking and creating corrupt
patches. Rather than try to carry around workarounds for a broken
filterdiff version and one that just works, just increase the
minimum required version to 0.4.2 and remove all the workarounds for
the bugs in 0.3.4.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tools/libxfs-apply |   42 +++++++++++++++++-------------------------
 1 file changed, 17 insertions(+), 25 deletions(-)


diff --git a/tools/libxfs-apply b/tools/libxfs-apply
index 9271db38..097a695f 100755
--- a/tools/libxfs-apply
+++ b/tools/libxfs-apply
@@ -30,21 +30,22 @@ fail()
 	exit
 }
 
-# filterdiff 0.3.4 is the first version that handles git diff metadata (almost)
-# correctly. It just doesn't work properly in prior versions, so those versions
-# can't be used to extract the commit message prior to the diff. Hence just
-# abort and tell the user to upgrade if an old version is detected. We need to
+# filterdiff didn't start handling git diff metadata correctly until some time
+# after 0.3.4. The handling in 0.3.4 was buggy and broken, requiring working
+# around that bugs to use it. Now that 0.4.2 has fixed all those bugs, the
+# work-arounds for 0.3.4 do not work. Hence set 0.4.2 as the minimum required
+# version and tell the user to upgrade if an old version is detected. We need to
 # check against x.y.z version numbers here.
 _version=`filterdiff --version | cut -d " " -f 5`
 _major=`echo $_version | cut -d "." -f 1`
 _minor=`echo $_version | cut -d "." -f 2`
 _patch=`echo $_version | cut -d "." -f 3`
 if [ $_major -eq 0 ]; then
-	if [ $_minor -lt 3 ]; then
-		fail "filterdiff $_version found. 0.3.4 or greater is required."
+	if [ $_minor -lt 4 ]; then
+		fail "filterdiff $_version found. 0.4.2 or greater is required."
 	fi
-	if [ $_minor -eq 3 -a $_patch -le 3 ]; then
-		fail "filterdiff $_version found. 0.3.4 or greater is required."
+	if [ $_minor -eq 4 -a $_patch -lt 2 ]; then
+		fail "filterdiff $_version found. 0.4.2 or greater is required."
 	fi
 fi
 
@@ -158,8 +159,7 @@ filter_kernel_patch()
 			--addoldprefix=a/fs/xfs/ \
 			--addnewprefix=b/fs/xfs/ \
 			$_patch | \
-		sed -e 's, [ab]\/fs\/xfs\/\(\/dev\/null\), \1,' \
-		    -e '/^diff --git/d'
+		sed -e 's, [ab]\/fs\/xfs\/\(\/dev\/null\), \1,'
 
 
 	rm -f $_libxfs_files
@@ -187,8 +187,7 @@ filter_xfsprogs_patch()
 			--addoldprefix=a/ \
 			--addnewprefix=b/ \
 			$_patch | \
-		sed -e 's, [ab]\/\(\/dev\/null\), \1,' \
-		    -e '/^diff --git/d'
+		sed -e 's, [ab]\/\(\/dev\/null\), \1,'
 
 	rm -f $_libxfs_files
 }
@@ -209,30 +208,23 @@ fixup_header_format()
 	local _diff=`mktemp`
 	local _new_hdr=$_hdr.new
 
-	# there's a bug in filterdiff that leaves a line at the end of the
-	# header in the filtered git show output like:
-	#
-	# difflibxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
-	#
-	# split the header on that (convenient!)
-	sed -e /^difflib/q $_patch > $_hdr
+	# Split the header on the first ^diff --git line (convenient!)
+	sed -e /^diff/q $_patch > $_hdr
 	cat $_patch | awk '
-		BEGIN { difflib_seen = 0; index_seen = 0 }
-		/^difflib/ { difflib_seen++; next }
+		BEGIN { diff_seen = 0; index_seen = 0 }
+		/^diff/ { diff_seen++; next }
 		/^index/ { if (++index_seen == 1) { next } }
-		// { if (difflib_seen) { print $0 } }' > $_diff
+		// { if (diff_seen) { print $0 } }' > $_diff
 
 	# the header now has the format:
 	# commit 0d5a75e9e23ee39cd0d8a167393dcedb4f0f47b2
 	# Author: Eric Sandeen <sandeen@sandeen.net>
 	# Date:   Wed Jun 1 17:38:15 2016 +1000
-	# 
+	#
 	#     xfs: make several functions static
 	#....
 	#     Signed-off-by: Dave Chinner <david@fromorbit.com>
 	#
-	#difflibxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
-	#
 	# We want to format it like a normal patch with a line to say what repo
 	# and commit it was sourced from:
 	#


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (5 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 06/17] libxfs-apply: support filterdiff >= 0.4.2 only Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-02-25 21:45   ` Eric Sandeen
  2022-01-20  0:22 ` [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn Darrick J. Wong
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Use the proper macro to convert ino4 and ino8 field byte sizes to a bit
count in the functions that navigate shortform directories.  This just
happens to work correctly for ino4 entries, but omits the upper 4 bytes
of an ino8 entry.  Note that the entries display correctly; it's just
the command "addr u3.sfdir3.list[X].inumber.i8" that won't.

Found by running smatch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 db/faddr.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)


diff --git a/db/faddr.c b/db/faddr.c
index 81d69c94..0127c5d1 100644
--- a/db/faddr.c
+++ b/db/faddr.c
@@ -353,7 +353,8 @@ fa_ino4(
 	xfs_ino_t	ino;
 
 	ASSERT(next == TYP_INODE);
-	ino = (xfs_ino_t)getbitval(obj, bit, bitsz(XFS_INO32_SIZE), BVUNSIGNED);
+	ino = (xfs_ino_t)getbitval(obj, bit, bitize(XFS_INO32_SIZE),
+			BVUNSIGNED);
 	if (ino == NULLFSINO) {
 		dbprintf(_("null inode number, cannot set new addr\n"));
 		return;
@@ -370,7 +371,8 @@ fa_ino8(
 	xfs_ino_t	ino;
 
 	ASSERT(next == TYP_INODE);
-	ino = (xfs_ino_t)getbitval(obj, bit, bitsz(XFS_INO64_SIZE), BVUNSIGNED);
+	ino = (xfs_ino_t)getbitval(obj, bit, bitize(XFS_INO64_SIZE),
+			BVUNSIGNED);
 	if (ino == NULLFSINO) {
 		dbprintf(_("null inode number, cannot set new addr\n"));
 		return;


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (6 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-02-25 21:46   ` Eric Sandeen
  2022-01-20  0:22 ` [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers " Darrick J. Wong
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Explicitly cast the ondisk dquot counter argument to do_warn when
complaining about incorrect quota counts.  This avoids build warnings on
ppc64le.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/quotacheck.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)


diff --git a/repair/quotacheck.c b/repair/quotacheck.c
index 758160d3..ba87081c 100644
--- a/repair/quotacheck.c
+++ b/repair/quotacheck.c
@@ -306,21 +306,24 @@ qc_check_dquot(
 	if (be64_to_cpu(ddq->d_bcount) != qrec->bcount) {
 		do_warn(_("%s id %u has bcount %llu, expected %"PRIu64"\n"),
 				qflags_typestr(dquots->type), id,
-				be64_to_cpu(ddq->d_bcount), qrec->bcount);
+				(unsigned long long)be64_to_cpu(ddq->d_bcount),
+				qrec->bcount);
 		chkd_flags = 0;
 	}
 
 	if (be64_to_cpu(ddq->d_rtbcount) != qrec->rtbcount) {
 		do_warn(_("%s id %u has rtbcount %llu, expected %"PRIu64"\n"),
 				qflags_typestr(dquots->type), id,
-				be64_to_cpu(ddq->d_rtbcount), qrec->rtbcount);
+				(unsigned long long)be64_to_cpu(ddq->d_rtbcount),
+				qrec->rtbcount);
 		chkd_flags = 0;
 	}
 
 	if (be64_to_cpu(ddq->d_icount) != qrec->icount) {
 		do_warn(_("%s id %u has icount %llu, expected %"PRIu64"\n"),
 				qflags_typestr(dquots->type), id,
-				be64_to_cpu(ddq->d_icount), qrec->icount);
+				(unsigned long long)be64_to_cpu(ddq->d_icount),
+				qrec->icount);
 		chkd_flags = 0;
 	}
 


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers in do_warn
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (7 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-02-25 21:48   ` Eric Sandeen
  2022-01-20  0:22 ` [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem Darrick J. Wong
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Explicitly cast the ondisk directory inode argument to do_warn when
complaining about corrupt directories.  This avoids build warnings on
armv7l.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/dir2.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/repair/dir2.c b/repair/dir2.c
index fdf91532..946e729e 100644
--- a/repair/dir2.c
+++ b/repair/dir2.c
@@ -1358,7 +1358,7 @@ _("can't read block %" PRIu64 " for directory inode %" PRIu64 "\n"),
 		}
 		if (bp->b_error == -EFSCORRUPTED) {
 			do_warn(
-_("corrupt directory data block %lu for inode %" PRIu64 "\n"),
+_("corrupt directory data block %" PRIu64 " for inode %" PRIu64 "\n"),
 				dbno, ino);
 			libxfs_buf_relse(bp);
 			continue;


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (8 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers " Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-02-25 21:53   ` Eric Sandeen
  2022-01-20  0:22 ` [PATCH 11/17] xfs_repair: update secondary superblocks after changing features Darrick J. Wong
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Indentation is supposed to be tabs, not spaces.  Fix that, and unindent
the bwrite clause because do_error aborts the program.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 repair/phase2.c |   37 ++++++++++++++++++-------------------
 1 file changed, 18 insertions(+), 19 deletions(-)


diff --git a/repair/phase2.c b/repair/phase2.c
index bda834de..cfba649a 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -196,29 +196,28 @@ upgrade_filesystem(
 		return;
 
 	mp->m_features |= libxfs_sb_version_to_features(&mp->m_sb);
-        if (no_modify)
-                return;
+	if (no_modify)
+		return;
 
-        bp = libxfs_getsb(mp);
-        if (!bp || bp->b_error) {
-                do_error(
+	bp = libxfs_getsb(mp);
+	if (!bp || bp->b_error)
+		do_error(
 	_("couldn't get superblock for feature upgrade, err=%d\n"),
-                                bp ? bp->b_error : ENOMEM);
-        } else {
-                libxfs_sb_to_disk(bp->b_addr, &mp->m_sb);
+				bp ? bp->b_error : ENOMEM);
 
-                /*
-		 * Write the primary super to disk immediately so that
-		 * needsrepair will be set if repair doesn't complete.
-		 */
-                error = -libxfs_bwrite(bp);
-                if (error)
-                        do_error(
+	libxfs_sb_to_disk(bp->b_addr, &mp->m_sb);
+
+	/*
+	 * Write the primary super to disk immediately so that needsrepair will
+	 * be set if repair doesn't complete.
+	 */
+	error = -libxfs_bwrite(bp);
+	if (error)
+		do_error(
 	_("filesystem feature upgrade failed, err=%d\n"),
-                                        error);
-        }
-        if (bp)
-                libxfs_buf_relse(bp);
+				error);
+
+	libxfs_buf_relse(bp);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 11/17] xfs_repair: update secondary superblocks after changing features
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (9 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-02-25 21:57   ` Eric Sandeen
  2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

When we add features to an existing filesystem, make sure we update the
secondary superblocks to reflect the new geometry so that if we lose the
primary super in the future, repair will recover correctly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    2 ++
 repair/globals.c         |    1 +
 repair/globals.h         |    1 +
 repair/phase2.c          |    1 +
 repair/xfs_repair.c      |   15 +++++++++++++++
 5 files changed, 20 insertions(+)


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index a086fca2..0862d4b0 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -196,6 +196,8 @@
 #define xfs_trans_roll			libxfs_trans_roll
 #define xfs_trim_extent			libxfs_trim_extent
 
+#define xfs_update_secondary_sbs	libxfs_update_secondary_sbs
+
 #define xfs_validate_stripe_geometry	libxfs_validate_stripe_geometry
 #define xfs_verify_agbno		libxfs_verify_agbno
 #define xfs_verify_agino		libxfs_verify_agino
diff --git a/repair/globals.c b/repair/globals.c
index 506a4e72..f8d4f1e4 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -48,6 +48,7 @@ char	*rt_name;		/* Name of realtime device */
 int	rt_spec;		/* Realtime dev specified as option */
 int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 int	lazy_count;		/* What to set if to if converting */
+bool	features_changed;	/* did we change superblock feature bits? */
 bool	add_inobtcount;		/* add inode btree counts to AGI */
 bool	add_bigtime;		/* add support for timestamps up to 2486 */
 
diff --git a/repair/globals.h b/repair/globals.h
index 929b82be..0f98bd2b 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -89,6 +89,7 @@ extern char	*rt_name;		/* Name of realtime device */
 extern int	rt_spec;		/* Realtime dev specified as option */
 extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 extern int	lazy_count;		/* What to set if to if converting */
+extern bool	features_changed;	/* did we change superblock feature bits? */
 extern bool	add_inobtcount;		/* add inode btree counts to AGI */
 extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
 
diff --git a/repair/phase2.c b/repair/phase2.c
index cfba649a..13832701 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -218,6 +218,7 @@ upgrade_filesystem(
 				error);
 
 	libxfs_buf_relse(bp);
+	features_changed = true;
 }
 
 /*
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 4769c130..de8617ba 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -1297,6 +1297,21 @@ _("Note - stripe unit (%d) and width (%d) were copied from a backup superblock.\
 	libxfs_buf_mark_dirty(sbp);
 	libxfs_buf_relse(sbp);
 
+	/*
+	 * If we upgraded V5 filesystem features, we need to update the
+	 * secondary superblocks to include the new feature bits.  Don't set
+	 * NEEDSREPAIR on the secondaries.
+	 */
+	if (features_changed) {
+		mp->m_sb.sb_features_incompat &=
+				~XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+		error = -libxfs_update_secondary_sbs(mp);
+		if (error)
+			do_error(_("upgrading features of secondary supers"));
+		mp->m_sb.sb_features_incompat |=
+				XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	}
+
 	/*
 	 * Done. Flush all cached buffers and inodes first to ensure all
 	 * verifiers are run (where we discover the max metadata LSN), reformat


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 12/17] xfs_scrub: report optional features in version string
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (10 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 11/17] xfs_repair: update secondary superblocks after changing features Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-01-20  1:16   ` Theodore Ts'o
                     ` (2 more replies)
  2022-01-20  0:22 ` [PATCH 13/17] mkfs: prevent corruption of passed-in suboption string values Darrick J. Wong
                   ` (6 subsequent siblings)
  18 siblings, 3 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: Theodore Ts'o, linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Ted T'so reported brittleness in the fstests logic in generic/45[34] to
detect whether or not xfs_scrub is capable of detecting Unicode mischief
in directory and xattr names.  This is a compile-time feature, since we
do not assume that all distros will want to ship xfsprogs with libicu.

Rather than relying on ldd tests (which don't work at all if xfs_scrub
is compiled statically), let's have -V print whether or not the feature
is built into the tool.  Phase 5 still requires the presence of "UTF-8"
in LC_MESSAGES to enable Unicode confusable detection; this merely makes
the feature easier to discover.

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrub.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)


diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index bc2e84a7..45e58727 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -582,6 +582,13 @@ report_outcome(
 	}
 }
 
+/* Compile-time features discoverable via version strings */
+#ifdef HAVE_LIBICU
+# define XFS_SCRUB_HAVE_UNICODE	"+"
+#else
+# define XFS_SCRUB_HAVE_UNICODE	"-"
+#endif
+
 int
 main(
 	int			argc,
@@ -670,8 +677,9 @@ main(
 			verbose = true;
 			break;
 		case 'V':
-			fprintf(stdout, _("%s version %s\n"), progname,
-					VERSION);
+			fprintf(stdout, _("%s version %s %sUnicode\n"),
+					progname, VERSION,
+					XFS_SCRUB_HAVE_UNICODE);
 			fflush(stdout);
 			return SCRUB_RET_SUCCESS;
 		case 'x':


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 13/17] mkfs: prevent corruption of passed-in suboption string values
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (11 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-01-20  0:22 ` [PATCH 14/17] mkfs: add configuration files for the last few LTS kernels Darrick J. Wong
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: Christoph Hellwig, linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Eric and I were trying to play with mkfs.configuration files, when I
spotted this (with the libini package from Ubuntu 20.04):

# cat << EOF > /tmp/r
[data]
su=2097152
sw=1
EOF
# mkfs.xfs -f -c options=/tmp/r /dev/sda
Parameters parsed from config file /tmp/r successfully
-d su option requires a value

It turns out that libini's parser uses stack variables(!) to store the
value of a key=value pair that it parses, and passes this stack array to
the parse_cfgopt function.  If the particular option calls getstr(),
then we save the value of that pointer (not its contents) to the
cli_params.  Being a stack array, the contents will be overwritten by
other function calls, which means that our value of '2097152' has been
destroyed by the time we actually call getnum when we're validating the
new fs config.

We never noticed this until now because the only other caller was
getsubopt on the argv array, which gets chopped up but left intact in
memory.  The solution is to make a private copy of those strings if we
ever save them for later.  For now we'll be lazy and let the memory
leak, since mkfs is not a long-running process.

Fixes: 33c62516 ("mkfs: add initial ini format config file parsing support")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 mkfs/xfs_mkfs.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 3a41e17f..fcad6b55 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1438,12 +1438,21 @@ getstr(
 	struct opt_params	*opts,
 	int			index)
 {
+	char			*ret;
+
 	check_opt(opts, index, true);
 
 	/* empty strings for string options are not valid */
 	if (!str || *str == '\0')
 		reqval(opts->name, opts->subopts, index);
-	return (char *)str;
+
+	ret = strdup(str);
+	if (!ret) {
+		fprintf(stderr, _("Out of memory while saving suboptions.\n"));
+		exit(1);
+	}
+
+	return ret;
 }
 
 static int


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 14/17] mkfs: add configuration files for the last few LTS kernels
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (12 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 13/17] mkfs: prevent corruption of passed-in suboption string values Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-01-20  0:22 ` [PATCH 15/17] mkfs: document sample configuration file location Darrick J. Wong
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: Christoph Hellwig, linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Add some sample mkfs configuration files that capture the mkfs feature
defaults at the time of the release of the last four upstream LTS
kernels.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/builddefs.in |    2 ++
 mkfs/Makefile        |   10 +++++++++-
 mkfs/lts_4.19.conf   |   13 +++++++++++++
 mkfs/lts_5.10.conf   |   13 +++++++++++++
 mkfs/lts_5.15.conf   |   13 +++++++++++++
 mkfs/lts_5.4.conf    |   13 +++++++++++++
 mkfs/xfs_mkfs.c      |    4 ++++
 7 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 mkfs/lts_4.19.conf
 create mode 100644 mkfs/lts_5.10.conf
 create mode 100644 mkfs/lts_5.15.conf
 create mode 100644 mkfs/lts_5.4.conf


diff --git a/include/builddefs.in b/include/builddefs.in
index 9d0b0800..0bb36431 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -56,6 +56,8 @@ DK_INC_DIR	= @includedir@/disk
 PKG_MAN_DIR	= @mandir@
 PKG_DOC_DIR	= @datadir@/doc/@pkg_name@
 PKG_LOCALE_DIR	= @datadir@/locale
+PKG_DATA_DIR	= @datadir@/@pkg_name@
+MKFS_CFG_DIR	= @datadir@/@pkg_name@/mkfs
 
 CC		= @cc@
 BUILD_CC	= @BUILD_CC@
diff --git a/mkfs/Makefile b/mkfs/Makefile
index 9f6a4fad..0aaf9d06 100644
--- a/mkfs/Makefile
+++ b/mkfs/Makefile
@@ -9,19 +9,27 @@ LTCOMMAND = mkfs.xfs
 
 HFILES =
 CFILES = proto.c xfs_mkfs.c
+CFGFILES = \
+	lts_4.19.conf \
+	lts_5.4.conf \
+	lts_5.10.conf \
+	lts_5.15.conf
 
 LLDLIBS += $(LIBXFS) $(LIBXCMD) $(LIBFROG) $(LIBRT) $(LIBBLKID) \
 	$(LIBUUID) $(LIBINIH) $(LIBURCU) $(LIBPTHREAD)
 LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBFROG)
 LLDFLAGS = -static-libtool-libs
 
-default: depend $(LTCOMMAND)
+default: depend $(LTCOMMAND) $(CFGFILES)
 
 include $(BUILDRULES)
 
 install: default
 	$(INSTALL) -m 755 -d $(PKG_ROOT_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_ROOT_SBIN_DIR)
+	$(INSTALL) -m 755 -d $(MKFS_CFG_DIR)
+	$(INSTALL) -m 644 $(CFGFILES) $(MKFS_CFG_DIR)
+
 install-dev:
 
 -include .dep
diff --git a/mkfs/lts_4.19.conf b/mkfs/lts_4.19.conf
new file mode 100644
index 00000000..d21fcb7e
--- /dev/null
+++ b/mkfs/lts_4.19.conf
@@ -0,0 +1,13 @@
+# V5 features that were the mkfs defaults when the upstream Linux 4.19 LTS
+# kernel was released at the end of 2018.
+
+[metadata]
+bigtime=0
+crc=1
+finobt=1
+inobtcount=0
+reflink=0
+rmapbt=0
+
+[inode]
+sparse=1
diff --git a/mkfs/lts_5.10.conf b/mkfs/lts_5.10.conf
new file mode 100644
index 00000000..ac00960e
--- /dev/null
+++ b/mkfs/lts_5.10.conf
@@ -0,0 +1,13 @@
+# V5 features that were the mkfs defaults when the upstream Linux 5.10 LTS
+# kernel was released at the end of 2020.
+
+[metadata]
+bigtime=0
+crc=1
+finobt=1
+inobtcount=0
+reflink=1
+rmapbt=0
+
+[inode]
+sparse=1
diff --git a/mkfs/lts_5.15.conf b/mkfs/lts_5.15.conf
new file mode 100644
index 00000000..32082958
--- /dev/null
+++ b/mkfs/lts_5.15.conf
@@ -0,0 +1,13 @@
+# V5 features that were the mkfs defaults when the upstream Linux 5.15 LTS
+# kernel was released at the end of 2021.
+
+[metadata]
+bigtime=1
+crc=1
+finobt=1
+inobtcount=1
+reflink=1
+rmapbt=0
+
+[inode]
+sparse=1
diff --git a/mkfs/lts_5.4.conf b/mkfs/lts_5.4.conf
new file mode 100644
index 00000000..dd60b9f1
--- /dev/null
+++ b/mkfs/lts_5.4.conf
@@ -0,0 +1,13 @@
+# V5 features that were the mkfs defaults when the upstream Linux 5.4 LTS
+# kernel was released at the end of 2019.
+
+[metadata]
+bigtime=0
+crc=1
+finobt=1
+inobtcount=0
+reflink=1
+rmapbt=0
+
+[inode]
+sparse=1
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index fcad6b55..af536a8a 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3875,6 +3875,10 @@ main(
 			.nodalign = false,
 			.nortalign = false,
 			.bigtime = false,
+			/*
+			 * When we decide to enable a new feature by default,
+			 * please remember to update the mkfs conf files.
+			 */
 		},
 	};
 


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 15/17] mkfs: document sample configuration file location
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (13 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 14/17] mkfs: add configuration files for the last few LTS kernels Darrick J. Wong
@ 2022-01-20  0:22 ` Darrick J. Wong
  2022-01-20  0:23 ` [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems Darrick J. Wong
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:22 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: Christoph Hellwig, linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Update the documentation to note where one can find sample configuration
files.  While we're at it, add -c to the topmost list of mkfs.xfs
options.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 man/man8/Makefile      |    7 +++++++
 man/man8/mkfs.xfs.8.in |    4 ++++
 2 files changed, 11 insertions(+)
 rename man/man8/{mkfs.xfs.8 => mkfs.xfs.8.in} (99%)


diff --git a/man/man8/Makefile b/man/man8/Makefile
index e6a55729..272e45ae 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -12,8 +12,10 @@ ifneq ("$(ENABLE_SCRUB)","yes")
 else
   MAN_PAGES = $(shell echo *.$(MAN_SECTION))
 endif
+MAN_PAGES	+= mkfs.xfs.8
 MAN_DEST	= $(PKG_MAN_DIR)/man$(MAN_SECTION)
 LSRCFILES	= $(MAN_PAGES)
+DIRT		= mkfs.xfs.8
 
 default : $(MAN_PAGES)
 
@@ -22,4 +24,9 @@ include $(BUILDRULES)
 install : default
 	$(INSTALL) -m 755 -d $(MAN_DEST)
 	$(INSTALL_MAN)
+
+mkfs.xfs.8: mkfs.xfs.8.in
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e 's|@mkfs_cfg_dir@|$(MKFS_CFG_DIR)|g' < $^ > $@
+
 install-dev :
diff --git a/man/man8/mkfs.xfs.8 b/man/man8/mkfs.xfs.8.in
similarity index 99%
rename from man/man8/mkfs.xfs.8
rename to man/man8/mkfs.xfs.8.in
index 880e949b..a3526753 100644
--- a/man/man8/mkfs.xfs.8
+++ b/man/man8/mkfs.xfs.8.in
@@ -7,6 +7,9 @@ mkfs.xfs \- construct an XFS filesystem
 .B \-b
 .I block_size_options
 ] [
+.B \-c
+.I config_file_options
+] [
 .B \-m
 .I global_metadata_options
 ] [
@@ -159,6 +162,7 @@ The configuration options will be sourced from the file specified by the
 option string.
 This option can be use either an absolute or relative path to the configuration
 file to be read.
+Sample configuration files can be found in @mkfs_cfg_dir@.
 .RE
 .PP
 .PD 0


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (14 preceding siblings ...)
  2022-01-20  0:22 ` [PATCH 15/17] mkfs: document sample configuration file location Darrick J. Wong
@ 2022-01-20  0:23 ` Darrick J. Wong
  2022-02-25 22:21   ` Eric Sandeen
  2022-02-26  2:52   ` [PATCH v2 " Darrick J. Wong
  2022-01-20  0:23 ` [PATCH 17/17] mkfs: enable inobtcount and bigtime by default Darrick J. Wong
                   ` (2 subsequent siblings)
  18 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:23 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

We have a handful of users who continually ping the maintainer with
questions about why pmem and dax don't work quite the way they want
(which is to say 2MB extents and PMD mappings) because they copy-pasted
some garbage from Google that's wrong.  Encode the correct defaults into
a mkfs config file and ship that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 mkfs/Makefile        |    1 +
 mkfs/dax_x86_64.conf |   19 +++++++++++++++++++
 2 files changed, 20 insertions(+)
 create mode 100644 mkfs/dax_x86_64.conf


diff --git a/mkfs/Makefile b/mkfs/Makefile
index 0aaf9d06..55d9362f 100644
--- a/mkfs/Makefile
+++ b/mkfs/Makefile
@@ -10,6 +10,7 @@ LTCOMMAND = mkfs.xfs
 HFILES =
 CFILES = proto.c xfs_mkfs.c
 CFGFILES = \
+	dax_x86_64.conf \
 	lts_4.19.conf \
 	lts_5.4.conf \
 	lts_5.10.conf \
diff --git a/mkfs/dax_x86_64.conf b/mkfs/dax_x86_64.conf
new file mode 100644
index 00000000..bc3f3c9a
--- /dev/null
+++ b/mkfs/dax_x86_64.conf
@@ -0,0 +1,19 @@
+# mkfs.xfs configuration file for persistent memory on x86_64.
+# Block size must match page size (4K) and we require V5 for the DAX inode
+# flag.  Set extent size hints and stripe units to encourage the filesystem to
+# allocate PMD sized (2MB) blocks.
+
+[block]
+size=4096
+
+[metadata]
+crc=1
+
+[data]
+sunit=4096
+swidth=4096
+extszinherit=512
+daxinherit=1
+
+[realtime]
+extsize=2097152


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 17/17] mkfs: enable inobtcount and bigtime by default
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (15 preceding siblings ...)
  2022-01-20  0:23 ` [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems Darrick J. Wong
@ 2022-01-20  0:23 ` Darrick J. Wong
  2022-02-25 22:22   ` Eric Sandeen
  2022-01-28 22:44 ` [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices Darrick J. Wong
  2022-02-26  2:54 ` [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems Darrick J. Wong
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  0:23 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Enable the inode btree counters and large timestamp features by default.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 mkfs/xfs_mkfs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index af536a8a..96682f9a 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3870,11 +3870,11 @@ main(
 			.spinodes = true,
 			.rmapbt = false,
 			.reflink = true,
-			.inobtcnt = false,
+			.inobtcnt = true,
 			.parent_pointers = false,
 			.nodalign = false,
 			.nortalign = false,
-			.bigtime = false,
+			.bigtime = true,
 			/*
 			 * When we decide to enable a new feature by default,
 			 * please remember to update the mkfs conf files.


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/17] xfs_scrub: report optional features in version string
  2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
@ 2022-01-20  1:16   ` Theodore Ts'o
  2022-01-20  1:28     ` Darrick J. Wong
  2022-01-20  1:32   ` [PATCH v2 " Darrick J. Wong
  2022-02-26  2:53   ` [PATCH v3 " Darrick J. Wong
  2 siblings, 1 reply; 60+ messages in thread
From: Theodore Ts'o @ 2022-01-20  1:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: sandeen, linux-xfs, allison.henderson

On Wed, Jan 19, 2022 at 04:22:40PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Ted T'so reported brittleness in the fstests logic in generic/45[34] to

Not a super big deal, but my last name is "Ts'o".

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/17] xfs_scrub: report optional features in version string
  2022-01-20  1:16   ` Theodore Ts'o
@ 2022-01-20  1:28     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  1:28 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: sandeen, linux-xfs, allison.henderson

On Wed, Jan 19, 2022 at 08:16:56PM -0500, Theodore Ts'o wrote:
> On Wed, Jan 19, 2022 at 04:22:40PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Ted T'so reported brittleness in the fstests logic in generic/45[34] to
> 
> Not a super big deal, but my last name is "Ts'o".

D'oh.  Sorry about that, will fix.

--D

> Thanks,
> 
> 						- Ted

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 12/17] xfs_scrub: report optional features in version string
  2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
  2022-01-20  1:16   ` Theodore Ts'o
@ 2022-01-20  1:32   ` Darrick J. Wong
  2022-02-25 22:14     ` Eric Sandeen
  2022-02-26  2:53   ` [PATCH v3 " Darrick J. Wong
  2 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-20  1:32 UTC (permalink / raw)
  To: sandeen; +Cc: Theodore Ts'o, linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Ted Ts'o reported brittleness in the fstests logic in generic/45[34] to
detect whether or not xfs_scrub is capable of detecting Unicode mischief
in directory and xattr names.  This is a compile-time feature, since we
do not assume that all distros will want to ship xfsprogs with libicu.

Rather than relying on ldd tests (which don't work at all if xfs_scrub
is compiled statically), let's have -V print whether or not the feature
is built into the tool.  Phase 5 still requires the presence of "UTF-8"
in LC_MESSAGES to enable Unicode confusable detection; this merely makes
the feature easier to discover.

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v2: correct the name of the reporter
---
 scrub/xfs_scrub.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index bc2e84a7..45e58727 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -582,6 +582,13 @@ report_outcome(
 	}
 }
 
+/* Compile-time features discoverable via version strings */
+#ifdef HAVE_LIBICU
+# define XFS_SCRUB_HAVE_UNICODE	"+"
+#else
+# define XFS_SCRUB_HAVE_UNICODE	"-"
+#endif
+
 int
 main(
 	int			argc,
@@ -670,8 +677,9 @@ main(
 			verbose = true;
 			break;
 		case 'V':
-			fprintf(stdout, _("%s version %s\n"), progname,
-					VERSION);
+			fprintf(stdout, _("%s version %s %sUnicode\n"),
+					progname, VERSION,
+					XFS_SCRUB_HAVE_UNICODE);
 			fflush(stdout);
 			return SCRUB_RET_SUCCESS;
 		case 'x':

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (16 preceding siblings ...)
  2022-01-20  0:23 ` [PATCH 17/17] mkfs: enable inobtcount and bigtime by default Darrick J. Wong
@ 2022-01-28 22:44 ` Darrick J. Wong
  2022-01-31 12:28   ` Christoph Hellwig
  2022-02-26  2:54 ` [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems Darrick J. Wong
  18 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-01-28 22:44 UTC (permalink / raw)
  To: sandeen
  Cc: Christoph Hellwig, Dave Chinner, Theodore Ts'o, linux-xfs,
	allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

The error checking logic for opening the data, log, and rt device is
totally broken.  Fix this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase1.c |   20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/scrub/phase1.c b/scrub/phase1.c
index 4f028249..fd1050c9 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -170,9 +170,9 @@ _("Unable to find realtime device path."));
 
 	/* Open the raw devices. */
 	ctx->datadev = disk_open(ctx->fsinfo.fs_name);
-	if (error) {
-		str_errno(ctx, ctx->fsinfo.fs_name);
-		return error;
+	if (!ctx->datadev) {
+		str_error(ctx, ctx->mntpoint, _("Unable to open data device."));
+		return ECANCELED;
 	}
 
 	ctx->nr_io_threads = disk_heads(ctx->datadev);
@@ -184,16 +184,18 @@ _("Unable to find realtime device path."));
 
 	if (ctx->fsinfo.fs_log) {
 		ctx->logdev = disk_open(ctx->fsinfo.fs_log);
-		if (error) {
-			str_errno(ctx, ctx->fsinfo.fs_name);
-			return error;
+		if (!ctx->logdev) {
+			str_error(ctx, ctx->mntpoint,
+				_("Unable to open external log device."));
+			return ECANCELED;
 		}
 	}
 	if (ctx->fsinfo.fs_rt) {
 		ctx->rtdev = disk_open(ctx->fsinfo.fs_rt);
-		if (error) {
-			str_errno(ctx, ctx->fsinfo.fs_name);
-			return error;
+		if (!ctx->rtdev) {
+			str_error(ctx, ctx->mntpoint,
+				_("Unable to open realtime device."));
+			return ECANCELED;
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices
  2022-01-28 22:44 ` [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices Darrick J. Wong
@ 2022-01-31 12:28   ` Christoph Hellwig
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Hellwig @ 2022-01-31 12:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sandeen, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	linux-xfs, allison.henderson

On Fri, Jan 28, 2022 at 02:44:10PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The error checking logic for opening the data, log, and rt device is
> totally broken.  Fix this.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items
  2022-01-20  0:21 ` [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
@ 2022-02-04 21:36   ` Eric Sandeen
  2022-02-04 21:47     ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Sandeen @ 2022-02-04 21:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:21 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While debugging some very strange rmap corruption reports in connection
> with the online directory repair code.  I root-caused the error to the
> following incorrect sequence:
> 
> <start repair transaction>
> <expand directory, causing a deferred rmap to be queued>
> <roll transaction>
> <cancel transaction>
> 
> Obviously, we should have committed the transaction instead of
> cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
> have warned us that we were throwing away work item that we already
> committed to performing.  This is not correct, and we need to shut down
> the filesystem.
> 
> Change xfs_trans_cancel to complain in the loudest manner if we're
> cancelling any transaction with deferred work items attached.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

So this is basically:

Source kernel commit: 47a6df7cd3174b91c6c862eae0b8d4e13591df52

plus the actual shutting down / aborting part 

Seems ok; did you run into this in practice, in userspace?

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

> ---
>  libxfs/trans.c |   19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/libxfs/trans.c b/libxfs/trans.c
> index fd2e6f9d..8c16cb8d 100644
> --- a/libxfs/trans.c
> +++ b/libxfs/trans.c
> @@ -318,13 +318,30 @@ void
>  libxfs_trans_cancel(
>  	struct xfs_trans	*tp)
>  {
> +	bool			dirty;
> +
>  	trace_xfs_trans_cancel(tp, _RET_IP_);
>  
>  	if (tp == NULL)
>  		return;
> +	dirty = (tp->t_flags & XFS_TRANS_DIRTY);
>  
> -	if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
> +	/*
> +	 * It's never valid to cancel a transaction with deferred ops attached,
> +	 * because the transaction is effectively dirty.  Complain about this
> +	 * loudly before freeing the in-memory defer items.
> +	 */
> +	if (!list_empty(&tp->t_dfops)) {
> +		ASSERT(list_empty(&tp->t_dfops));
> +		ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> +		dirty = true;
>  		xfs_defer_cancel(tp);
> +	}
> +
> +	if (dirty) {
> +		fprintf(stderr, _("Cancelling dirty transaction!\n"));
> +		abort();
> +	}
>  
>  	xfs_trans_free_items(tp);
>  	xfs_trans_free(tp);
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items
  2022-02-04 21:36   ` Eric Sandeen
@ 2022-02-04 21:47     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-04 21:47 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs, allison.henderson

On Fri, Feb 04, 2022 at 03:36:18PM -0600, Eric Sandeen wrote:
> On 1/19/22 6:21 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > While debugging some very strange rmap corruption reports in connection
> > with the online directory repair code.  I root-caused the error to the
> > following incorrect sequence:
> > 
> > <start repair transaction>
> > <expand directory, causing a deferred rmap to be queued>
> > <roll transaction>
> > <cancel transaction>
> > 
> > Obviously, we should have committed the transaction instead of
> > cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
> > have warned us that we were throwing away work item that we already
> > committed to performing.  This is not correct, and we need to shut down
> > the filesystem.
> > 
> > Change xfs_trans_cancel to complain in the loudest manner if we're
> > cancelling any transaction with deferred work items attached.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
> So this is basically:
> 
> Source kernel commit: 47a6df7cd3174b91c6c862eae0b8d4e13591df52
> 
> plus the actual shutting down / aborting part 
> 
> Seems ok; did you run into this in practice, in userspace?

Nope, I was merely porting it to userspace and decided the source was
different enough not to bother trying to libxfs-apply it.

--D

> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> 
> > ---
> >  libxfs/trans.c |   19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/libxfs/trans.c b/libxfs/trans.c
> > index fd2e6f9d..8c16cb8d 100644
> > --- a/libxfs/trans.c
> > +++ b/libxfs/trans.c
> > @@ -318,13 +318,30 @@ void
> >  libxfs_trans_cancel(
> >  	struct xfs_trans	*tp)
> >  {
> > +	bool			dirty;
> > +
> >  	trace_xfs_trans_cancel(tp, _RET_IP_);
> >  
> >  	if (tp == NULL)
> >  		return;
> > +	dirty = (tp->t_flags & XFS_TRANS_DIRTY);
> >  
> > -	if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
> > +	/*
> > +	 * It's never valid to cancel a transaction with deferred ops attached,
> > +	 * because the transaction is effectively dirty.  Complain about this
> > +	 * loudly before freeing the in-memory defer items.
> > +	 */
> > +	if (!list_empty(&tp->t_dfops)) {
> > +		ASSERT(list_empty(&tp->t_dfops));
> > +		ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > +		dirty = true;
> >  		xfs_defer_cancel(tp);
> > +	}
> > +
> > +	if (dirty) {
> > +		fprintf(stderr, _("Cancelling dirty transaction!\n"));
> > +		abort();
> > +	}
> >  
> >  	xfs_trans_free_items(tp);
> >  	xfs_trans_free(tp);
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf
  2022-01-20  0:21 ` [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf Darrick J. Wong
@ 2022-02-04 22:05   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-04 22:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:21 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When we're preparing to move a list of xfs_buf(fers) to the freelist, be
> sure to detach the perag reference so that we don't leak the reference
> or leave dangling pointers.  Currently this has no negative effects
> since we only call libxfs_bulkrelse while exiting programs, but let's
> not be sloppy.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Seems fine

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

> ---
>  libxfs/rdwr.c |   23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> 
> diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
> index 2a9e8c98..b43527e4 100644
> --- a/libxfs/rdwr.c
> +++ b/libxfs/rdwr.c
> @@ -887,11 +887,19 @@ libxfs_buf_mark_dirty(
>  	bp->b_flags |= LIBXFS_B_DIRTY;
>  }
>  
> -/* Complain about (and remember) dropping dirty buffers. */
> -static void
> -libxfs_whine_dirty_buf(
> +/* Prepare a buffer to be sent to the MRU list. */
> +static inline void
> +libxfs_buf_prepare_mru(
>  	struct xfs_buf		*bp)
>  {
> +	if (bp->b_pag)
> +		xfs_perag_put(bp->b_pag);
> +	bp->b_pag = NULL;
> +
> +	if (!(bp->b_flags & LIBXFS_B_DIRTY))
> +		return;
> +
> +	/* Complain about (and remember) dropping dirty buffers. */
>  	fprintf(stderr, _("%s: Releasing dirty buffer to free list!\n"),
>  			progname);
>  
> @@ -909,11 +917,7 @@ libxfs_brelse(
>  
>  	if (!bp)
>  		return;
> -	if (bp->b_flags & LIBXFS_B_DIRTY)
> -		libxfs_whine_dirty_buf(bp);
> -	if (bp->b_pag)
> -		xfs_perag_put(bp->b_pag);
> -	bp->b_pag = NULL;
> +	libxfs_buf_prepare_mru(bp);
>  
>  	pthread_mutex_lock(&xfs_buf_freelist.cm_mutex);
>  	list_add(&bp->b_node.cn_mru, &xfs_buf_freelist.cm_list);
> @@ -932,8 +936,7 @@ libxfs_bulkrelse(
>  		return 0 ;
>  
>  	list_for_each_entry(bp, list, b_node.cn_mru) {
> -		if (bp->b_flags & LIBXFS_B_DIRTY)
> -			libxfs_whine_dirty_buf(bp);
> +		libxfs_buf_prepare_mru(bp);
>  		count++;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog
  2022-01-20  0:21 ` [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog Darrick J. Wong
@ 2022-02-04 23:18   ` Eric Sandeen
  2022-02-05  0:36     ` Darrick J. Wong
  2022-02-08 16:46   ` [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions Darrick J. Wong
  1 sibling, 1 reply; 60+ messages in thread
From: Eric Sandeen @ 2022-02-04 23:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:21 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Move our private copy of the GETFSMAP definition into a libfrog header
> file that (preferentially) uses the system header files.  We have no
> business shipping kernel headers in the xfslibs package, but this shim
> is still needed to build fully functional xfsprogs on old userspace.

Hm. Fine, but I wonder if we can get a bit more intentional about how
we handle this kind of thing, I understand why we copy this stuff into
xfsprogs early, but then we never know how to get rid of it.

Do we /need/ to build fully functional xfsprogs on old userspace?
(really: systems with old kernel headers?)  How far back do we go,
I wonder?  Anyway...

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

> ---
>  include/linux.h   |  105 ------------------------------------------------
>  io/fsmap.c        |    1 
>  io/io.h           |    5 --
>  libfrog/Makefile  |    1 
>  libfrog/fsmap.h   |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  scrub/phase6.c    |    1 
>  scrub/phase7.c    |    1 
>  scrub/spacemap.c  |    1 
>  spaceman/freesp.c |    1 
>  spaceman/space.h  |    4 --
>  10 files changed, 123 insertions(+), 114 deletions(-)
>  create mode 100644 libfrog/fsmap.h
> 
> 
> diff --git a/include/linux.h b/include/linux.h
> index de8a7122..3d9f4e3d 100644
> --- a/include/linux.h
> +++ b/include/linux.h
> @@ -251,111 +251,6 @@ struct fsxattr {
>  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
>  #endif
>  
> -#ifdef HAVE_GETFSMAP
> -# include <linux/fsmap.h>
> -#else
> -/*
> - *	Structure for FS_IOC_GETFSMAP.
> - *
> - *	The memory layout for this call are the scalar values defined in
> - *	struct fsmap_head, followed by two struct fsmap that describe
> - *	the lower and upper bound of mappings to return, followed by an
> - *	array of struct fsmap mappings.
> - *
> - *	fmh_iflags control the output of the call, whereas fmh_oflags report
> - *	on the overall record output.  fmh_count should be set to the
> - *	length of the fmh_recs array, and fmh_entries will be set to the
> - *	number of entries filled out during each call.  If fmh_count is
> - *	zero, the number of reverse mappings will be returned in
> - *	fmh_entries, though no mappings will be returned.  fmh_reserved
> - *	must be set to zero.
> - *
> - *	The two elements in the fmh_keys array are used to constrain the
> - *	output.  The first element in the array should represent the
> - *	lowest disk mapping ("low key") that the user wants to learn
> - *	about.  If this value is all zeroes, the filesystem will return
> - *	the first entry it knows about.  For a subsequent call, the
> - *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
> - *	copied into fmh_keys[0] to have the kernel start where it left off.
> - *
> - *	The second element in the fmh_keys array should represent the
> - *	highest disk mapping ("high key") that the user wants to learn
> - *	about.  If this value is all ones, the filesystem will not stop
> - *	until it runs out of mapping to return or runs out of space in
> - *	fmh_recs.
> - *
> - *	fmr_device can be either a 32-bit cookie representing a device, or
> - *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
> - *	fmr_offset, and fmr_length are expressed in units of bytes.
> - *	fmr_owner is either an inode number, or a special value if
> - *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
> - */
> -struct fsmap {
> -	__u32		fmr_device;	/* device id */
> -	__u32		fmr_flags;	/* mapping flags */
> -	__u64		fmr_physical;	/* device offset of segment */
> -	__u64		fmr_owner;	/* owner id */
> -	__u64		fmr_offset;	/* file offset of segment */
> -	__u64		fmr_length;	/* length of segment */
> -	__u64		fmr_reserved[3];	/* must be zero */
> -};
> -
> -struct fsmap_head {
> -	__u32		fmh_iflags;	/* control flags */
> -	__u32		fmh_oflags;	/* output flags */
> -	__u32		fmh_count;	/* # of entries in array incl. input */
> -	__u32		fmh_entries;	/* # of entries filled in (output). */
> -	__u64		fmh_reserved[6];	/* must be zero */
> -
> -	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
> -	struct fsmap	fmh_recs[];	/* returned records */
> -};
> -
> -/* Size of an fsmap_head with room for nr records. */
> -static inline size_t
> -fsmap_sizeof(
> -	unsigned int	nr)
> -{
> -	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
> -}
> -
> -/* Start the next fsmap query at the end of the current query results. */
> -static inline void
> -fsmap_advance(
> -	struct fsmap_head	*head)
> -{
> -	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
> -}
> -
> -/*	fmh_iflags values - set by XFS_IOC_GETFSMAP caller in the header. */
> -/* no flags defined yet */
> -#define FMH_IF_VALID		0
> -
> -/*	fmh_oflags values - returned in the header segment only. */
> -#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
> -
> -/*	fmr_flags values - returned for each non-header segment */
> -#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
> -#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
> -#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
> -#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
> -#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
> -#define FMR_OF_LAST		0x20	/* segment is the last in the FS */
> -
> -/* Each FS gets to define its own special owner codes. */
> -#define FMR_OWNER(type, code)	(((__u64)type << 32) | \
> -				 ((__u64)code & 0xFFFFFFFFULL))
> -#define FMR_OWNER_TYPE(owner)	((__u32)((__u64)owner >> 32))
> -#define FMR_OWNER_CODE(owner)	((__u32)(((__u64)owner & 0xFFFFFFFFULL)))
> -#define FMR_OWN_FREE		FMR_OWNER(0, 1) /* free space */
> -#define FMR_OWN_UNKNOWN		FMR_OWNER(0, 2) /* unknown owner */
> -#define FMR_OWN_METADATA	FMR_OWNER(0, 3) /* metadata */
> -
> -#define FS_IOC_GETFSMAP		_IOWR('X', 59, struct fsmap_head)
> -
> -#define HAVE_GETFSMAP
> -#endif /* HAVE_GETFSMAP */
> -
>  #ifndef HAVE_MAP_SYNC
>  #define MAP_SYNC 0
>  #define MAP_SHARED_VALIDATE 0
> diff --git a/io/fsmap.c b/io/fsmap.c
> index f540a7c0..9ff36bf4 100644
> --- a/io/fsmap.c
> +++ b/io/fsmap.c
> @@ -10,6 +10,7 @@
>  #include "io.h"
>  #include "input.h"
>  #include "libfrog/fsgeom.h"
> +#include "libfrog/fsmap.h"
>  
>  static cmdinfo_t	fsmap_cmd;
>  static dev_t		xfs_data_dev;
> diff --git a/io/io.h b/io/io.h
> index 49db902f..39fb5878 100644
> --- a/io/io.h
> +++ b/io/io.h
> @@ -167,12 +167,7 @@ extern void		readdir_init(void);
>  extern void		reflink_init(void);
>  
>  extern void		cowextsize_init(void);
> -
> -#ifdef HAVE_GETFSMAP
>  extern void		fsmap_init(void);
> -#else
> -# define fsmap_init()	do { } while (0)
> -#endif
>  
>  #ifdef HAVE_DEVMAPPER
>  extern void		log_writes_init(void);
> diff --git a/libfrog/Makefile b/libfrog/Makefile
> index 01107082..d6044455 100644
> --- a/libfrog/Makefile
> +++ b/libfrog/Makefile
> @@ -40,6 +40,7 @@ crc32cselftest.h \
>  crc32defs.h \
>  crc32table.h \
>  fsgeom.h \
> +fsmap.h \
>  logging.h \
>  paths.h \
>  projects.h \
> diff --git a/libfrog/fsmap.h b/libfrog/fsmap.h
> new file mode 100644
> index 00000000..dc290962
> --- /dev/null
> +++ b/libfrog/fsmap.h
> @@ -0,0 +1,117 @@
> +#ifdef HAVE_GETFSMAP
> +# include <linux/fsmap.h>
> +#endif
> +
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * FS_IOC_GETFSMAP ioctl infrastructure.
> + *
> + * Copyright (C) 2017 Oracle.  All Rights Reserved.
> + *
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#ifndef _LINUX_FSMAP_H
> +#define _LINUX_FSMAP_H
> +
> +#include <linux/types.h>
> +
> +/*
> + *	Structure for FS_IOC_GETFSMAP.
> + *
> + *	The memory layout for this call are the scalar values defined in
> + *	struct fsmap_head, followed by two struct fsmap that describe
> + *	the lower and upper bound of mappings to return, followed by an
> + *	array of struct fsmap mappings.
> + *
> + *	fmh_iflags control the output of the call, whereas fmh_oflags report
> + *	on the overall record output.  fmh_count should be set to the
> + *	length of the fmh_recs array, and fmh_entries will be set to the
> + *	number of entries filled out during each call.  If fmh_count is
> + *	zero, the number of reverse mappings will be returned in
> + *	fmh_entries, though no mappings will be returned.  fmh_reserved
> + *	must be set to zero.
> + *
> + *	The two elements in the fmh_keys array are used to constrain the
> + *	output.  The first element in the array should represent the
> + *	lowest disk mapping ("low key") that the user wants to learn
> + *	about.  If this value is all zeroes, the filesystem will return
> + *	the first entry it knows about.  For a subsequent call, the
> + *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
> + *	copied into fmh_keys[0] to have the kernel start where it left off.
> + *
> + *	The second element in the fmh_keys array should represent the
> + *	highest disk mapping ("high key") that the user wants to learn
> + *	about.  If this value is all ones, the filesystem will not stop
> + *	until it runs out of mapping to return or runs out of space in
> + *	fmh_recs.
> + *
> + *	fmr_device can be either a 32-bit cookie representing a device, or
> + *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
> + *	fmr_offset, and fmr_length are expressed in units of bytes.
> + *	fmr_owner is either an inode number, or a special value if
> + *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
> + */
> +struct fsmap {
> +	__u32		fmr_device;	/* device id */
> +	__u32		fmr_flags;	/* mapping flags */
> +	__u64		fmr_physical;	/* device offset of segment */
> +	__u64		fmr_owner;	/* owner id */
> +	__u64		fmr_offset;	/* file offset of segment */
> +	__u64		fmr_length;	/* length of segment */
> +	__u64		fmr_reserved[3];	/* must be zero */
> +};
> +
> +struct fsmap_head {
> +	__u32		fmh_iflags;	/* control flags */
> +	__u32		fmh_oflags;	/* output flags */
> +	__u32		fmh_count;	/* # of entries in array incl. input */
> +	__u32		fmh_entries;	/* # of entries filled in (output). */
> +	__u64		fmh_reserved[6];	/* must be zero */
> +
> +	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
> +	struct fsmap	fmh_recs[];	/* returned records */
> +};
> +
> +/* Size of an fsmap_head with room for nr records. */
> +static __inline__ size_t
> +fsmap_sizeof(
> +	unsigned int	nr)
> +{
> +	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
> +}
> +
> +/* Start the next fsmap query at the end of the current query results. */
> +static __inline__ void
> +fsmap_advance(
> +	struct fsmap_head	*head)
> +{
> +	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
> +}
> +
> +/*	fmh_iflags values - set by FS_IOC_GETFSMAP caller in the header. */
> +/* no flags defined yet */
> +#define FMH_IF_VALID		0
> +
> +/*	fmh_oflags values - returned in the header segment only. */
> +#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
> +
> +/*	fmr_flags values - returned for each non-header segment */
> +#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
> +#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
> +#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
> +#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
> +#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
> +#define FMR_OF_LAST		0x20	/* segment is the last in the dataset */
> +
> +/* Each FS gets to define its own special owner codes. */
> +#define FMR_OWNER(type, code)	(((__u64)type << 32) | \
> +				 ((__u64)code & 0xFFFFFFFFULL))
> +#define FMR_OWNER_TYPE(owner)	((__u32)((__u64)owner >> 32))
> +#define FMR_OWNER_CODE(owner)	((__u32)(((__u64)owner & 0xFFFFFFFFULL)))
> +#define FMR_OWN_FREE		FMR_OWNER(0, 1) /* free space */
> +#define FMR_OWN_UNKNOWN		FMR_OWNER(0, 2) /* unknown owner */
> +#define FMR_OWN_METADATA	FMR_OWNER(0, 3) /* metadata */
> +
> +#define FS_IOC_GETFSMAP		_IOWR('X', 59, struct fsmap_head)
> +
> +#endif /* _LINUX_FSMAP_H */
> diff --git a/scrub/phase6.c b/scrub/phase6.c
> index 87828b60..dd42b66c 100644
> --- a/scrub/phase6.c
> +++ b/scrub/phase6.c
> @@ -10,6 +10,7 @@
>  #include "handle.h"
>  #include "libfrog/paths.h"
>  #include "libfrog/workqueue.h"
> +#include "libfrog/fsmap.h"
>  #include "xfs_scrub.h"
>  #include "common.h"
>  #include "libfrog/bitmap.h"
> diff --git a/scrub/phase7.c b/scrub/phase7.c
> index bc652ab6..e24906d1 100644
> --- a/scrub/phase7.c
> +++ b/scrub/phase7.c
> @@ -9,6 +9,7 @@
>  #include <sys/statvfs.h>
>  #include "libfrog/paths.h"
>  #include "libfrog/ptvar.h"
> +#include "libfrog/fsmap.h"
>  #include "list.h"
>  #include "xfs_scrub.h"
>  #include "common.h"
> diff --git a/scrub/spacemap.c b/scrub/spacemap.c
> index a5508d56..b7f17e57 100644
> --- a/scrub/spacemap.c
> +++ b/scrub/spacemap.c
> @@ -10,6 +10,7 @@
>  #include <sys/statvfs.h>
>  #include "libfrog/workqueue.h"
>  #include "libfrog/paths.h"
> +#include "libfrog/fsmap.h"
>  #include "xfs_scrub.h"
>  #include "common.h"
>  #include "spacemap.h"
> diff --git a/spaceman/freesp.c b/spaceman/freesp.c
> index de301c19..4e46ab26 100644
> --- a/spaceman/freesp.c
> +++ b/spaceman/freesp.c
> @@ -9,6 +9,7 @@
>  #include "libxfs.h"
>  #include <linux/fiemap.h>
>  #include "libfrog/fsgeom.h"
> +#include "libfrog/fsmap.h"
>  #include "command.h"
>  #include "init.h"
>  #include "libfrog/paths.h"
> diff --git a/spaceman/space.h b/spaceman/space.h
> index 723209ed..a8055535 100644
> --- a/spaceman/space.h
> +++ b/spaceman/space.h
> @@ -26,11 +26,7 @@ extern void	help_init(void);
>  extern void	prealloc_init(void);
>  extern void	quit_init(void);
>  extern void	trim_init(void);
> -#ifdef HAVE_GETFSMAP
>  extern void	freesp_init(void);
> -#else
> -# define freesp_init()	do { } while (0)
> -#endif
>  extern void	info_init(void);
>  extern void	health_init(void);
>  
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/17] misc: add a crc32c self test to mkfs and repair
  2022-01-20  0:22 ` [PATCH 05/17] misc: add a crc32c self test to mkfs and repair Darrick J. Wong
@ 2022-02-04 23:23   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-04 23:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:22 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Enhance mkfs and xfs_repair to run the crc32c self test when they start
> up, and refuse to continue if the self test fails.   We don't want to
> format a filesystem if the checksum algorithm produces incorrect
> results, and we especially don't want repair to tear a filesystem apart
> because it thinks the checksum is wrong.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>


Good idea.

Reviewed-by: Eric Sandeen <sandeen@redhat.com>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog
  2022-02-04 23:18   ` Eric Sandeen
@ 2022-02-05  0:36     ` Darrick J. Wong
  2022-02-07  1:05       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-05  0:36 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs, allison.henderson

On Fri, Feb 04, 2022 at 05:18:12PM -0600, Eric Sandeen wrote:
> On 1/19/22 6:21 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Move our private copy of the GETFSMAP definition into a libfrog header
> > file that (preferentially) uses the system header files.  We have no
> > business shipping kernel headers in the xfslibs package, but this shim
> > is still needed to build fully functional xfsprogs on old userspace.
> 
> Hm. Fine, but I wonder if we can get a bit more intentional about how
> we handle this kind of thing, I understand why we copy this stuff into
> xfsprogs early, but then we never know how to get rid of it.
> 
> Do we /need/ to build fully functional xfsprogs on old userspace?
> (really: systems with old kernel headers?)  How far back do we go,
> I wonder?  Anyway...

TBH we could probably get rid of these entirely, assuming nobody is
building xfsprogs with old kernel headers for a system with a newer
kernel?

--D

> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> 
> > ---
> >  include/linux.h   |  105 ------------------------------------------------
> >  io/fsmap.c        |    1 
> >  io/io.h           |    5 --
> >  libfrog/Makefile  |    1 
> >  libfrog/fsmap.h   |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  scrub/phase6.c    |    1 
> >  scrub/phase7.c    |    1 
> >  scrub/spacemap.c  |    1 
> >  spaceman/freesp.c |    1 
> >  spaceman/space.h  |    4 --
> >  10 files changed, 123 insertions(+), 114 deletions(-)
> >  create mode 100644 libfrog/fsmap.h
> > 
> > 
> > diff --git a/include/linux.h b/include/linux.h
> > index de8a7122..3d9f4e3d 100644
> > --- a/include/linux.h
> > +++ b/include/linux.h
> > @@ -251,111 +251,6 @@ struct fsxattr {
> >  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> >  #endif
> >  
> > -#ifdef HAVE_GETFSMAP
> > -# include <linux/fsmap.h>
> > -#else
> > -/*
> > - *	Structure for FS_IOC_GETFSMAP.
> > - *
> > - *	The memory layout for this call are the scalar values defined in
> > - *	struct fsmap_head, followed by two struct fsmap that describe
> > - *	the lower and upper bound of mappings to return, followed by an
> > - *	array of struct fsmap mappings.
> > - *
> > - *	fmh_iflags control the output of the call, whereas fmh_oflags report
> > - *	on the overall record output.  fmh_count should be set to the
> > - *	length of the fmh_recs array, and fmh_entries will be set to the
> > - *	number of entries filled out during each call.  If fmh_count is
> > - *	zero, the number of reverse mappings will be returned in
> > - *	fmh_entries, though no mappings will be returned.  fmh_reserved
> > - *	must be set to zero.
> > - *
> > - *	The two elements in the fmh_keys array are used to constrain the
> > - *	output.  The first element in the array should represent the
> > - *	lowest disk mapping ("low key") that the user wants to learn
> > - *	about.  If this value is all zeroes, the filesystem will return
> > - *	the first entry it knows about.  For a subsequent call, the
> > - *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
> > - *	copied into fmh_keys[0] to have the kernel start where it left off.
> > - *
> > - *	The second element in the fmh_keys array should represent the
> > - *	highest disk mapping ("high key") that the user wants to learn
> > - *	about.  If this value is all ones, the filesystem will not stop
> > - *	until it runs out of mapping to return or runs out of space in
> > - *	fmh_recs.
> > - *
> > - *	fmr_device can be either a 32-bit cookie representing a device, or
> > - *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
> > - *	fmr_offset, and fmr_length are expressed in units of bytes.
> > - *	fmr_owner is either an inode number, or a special value if
> > - *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
> > - */
> > -struct fsmap {
> > -	__u32		fmr_device;	/* device id */
> > -	__u32		fmr_flags;	/* mapping flags */
> > -	__u64		fmr_physical;	/* device offset of segment */
> > -	__u64		fmr_owner;	/* owner id */
> > -	__u64		fmr_offset;	/* file offset of segment */
> > -	__u64		fmr_length;	/* length of segment */
> > -	__u64		fmr_reserved[3];	/* must be zero */
> > -};
> > -
> > -struct fsmap_head {
> > -	__u32		fmh_iflags;	/* control flags */
> > -	__u32		fmh_oflags;	/* output flags */
> > -	__u32		fmh_count;	/* # of entries in array incl. input */
> > -	__u32		fmh_entries;	/* # of entries filled in (output). */
> > -	__u64		fmh_reserved[6];	/* must be zero */
> > -
> > -	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
> > -	struct fsmap	fmh_recs[];	/* returned records */
> > -};
> > -
> > -/* Size of an fsmap_head with room for nr records. */
> > -static inline size_t
> > -fsmap_sizeof(
> > -	unsigned int	nr)
> > -{
> > -	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
> > -}
> > -
> > -/* Start the next fsmap query at the end of the current query results. */
> > -static inline void
> > -fsmap_advance(
> > -	struct fsmap_head	*head)
> > -{
> > -	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
> > -}
> > -
> > -/*	fmh_iflags values - set by XFS_IOC_GETFSMAP caller in the header. */
> > -/* no flags defined yet */
> > -#define FMH_IF_VALID		0
> > -
> > -/*	fmh_oflags values - returned in the header segment only. */
> > -#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
> > -
> > -/*	fmr_flags values - returned for each non-header segment */
> > -#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
> > -#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
> > -#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
> > -#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
> > -#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
> > -#define FMR_OF_LAST		0x20	/* segment is the last in the FS */
> > -
> > -/* Each FS gets to define its own special owner codes. */
> > -#define FMR_OWNER(type, code)	(((__u64)type << 32) | \
> > -				 ((__u64)code & 0xFFFFFFFFULL))
> > -#define FMR_OWNER_TYPE(owner)	((__u32)((__u64)owner >> 32))
> > -#define FMR_OWNER_CODE(owner)	((__u32)(((__u64)owner & 0xFFFFFFFFULL)))
> > -#define FMR_OWN_FREE		FMR_OWNER(0, 1) /* free space */
> > -#define FMR_OWN_UNKNOWN		FMR_OWNER(0, 2) /* unknown owner */
> > -#define FMR_OWN_METADATA	FMR_OWNER(0, 3) /* metadata */
> > -
> > -#define FS_IOC_GETFSMAP		_IOWR('X', 59, struct fsmap_head)
> > -
> > -#define HAVE_GETFSMAP
> > -#endif /* HAVE_GETFSMAP */
> > -
> >  #ifndef HAVE_MAP_SYNC
> >  #define MAP_SYNC 0
> >  #define MAP_SHARED_VALIDATE 0
> > diff --git a/io/fsmap.c b/io/fsmap.c
> > index f540a7c0..9ff36bf4 100644
> > --- a/io/fsmap.c
> > +++ b/io/fsmap.c
> > @@ -10,6 +10,7 @@
> >  #include "io.h"
> >  #include "input.h"
> >  #include "libfrog/fsgeom.h"
> > +#include "libfrog/fsmap.h"
> >  
> >  static cmdinfo_t	fsmap_cmd;
> >  static dev_t		xfs_data_dev;
> > diff --git a/io/io.h b/io/io.h
> > index 49db902f..39fb5878 100644
> > --- a/io/io.h
> > +++ b/io/io.h
> > @@ -167,12 +167,7 @@ extern void		readdir_init(void);
> >  extern void		reflink_init(void);
> >  
> >  extern void		cowextsize_init(void);
> > -
> > -#ifdef HAVE_GETFSMAP
> >  extern void		fsmap_init(void);
> > -#else
> > -# define fsmap_init()	do { } while (0)
> > -#endif
> >  
> >  #ifdef HAVE_DEVMAPPER
> >  extern void		log_writes_init(void);
> > diff --git a/libfrog/Makefile b/libfrog/Makefile
> > index 01107082..d6044455 100644
> > --- a/libfrog/Makefile
> > +++ b/libfrog/Makefile
> > @@ -40,6 +40,7 @@ crc32cselftest.h \
> >  crc32defs.h \
> >  crc32table.h \
> >  fsgeom.h \
> > +fsmap.h \
> >  logging.h \
> >  paths.h \
> >  projects.h \
> > diff --git a/libfrog/fsmap.h b/libfrog/fsmap.h
> > new file mode 100644
> > index 00000000..dc290962
> > --- /dev/null
> > +++ b/libfrog/fsmap.h
> > @@ -0,0 +1,117 @@
> > +#ifdef HAVE_GETFSMAP
> > +# include <linux/fsmap.h>
> > +#endif
> > +
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * FS_IOC_GETFSMAP ioctl infrastructure.
> > + *
> > + * Copyright (C) 2017 Oracle.  All Rights Reserved.
> > + *
> > + * Author: Darrick J. Wong <djwong@kernel.org>
> > + */
> > +#ifndef _LINUX_FSMAP_H
> > +#define _LINUX_FSMAP_H
> > +
> > +#include <linux/types.h>
> > +
> > +/*
> > + *	Structure for FS_IOC_GETFSMAP.
> > + *
> > + *	The memory layout for this call are the scalar values defined in
> > + *	struct fsmap_head, followed by two struct fsmap that describe
> > + *	the lower and upper bound of mappings to return, followed by an
> > + *	array of struct fsmap mappings.
> > + *
> > + *	fmh_iflags control the output of the call, whereas fmh_oflags report
> > + *	on the overall record output.  fmh_count should be set to the
> > + *	length of the fmh_recs array, and fmh_entries will be set to the
> > + *	number of entries filled out during each call.  If fmh_count is
> > + *	zero, the number of reverse mappings will be returned in
> > + *	fmh_entries, though no mappings will be returned.  fmh_reserved
> > + *	must be set to zero.
> > + *
> > + *	The two elements in the fmh_keys array are used to constrain the
> > + *	output.  The first element in the array should represent the
> > + *	lowest disk mapping ("low key") that the user wants to learn
> > + *	about.  If this value is all zeroes, the filesystem will return
> > + *	the first entry it knows about.  For a subsequent call, the
> > + *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
> > + *	copied into fmh_keys[0] to have the kernel start where it left off.
> > + *
> > + *	The second element in the fmh_keys array should represent the
> > + *	highest disk mapping ("high key") that the user wants to learn
> > + *	about.  If this value is all ones, the filesystem will not stop
> > + *	until it runs out of mapping to return or runs out of space in
> > + *	fmh_recs.
> > + *
> > + *	fmr_device can be either a 32-bit cookie representing a device, or
> > + *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
> > + *	fmr_offset, and fmr_length are expressed in units of bytes.
> > + *	fmr_owner is either an inode number, or a special value if
> > + *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
> > + */
> > +struct fsmap {
> > +	__u32		fmr_device;	/* device id */
> > +	__u32		fmr_flags;	/* mapping flags */
> > +	__u64		fmr_physical;	/* device offset of segment */
> > +	__u64		fmr_owner;	/* owner id */
> > +	__u64		fmr_offset;	/* file offset of segment */
> > +	__u64		fmr_length;	/* length of segment */
> > +	__u64		fmr_reserved[3];	/* must be zero */
> > +};
> > +
> > +struct fsmap_head {
> > +	__u32		fmh_iflags;	/* control flags */
> > +	__u32		fmh_oflags;	/* output flags */
> > +	__u32		fmh_count;	/* # of entries in array incl. input */
> > +	__u32		fmh_entries;	/* # of entries filled in (output). */
> > +	__u64		fmh_reserved[6];	/* must be zero */
> > +
> > +	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
> > +	struct fsmap	fmh_recs[];	/* returned records */
> > +};
> > +
> > +/* Size of an fsmap_head with room for nr records. */
> > +static __inline__ size_t
> > +fsmap_sizeof(
> > +	unsigned int	nr)
> > +{
> > +	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
> > +}
> > +
> > +/* Start the next fsmap query at the end of the current query results. */
> > +static __inline__ void
> > +fsmap_advance(
> > +	struct fsmap_head	*head)
> > +{
> > +	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
> > +}
> > +
> > +/*	fmh_iflags values - set by FS_IOC_GETFSMAP caller in the header. */
> > +/* no flags defined yet */
> > +#define FMH_IF_VALID		0
> > +
> > +/*	fmh_oflags values - returned in the header segment only. */
> > +#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
> > +
> > +/*	fmr_flags values - returned for each non-header segment */
> > +#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
> > +#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
> > +#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
> > +#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
> > +#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
> > +#define FMR_OF_LAST		0x20	/* segment is the last in the dataset */
> > +
> > +/* Each FS gets to define its own special owner codes. */
> > +#define FMR_OWNER(type, code)	(((__u64)type << 32) | \
> > +				 ((__u64)code & 0xFFFFFFFFULL))
> > +#define FMR_OWNER_TYPE(owner)	((__u32)((__u64)owner >> 32))
> > +#define FMR_OWNER_CODE(owner)	((__u32)(((__u64)owner & 0xFFFFFFFFULL)))
> > +#define FMR_OWN_FREE		FMR_OWNER(0, 1) /* free space */
> > +#define FMR_OWN_UNKNOWN		FMR_OWNER(0, 2) /* unknown owner */
> > +#define FMR_OWN_METADATA	FMR_OWNER(0, 3) /* metadata */
> > +
> > +#define FS_IOC_GETFSMAP		_IOWR('X', 59, struct fsmap_head)
> > +
> > +#endif /* _LINUX_FSMAP_H */
> > diff --git a/scrub/phase6.c b/scrub/phase6.c
> > index 87828b60..dd42b66c 100644
> > --- a/scrub/phase6.c
> > +++ b/scrub/phase6.c
> > @@ -10,6 +10,7 @@
> >  #include "handle.h"
> >  #include "libfrog/paths.h"
> >  #include "libfrog/workqueue.h"
> > +#include "libfrog/fsmap.h"
> >  #include "xfs_scrub.h"
> >  #include "common.h"
> >  #include "libfrog/bitmap.h"
> > diff --git a/scrub/phase7.c b/scrub/phase7.c
> > index bc652ab6..e24906d1 100644
> > --- a/scrub/phase7.c
> > +++ b/scrub/phase7.c
> > @@ -9,6 +9,7 @@
> >  #include <sys/statvfs.h>
> >  #include "libfrog/paths.h"
> >  #include "libfrog/ptvar.h"
> > +#include "libfrog/fsmap.h"
> >  #include "list.h"
> >  #include "xfs_scrub.h"
> >  #include "common.h"
> > diff --git a/scrub/spacemap.c b/scrub/spacemap.c
> > index a5508d56..b7f17e57 100644
> > --- a/scrub/spacemap.c
> > +++ b/scrub/spacemap.c
> > @@ -10,6 +10,7 @@
> >  #include <sys/statvfs.h>
> >  #include "libfrog/workqueue.h"
> >  #include "libfrog/paths.h"
> > +#include "libfrog/fsmap.h"
> >  #include "xfs_scrub.h"
> >  #include "common.h"
> >  #include "spacemap.h"
> > diff --git a/spaceman/freesp.c b/spaceman/freesp.c
> > index de301c19..4e46ab26 100644
> > --- a/spaceman/freesp.c
> > +++ b/spaceman/freesp.c
> > @@ -9,6 +9,7 @@
> >  #include "libxfs.h"
> >  #include <linux/fiemap.h>
> >  #include "libfrog/fsgeom.h"
> > +#include "libfrog/fsmap.h"
> >  #include "command.h"
> >  #include "init.h"
> >  #include "libfrog/paths.h"
> > diff --git a/spaceman/space.h b/spaceman/space.h
> > index 723209ed..a8055535 100644
> > --- a/spaceman/space.h
> > +++ b/spaceman/space.h
> > @@ -26,11 +26,7 @@ extern void	help_init(void);
> >  extern void	prealloc_init(void);
> >  extern void	quit_init(void);
> >  extern void	trim_init(void);
> > -#ifdef HAVE_GETFSMAP
> >  extern void	freesp_init(void);
> > -#else
> > -# define freesp_init()	do { } while (0)
> > -#endif
> >  extern void	info_init(void);
> >  extern void	health_init(void);
> >  
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog
  2022-02-05  0:36     ` Darrick J. Wong
@ 2022-02-07  1:05       ` Dave Chinner
  2022-02-07 17:09         ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2022-02-07  1:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Eric Sandeen, linux-xfs, allison.henderson

On Fri, Feb 04, 2022 at 04:36:18PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 04, 2022 at 05:18:12PM -0600, Eric Sandeen wrote:
> > On 1/19/22 6:21 PM, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Move our private copy of the GETFSMAP definition into a libfrog header
> > > file that (preferentially) uses the system header files.  We have no
> > > business shipping kernel headers in the xfslibs package, but this shim
> > > is still needed to build fully functional xfsprogs on old userspace.
> > 
> > Hm. Fine, but I wonder if we can get a bit more intentional about how
> > we handle this kind of thing, I understand why we copy this stuff into
> > xfsprogs early, but then we never know how to get rid of it.
> > 
> > Do we /need/ to build fully functional xfsprogs on old userspace?
> > (really: systems with old kernel headers?)  How far back do we go,
> > I wonder?  Anyway...
> 
> TBH we could probably get rid of these entirely, assuming nobody is
> building xfsprogs with old kernel headers for a system with a newer
> kernel?

Just fiddle the autoconf rules to refuse to build if the system
headers we need aren't present. It just means that build systems
need to have the userspace they intend to target installed in the
build environment.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog
  2022-02-07  1:05       ` Dave Chinner
@ 2022-02-07 17:09         ` Darrick J. Wong
  2022-02-07 21:32           ` Eric Sandeen
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-07 17:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs, allison.henderson

On Mon, Feb 07, 2022 at 12:05:41PM +1100, Dave Chinner wrote:
> On Fri, Feb 04, 2022 at 04:36:18PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 04, 2022 at 05:18:12PM -0600, Eric Sandeen wrote:
> > > On 1/19/22 6:21 PM, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Move our private copy of the GETFSMAP definition into a libfrog header
> > > > file that (preferentially) uses the system header files.  We have no
> > > > business shipping kernel headers in the xfslibs package, but this shim
> > > > is still needed to build fully functional xfsprogs on old userspace.
> > > 
> > > Hm. Fine, but I wonder if we can get a bit more intentional about how
> > > we handle this kind of thing, I understand why we copy this stuff into
> > > xfsprogs early, but then we never know how to get rid of it.
> > > 
> > > Do we /need/ to build fully functional xfsprogs on old userspace?
> > > (really: systems with old kernel headers?)  How far back do we go,
> > > I wonder?  Anyway...
> > 
> > TBH we could probably get rid of these entirely, assuming nobody is
> > building xfsprogs with old kernel headers for a system with a newer
> > kernel?
> 
> Just fiddle the autoconf rules to refuse to build if the system
> headers we need aren't present. It just means that build systems
> need to have the userspace they intend to target installed in the
> build environment.

GETFSMAP premiered in 4.12, so I'm going to take this response (and the
lack of any others) as a sign that I can respin this patch to require
recent kernel headers instead of providing our own copy.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog
  2022-02-07 17:09         ` Darrick J. Wong
@ 2022-02-07 21:32           ` Eric Sandeen
  2022-02-10  3:33             ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Sandeen @ 2022-02-07 21:32 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner; +Cc: linux-xfs, allison.henderson

On 2/7/22 11:09 AM, Darrick J. Wong wrote:
> On Mon, Feb 07, 2022 at 12:05:41PM +1100, Dave Chinner wrote:
>> On Fri, Feb 04, 2022 at 04:36:18PM -0800, Darrick J. Wong wrote:
>>> On Fri, Feb 04, 2022 at 05:18:12PM -0600, Eric Sandeen wrote:

...

>>>> Do we /need/ to build fully functional xfsprogs on old userspace?
>>>> (really: systems with old kernel headers?)  How far back do we go,
>>>> I wonder?  Anyway...
>>>
>>> TBH we could probably get rid of these entirely, assuming nobody is
>>> building xfsprogs with old kernel headers for a system with a newer
>>> kernel?
>>
>> Just fiddle the autoconf rules to refuse to build if the system
>> headers we need aren't present. It just means that build systems
>> need to have the userspace they intend to target installed in the
>> build environment.
> 
> GETFSMAP premiered in 4.12, so I'm going to take this response (and the
> lack of any others) as a sign that I can respin this patch to require
> recent kernel headers instead of providing our own copy.

Sounds reasonable, thanks. Maybe in the future when we add stuff like
this for bleeding edge interfaces we can mark the date, and mark another
one in what, a year or two, as a reminder for removal.

-Eric


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions
  2022-01-20  0:21 ` [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog Darrick J. Wong
  2022-02-04 23:18   ` Eric Sandeen
@ 2022-02-08 16:46   ` Darrick J. Wong
  2022-02-25 22:35     ` Eric Sandeen
  1 sibling, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-08 16:46 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs, allison.henderson, Dave Chinner

From: Darrick J. Wong <djwong@kernel.org>

The GETFSMAP ioctl has been a part of the kernel since 4.12.  We have no
business shipping a stale copy of kernel header contents in the xfslibs
package, so get rid of it.  This means that xfs_scrub now has a hard
dependency on the build system having new kernel headers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/linux.h   |  105 -----------------------------------------------------
 io/Makefile       |    5 +--
 io/fsmap.c        |    1 +
 scrub/Makefile    |    7 +---
 scrub/phase6.c    |    1 +
 scrub/phase7.c    |    1 +
 scrub/spacemap.c  |    1 +
 spaceman/Makefile |    5 +--
 spaceman/freesp.c |    1 +
 9 files changed, 11 insertions(+), 116 deletions(-)

diff --git a/include/linux.h b/include/linux.h
index de8a7122..3d9f4e3d 100644
--- a/include/linux.h
+++ b/include/linux.h
@@ -251,111 +251,6 @@ struct fsxattr {
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
 #endif
 
-#ifdef HAVE_GETFSMAP
-# include <linux/fsmap.h>
-#else
-/*
- *	Structure for FS_IOC_GETFSMAP.
- *
- *	The memory layout for this call are the scalar values defined in
- *	struct fsmap_head, followed by two struct fsmap that describe
- *	the lower and upper bound of mappings to return, followed by an
- *	array of struct fsmap mappings.
- *
- *	fmh_iflags control the output of the call, whereas fmh_oflags report
- *	on the overall record output.  fmh_count should be set to the
- *	length of the fmh_recs array, and fmh_entries will be set to the
- *	number of entries filled out during each call.  If fmh_count is
- *	zero, the number of reverse mappings will be returned in
- *	fmh_entries, though no mappings will be returned.  fmh_reserved
- *	must be set to zero.
- *
- *	The two elements in the fmh_keys array are used to constrain the
- *	output.  The first element in the array should represent the
- *	lowest disk mapping ("low key") that the user wants to learn
- *	about.  If this value is all zeroes, the filesystem will return
- *	the first entry it knows about.  For a subsequent call, the
- *	contents of fsmap_head.fmh_recs[fsmap_head.fmh_count - 1] should be
- *	copied into fmh_keys[0] to have the kernel start where it left off.
- *
- *	The second element in the fmh_keys array should represent the
- *	highest disk mapping ("high key") that the user wants to learn
- *	about.  If this value is all ones, the filesystem will not stop
- *	until it runs out of mapping to return or runs out of space in
- *	fmh_recs.
- *
- *	fmr_device can be either a 32-bit cookie representing a device, or
- *	a 32-bit dev_t if the FMH_OF_DEV_T flag is set.  fmr_physical,
- *	fmr_offset, and fmr_length are expressed in units of bytes.
- *	fmr_owner is either an inode number, or a special value if
- *	FMR_OF_SPECIAL_OWNER is set in fmr_flags.
- */
-struct fsmap {
-	__u32		fmr_device;	/* device id */
-	__u32		fmr_flags;	/* mapping flags */
-	__u64		fmr_physical;	/* device offset of segment */
-	__u64		fmr_owner;	/* owner id */
-	__u64		fmr_offset;	/* file offset of segment */
-	__u64		fmr_length;	/* length of segment */
-	__u64		fmr_reserved[3];	/* must be zero */
-};
-
-struct fsmap_head {
-	__u32		fmh_iflags;	/* control flags */
-	__u32		fmh_oflags;	/* output flags */
-	__u32		fmh_count;	/* # of entries in array incl. input */
-	__u32		fmh_entries;	/* # of entries filled in (output). */
-	__u64		fmh_reserved[6];	/* must be zero */
-
-	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
-	struct fsmap	fmh_recs[];	/* returned records */
-};
-
-/* Size of an fsmap_head with room for nr records. */
-static inline size_t
-fsmap_sizeof(
-	unsigned int	nr)
-{
-	return sizeof(struct fsmap_head) + nr * sizeof(struct fsmap);
-}
-
-/* Start the next fsmap query at the end of the current query results. */
-static inline void
-fsmap_advance(
-	struct fsmap_head	*head)
-{
-	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
-}
-
-/*	fmh_iflags values - set by XFS_IOC_GETFSMAP caller in the header. */
-/* no flags defined yet */
-#define FMH_IF_VALID		0
-
-/*	fmh_oflags values - returned in the header segment only. */
-#define FMH_OF_DEV_T		0x1	/* fmr_device values will be dev_t */
-
-/*	fmr_flags values - returned for each non-header segment */
-#define FMR_OF_PREALLOC		0x1	/* segment = unwritten pre-allocation */
-#define FMR_OF_ATTR_FORK	0x2	/* segment = attribute fork */
-#define FMR_OF_EXTENT_MAP	0x4	/* segment = extent map */
-#define FMR_OF_SHARED		0x8	/* segment = shared with another file */
-#define FMR_OF_SPECIAL_OWNER	0x10	/* owner is a special value */
-#define FMR_OF_LAST		0x20	/* segment is the last in the FS */
-
-/* Each FS gets to define its own special owner codes. */
-#define FMR_OWNER(type, code)	(((__u64)type << 32) | \
-				 ((__u64)code & 0xFFFFFFFFULL))
-#define FMR_OWNER_TYPE(owner)	((__u32)((__u64)owner >> 32))
-#define FMR_OWNER_CODE(owner)	((__u32)(((__u64)owner & 0xFFFFFFFFULL)))
-#define FMR_OWN_FREE		FMR_OWNER(0, 1) /* free space */
-#define FMR_OWN_UNKNOWN		FMR_OWNER(0, 2) /* unknown owner */
-#define FMR_OWN_METADATA	FMR_OWNER(0, 3) /* metadata */
-
-#define FS_IOC_GETFSMAP		_IOWR('X', 59, struct fsmap_head)
-
-#define HAVE_GETFSMAP
-#endif /* HAVE_GETFSMAP */
-
 #ifndef HAVE_MAP_SYNC
 #define MAP_SYNC 0
 #define MAP_SHARED_VALIDATE 0
diff --git a/io/Makefile b/io/Makefile
index 71741926..498174cf 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -104,10 +104,9 @@ LLDLIBS += $(LIBDEVMAPPER)
 LCFLAGS += -DHAVE_DEVMAPPER
 endif
 
-# On linux we get fsmap from the system or define it ourselves
-# so include this unconditionally.  If this reverts to only
-# the autoconf check w/o local definition, test HAVE_GETFSMAP
+ifeq ($(HAVE_GETFSMAP),yes)
 CFILES += fsmap.c
+endif
 
 ifeq ($(HAVE_STATFS_FLAGS),yes)
 LCFLAGS += -DHAVE_STATFS_FLAGS
diff --git a/io/fsmap.c b/io/fsmap.c
index f540a7c0..9dd19cc0 100644
--- a/io/fsmap.c
+++ b/io/fsmap.c
@@ -4,6 +4,7 @@
  * Author: Darrick J. Wong <darrick.wong@oracle.com>
  */
 #include "platform_defs.h"
+#include <linux/fsmap.h>
 #include "command.h"
 #include "init.h"
 #include "libfrog/paths.h"
diff --git a/scrub/Makefile b/scrub/Makefile
index fd6bb679..335e1e8d 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -6,12 +6,9 @@ TOPDIR = ..
 builddefs=$(TOPDIR)/include/builddefs
 include $(builddefs)
 
-# On linux we get fsmap from the system or define it ourselves
-# so include this based on platform type.  If this reverts to only
-# the autoconf check w/o local definition, change to testing HAVE_GETFSMAP
-SCRUB_PREREQS=$(HAVE_OPENAT)$(HAVE_FSTATAT)
+SCRUB_PREREQS=$(HAVE_OPENAT)$(HAVE_FSTATAT)$(HAVE_GETFSMAP)
 
-ifeq ($(SCRUB_PREREQS),yesyes)
+ifeq ($(SCRUB_PREREQS),yesyesyes)
 LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 XFS_SCRUB_ALL_PROG = xfs_scrub_all
diff --git a/scrub/phase6.c b/scrub/phase6.c
index 87828b60..afdb16b6 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -7,6 +7,7 @@
 #include <stdint.h>
 #include <dirent.h>
 #include <sys/statvfs.h>
+#include <linux/fsmap.h>
 #include "handle.h"
 #include "libfrog/paths.h"
 #include "libfrog/workqueue.h"
diff --git a/scrub/phase7.c b/scrub/phase7.c
index bc652ab6..84546b1c 100644
--- a/scrub/phase7.c
+++ b/scrub/phase7.c
@@ -7,6 +7,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <sys/statvfs.h>
+#include <linux/fsmap.h>
 #include "libfrog/paths.h"
 #include "libfrog/ptvar.h"
 #include "list.h"
diff --git a/scrub/spacemap.c b/scrub/spacemap.c
index a5508d56..03440d3a 100644
--- a/scrub/spacemap.c
+++ b/scrub/spacemap.c
@@ -8,6 +8,7 @@
 #include <string.h>
 #include <pthread.h>
 #include <sys/statvfs.h>
+#include <linux/fsmap.h>
 #include "libfrog/workqueue.h"
 #include "libfrog/paths.h"
 #include "xfs_scrub.h"
diff --git a/spaceman/Makefile b/spaceman/Makefile
index 2a366918..1f048d54 100644
--- a/spaceman/Makefile
+++ b/spaceman/Makefile
@@ -18,10 +18,9 @@ ifeq ($(ENABLE_EDITLINE),yes)
 LLDLIBS += $(LIBEDITLINE) $(LIBTERMCAP)
 endif
 
-# On linux we get fsmap from the system or define it ourselves
-# so include this unconditionally.  If this reverts to only
-# the autoconf check w/o local definition, test HAVE_GETFSMAP
+ifeq ($(HAVE_GETFSMAP),yes)
 CFILES += freesp.c
+endif
 
 default: depend $(LTCOMMAND)
 
diff --git a/spaceman/freesp.c b/spaceman/freesp.c
index de301c19..423568a4 100644
--- a/spaceman/freesp.c
+++ b/spaceman/freesp.c
@@ -8,6 +8,7 @@
 
 #include "libxfs.h"
 #include <linux/fiemap.h>
+#include <linux/fsmap.h>
 #include "libfrog/fsgeom.h"
 #include "command.h"
 #include "init.h"

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog
  2022-02-07 21:32           ` Eric Sandeen
@ 2022-02-10  3:33             ` Dave Chinner
  0 siblings, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2022-02-10  3:33 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Darrick J. Wong, linux-xfs, allison.henderson

On Mon, Feb 07, 2022 at 03:32:10PM -0600, Eric Sandeen wrote:
> On 2/7/22 11:09 AM, Darrick J. Wong wrote:
> > On Mon, Feb 07, 2022 at 12:05:41PM +1100, Dave Chinner wrote:
> >> On Fri, Feb 04, 2022 at 04:36:18PM -0800, Darrick J. Wong wrote:
> >>> On Fri, Feb 04, 2022 at 05:18:12PM -0600, Eric Sandeen wrote:
> 
> ...
> 
> >>>> Do we /need/ to build fully functional xfsprogs on old userspace?
> >>>> (really: systems with old kernel headers?)  How far back do we go,
> >>>> I wonder?  Anyway...
> >>>
> >>> TBH we could probably get rid of these entirely, assuming nobody is
> >>> building xfsprogs with old kernel headers for a system with a newer
> >>> kernel?
> >>
> >> Just fiddle the autoconf rules to refuse to build if the system
> >> headers we need aren't present. It just means that build systems
> >> need to have the userspace they intend to target installed in the
> >> build environment.
> > 
> > GETFSMAP premiered in 4.12, so I'm going to take this response (and the
> > lack of any others) as a sign that I can respin this patch to require
> > recent kernel headers instead of providing our own copy.
> 
> Sounds reasonable, thanks. Maybe in the future when we add stuff like
> this for bleeding edge interfaces we can mark the date, and mark another
> one in what, a year or two, as a reminder for removal.

That's what we've done in the past - provide our own copy until the
system headers catch up and then remove our copy. Note that this may
cause angst with lesser used C libraries (like musl), but they need
to keep up with new kernel APIs to be really useful, anyway...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions
  2022-01-20  0:22 ` [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions Darrick J. Wong
@ 2022-02-25 21:45   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 21:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:22 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Use the proper macro to convert ino4 and ino8 field byte sizes to a bit
> count in the functions that navigate shortform directories.  This just
> happens to work correctly for ino4 entries, but omits the upper 4 bytes
> of an ino8 entry.  Note that the entries display correctly; it's just
> the command "addr u3.sfdir3.list[X].inumber.i8" that won't.
> 
> Found by running smatch.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

> ---
>  db/faddr.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/db/faddr.c b/db/faddr.c
> index 81d69c94..0127c5d1 100644
> --- a/db/faddr.c
> +++ b/db/faddr.c
> @@ -353,7 +353,8 @@ fa_ino4(
>  	xfs_ino_t	ino;
>  
>  	ASSERT(next == TYP_INODE);
> -	ino = (xfs_ino_t)getbitval(obj, bit, bitsz(XFS_INO32_SIZE), BVUNSIGNED);
> +	ino = (xfs_ino_t)getbitval(obj, bit, bitize(XFS_INO32_SIZE),
> +			BVUNSIGNED);
>  	if (ino == NULLFSINO) {
>  		dbprintf(_("null inode number, cannot set new addr\n"));
>  		return;
> @@ -370,7 +371,8 @@ fa_ino8(
>  	xfs_ino_t	ino;
>  
>  	ASSERT(next == TYP_INODE);
> -	ino = (xfs_ino_t)getbitval(obj, bit, bitsz(XFS_INO64_SIZE), BVUNSIGNED);
> +	ino = (xfs_ino_t)getbitval(obj, bit, bitize(XFS_INO64_SIZE),
> +			BVUNSIGNED);
>  	if (ino == NULLFSINO) {
>  		dbprintf(_("null inode number, cannot set new addr\n"));
>  		return;
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn
  2022-01-20  0:22 ` [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn Darrick J. Wong
@ 2022-02-25 21:46   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 21:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:22 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Explicitly cast the ondisk dquot counter argument to do_warn when
> complaining about incorrect quota counts.  This avoids build warnings on
> ppc64le.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Hate this stuff, but sure ;) (and thanks for fixing)

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers in do_warn
  2022-01-20  0:22 ` [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers " Darrick J. Wong
@ 2022-02-25 21:48   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 21:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:22 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Explicitly cast the ondisk directory inode argument to do_warn when
> complaining about corrupt directories.  This avoids build warnings on
> armv7l.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

The fix is fine but there's no casting going on, you're using
PRIu64... *shrug* did you intend to cast it? Probably not since
every other instance uses PRIu64.

I'll fix up the commit log if I remember.

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

> ---
>  repair/dir2.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> 
> diff --git a/repair/dir2.c b/repair/dir2.c
> index fdf91532..946e729e 100644
> --- a/repair/dir2.c
> +++ b/repair/dir2.c
> @@ -1358,7 +1358,7 @@ _("can't read block %" PRIu64 " for directory inode %" PRIu64 "\n"),
>  		}
>  		if (bp->b_error == -EFSCORRUPTED) {
>  			do_warn(
> -_("corrupt directory data block %lu for inode %" PRIu64 "\n"),
> +_("corrupt directory data block %" PRIu64 " for inode %" PRIu64 "\n"),
>  				dbno, ino);
>  			libxfs_buf_relse(bp);
>  			continue;
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem
  2022-01-20  0:22 ` [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem Darrick J. Wong
@ 2022-02-25 21:53   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 21:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:22 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Indentation is supposed to be tabs, not spaces.  Fix that, and unindent
> the bwrite clause because do_error aborts the program.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Heh, whoops. I'm goad to see that I'm not the only one who
occasionally messes up whitespace :P

Reviewed-by: Eric Sandeen <sandeen@redhat.com>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/17] xfs_repair: update secondary superblocks after changing features
  2022-01-20  0:22 ` [PATCH 11/17] xfs_repair: update secondary superblocks after changing features Darrick J. Wong
@ 2022-02-25 21:57   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 21:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:22 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When we add features to an existing filesystem, make sure we update the
> secondary superblocks to reflect the new geometry so that if we lose the
> primary super in the future, repair will recover correctly.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

whoops, yep.

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 12/17] xfs_scrub: report optional features in version string
  2022-01-20  1:32   ` [PATCH v2 " Darrick J. Wong
@ 2022-02-25 22:14     ` Eric Sandeen
  2022-02-26  0:04       ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 22:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Theodore Ts'o, linux-xfs, allison.henderson

On 1/19/22 7:32 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Ted Ts'o reported brittleness in the fstests logic in generic/45[34] to
> detect whether or not xfs_scrub is capable of detecting Unicode mischief
> in directory and xattr names.  This is a compile-time feature, since we
> do not assume that all distros will want to ship xfsprogs with libicu.
> 
> Rather than relying on ldd tests (which don't work at all if xfs_scrub
> is compiled statically), let's have -V print whether or not the feature
> is built into the tool.  Phase 5 still requires the presence of "UTF-8"
> in LC_MESSAGES to enable Unicode confusable detection; this merely makes
> the feature easier to discover.
> 
> Reported-by: Theodore Ts'o <tytso@mit.edu>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> v2: correct the name of the reporter
> ---

Hum, every single other utility just does "$progname version $version"
and I'm not that keen to tack on something for everyone, if it won't
really mean anything to anyone except xfstests scripts ;)

What about adding an "-F" to display features, and xfstests can use that,
and xfs_scrub -V will keep acting like every other utility?

Other utilities could use this too if we ever cared (though xfs_db
and xfs_io already have an "-F" option ... we could choose -Z for
featureZ, which is unused as a primary option anywhere ...)

like so:

===

diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
index e881ae76..65d8f4a2 100644
--- a/man/man8/xfs_scrub.8
+++ b/man/man8/xfs_scrub.8
@@ -8,7 +8,7 @@ xfs_scrub \- check and repair the contents of a mounted XFS filesystem
 ]
 .I mount-point
 .br
-.B xfs_scrub \-V
+.B xfs_scrub \-V | \-F
 .SH DESCRIPTION
 .B xfs_scrub
 attempts to check and repair all metadata in a mounted XFS filesystem.
@@ -76,6 +76,9 @@ If
 is given, no action is taken if errors are found; this is the default
 behavior.
 .TP
+.B \-F
+Prints the version number along with optional build-time features and exits.
+.TP
 .B \-k
 Do not call TRIM on the free space.
 .TP
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index bc2e84a7..9e9a098c 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -582,6 +582,13 @@ report_outcome(
 	}
 }
 
+/* Compile-time features discoverable via version strings */
+#ifdef HAVE_LIBICU
+# define XFS_SCRUB_HAVE_UNICODE	"+"
+#else
+# define XFS_SCRUB_HAVE_UNICODE	"-"
+#endif
+
 int
 main(
 	int			argc,
@@ -613,7 +620,7 @@ main(
 	pthread_mutex_init(&ctx.lock, NULL);
 	ctx.mode = SCRUB_MODE_REPAIR;
 	ctx.error_action = ERRORS_CONTINUE;
-	while ((c = getopt(argc, argv, "a:bC:de:km:nTvxV")) != EOF) {
+	while ((c = getopt(argc, argv, "a:bC:de:Fkm:nTvxV")) != EOF) {
 		switch (c) {
 		case 'a':
 			ctx.max_errors = cvt_u64(optarg, 10);
@@ -654,6 +661,12 @@ main(
 				usage();
 			}
 			break;
+		case 'F':
+			fprintf(stdout, _("%s version %s %sUnicode\n"),
+					progname, VERSION,
+					XFS_SCRUB_HAVE_UNICODE);
+			fflush(stdout);
+			return SCRUB_RET_SUCCESS;
 		case 'k':
 			want_fstrim = false;
 			break;

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems
  2022-01-20  0:23 ` [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems Darrick J. Wong
@ 2022-02-25 22:21   ` Eric Sandeen
  2022-02-26  2:38     ` Darrick J. Wong
  2022-02-26  2:52   ` [PATCH v2 " Darrick J. Wong
  1 sibling, 1 reply; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 22:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> We have a handful of users who continually ping the maintainer with
> questions about why pmem and dax don't work quite the way they want
> (which is to say 2MB extents and PMD mappings) because they copy-pasted
> some garbage from Google that's wrong.  Encode the correct defaults into
> a mkfs config file and ship that.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  mkfs/Makefile        |    1 +
>  mkfs/dax_x86_64.conf |   19 +++++++++++++++++++
>  2 files changed, 20 insertions(+)
>  create mode 100644 mkfs/dax_x86_64.conf
> 
> 
> diff --git a/mkfs/Makefile b/mkfs/Makefile
> index 0aaf9d06..55d9362f 100644
> --- a/mkfs/Makefile
> +++ b/mkfs/Makefile
> @@ -10,6 +10,7 @@ LTCOMMAND = mkfs.xfs
>  HFILES =
>  CFILES = proto.c xfs_mkfs.c
>  CFGFILES = \
> +	dax_x86_64.conf \
>  	lts_4.19.conf \
>  	lts_5.4.conf \
>  	lts_5.10.conf \
> diff --git a/mkfs/dax_x86_64.conf b/mkfs/dax_x86_64.conf
> new file mode 100644
> index 00000000..bc3f3c9a
> --- /dev/null
> +++ b/mkfs/dax_x86_64.conf
> @@ -0,0 +1,19 @@
> +# mkfs.xfs configuration file for persistent memory on x86_64.
> +# Block size must match page size (4K) and we require V5 for the DAX inode
> +# flag.  Set extent size hints and stripe units to encourage the filesystem to
> +# allocate PMD sized (2MB) blocks.
> +
> +[block]
> +size=4096
> +
> +[metadata]
> +crc=1
> +
> +[data]
> +sunit=4096
> +swidth=4096

How would you feel about:

su=2m
sw=1

instead, because I think that explicit units are far more obvious than
"4096 <handwave> 512-byte units" ?

> +extszinherit=512

... though I guess this can only be specified in fsblocks, LOLZ :(

> +daxinherit=1
> +
> +[realtime]
> +extsize=2097152

Pretty weird to set this if you don't have a realtime device but I guess
it works.

-Eric

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 17/17] mkfs: enable inobtcount and bigtime by default
  2022-01-20  0:23 ` [PATCH 17/17] mkfs: enable inobtcount and bigtime by default Darrick J. Wong
@ 2022-02-25 22:22   ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 22:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 1/19/22 6:23 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Enable the inode btree counters and large timestamp features by default.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

I think that's a fantastic idea ;)

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions
  2022-02-08 16:46   ` [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions Darrick J. Wong
@ 2022-02-25 22:35     ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-25 22:35 UTC (permalink / raw)
  To: Darrick J. Wong, sandeen; +Cc: linux-xfs, allison.henderson, Dave Chinner

On 2/8/22 10:46 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The GETFSMAP ioctl has been a part of the kernel since 4.12.  We have no
> business shipping a stale copy of kernel header contents in the xfslibs
> package, so get rid of it.  This means that xfs_scrub now has a hard
> dependency on the build system having new kernel headers.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Reviewed-by: Eric Sandeen <sandeen@redhat.com>


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 12/17] xfs_scrub: report optional features in version string
  2022-02-25 22:14     ` Eric Sandeen
@ 2022-02-26  0:04       ` Darrick J. Wong
  2022-02-26  2:48         ` Darrick J. Wong
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-26  0:04 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Theodore Ts'o, linux-xfs, allison.henderson

On Fri, Feb 25, 2022 at 04:14:13PM -0600, Eric Sandeen wrote:
> On 1/19/22 7:32 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Ted Ts'o reported brittleness in the fstests logic in generic/45[34] to
> > detect whether or not xfs_scrub is capable of detecting Unicode mischief
> > in directory and xattr names.  This is a compile-time feature, since we
> > do not assume that all distros will want to ship xfsprogs with libicu.
> > 
> > Rather than relying on ldd tests (which don't work at all if xfs_scrub
> > is compiled statically), let's have -V print whether or not the feature
> > is built into the tool.  Phase 5 still requires the presence of "UTF-8"
> > in LC_MESSAGES to enable Unicode confusable detection; this merely makes
> > the feature easier to discover.
> > 
> > Reported-by: Theodore Ts'o <tytso@mit.edu>
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > v2: correct the name of the reporter
> > ---
> 
> Hum, every single other utility just does "$progname version $version"
> and I'm not that keen to tack on something for everyone, if it won't
> really mean anything to anyone except xfstests scripts ;)
> 
> What about adding an "-F" to display features, and xfstests can use that,
> and xfs_scrub -V will keep acting like every other utility?
> 
> Other utilities could use this too if we ever cared (though xfs_db
> and xfs_io already have an "-F" option ... we could choose -Z for
> featureZ, which is unused as a primary option anywhere ...)
> 
> like so:
> 
> ===
> 
> diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
> index e881ae76..65d8f4a2 100644
> --- a/man/man8/xfs_scrub.8
> +++ b/man/man8/xfs_scrub.8
> @@ -8,7 +8,7 @@ xfs_scrub \- check and repair the contents of a mounted XFS filesystem
>  ]
>  .I mount-point
>  .br
> -.B xfs_scrub \-V
> +.B xfs_scrub \-V | \-F
>  .SH DESCRIPTION
>  .B xfs_scrub
>  attempts to check and repair all metadata in a mounted XFS filesystem.
> @@ -76,6 +76,9 @@ If
>  is given, no action is taken if errors are found; this is the default
>  behavior.
>  .TP
> +.B \-F
> +Prints the version number along with optional build-time features and exits.
> +.TP
>  .B \-k
>  Do not call TRIM on the free space.
>  .TP
> diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
> index bc2e84a7..9e9a098c 100644
> --- a/scrub/xfs_scrub.c
> +++ b/scrub/xfs_scrub.c
> @@ -582,6 +582,13 @@ report_outcome(
>  	}
>  }
>  
> +/* Compile-time features discoverable via version strings */
> +#ifdef HAVE_LIBICU
> +# define XFS_SCRUB_HAVE_UNICODE	"+"
> +#else
> +# define XFS_SCRUB_HAVE_UNICODE	"-"
> +#endif
> +
>  int
>  main(
>  	int			argc,
> @@ -613,7 +620,7 @@ main(
>  	pthread_mutex_init(&ctx.lock, NULL);
>  	ctx.mode = SCRUB_MODE_REPAIR;
>  	ctx.error_action = ERRORS_CONTINUE;
> -	while ((c = getopt(argc, argv, "a:bC:de:km:nTvxV")) != EOF) {
> +	while ((c = getopt(argc, argv, "a:bC:de:Fkm:nTvxV")) != EOF) {
>  		switch (c) {
>  		case 'a':
>  			ctx.max_errors = cvt_u64(optarg, 10);
> @@ -654,6 +661,12 @@ main(
>  				usage();
>  			}
>  			break;
> +		case 'F':
> +			fprintf(stdout, _("%s version %s %sUnicode\n"),
> +					progname, VERSION,
> +					XFS_SCRUB_HAVE_UNICODE);
> +			fflush(stdout);
> +			return SCRUB_RET_SUCCESS;

Works for me!

--D

>  		case 'k':
>  			want_fstrim = false;
>  			break;

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems
  2022-02-25 22:21   ` Eric Sandeen
@ 2022-02-26  2:38     ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-26  2:38 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs, allison.henderson

On Fri, Feb 25, 2022 at 04:21:59PM -0600, Eric Sandeen wrote:
> On 1/19/22 6:23 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > We have a handful of users who continually ping the maintainer with
> > questions about why pmem and dax don't work quite the way they want
> > (which is to say 2MB extents and PMD mappings) because they copy-pasted
> > some garbage from Google that's wrong.  Encode the correct defaults into
> > a mkfs config file and ship that.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  mkfs/Makefile        |    1 +
> >  mkfs/dax_x86_64.conf |   19 +++++++++++++++++++
> >  2 files changed, 20 insertions(+)
> >  create mode 100644 mkfs/dax_x86_64.conf
> > 
> > 
> > diff --git a/mkfs/Makefile b/mkfs/Makefile
> > index 0aaf9d06..55d9362f 100644
> > --- a/mkfs/Makefile
> > +++ b/mkfs/Makefile
> > @@ -10,6 +10,7 @@ LTCOMMAND = mkfs.xfs
> >  HFILES =
> >  CFILES = proto.c xfs_mkfs.c
> >  CFGFILES = \
> > +	dax_x86_64.conf \
> >  	lts_4.19.conf \
> >  	lts_5.4.conf \
> >  	lts_5.10.conf \
> > diff --git a/mkfs/dax_x86_64.conf b/mkfs/dax_x86_64.conf
> > new file mode 100644
> > index 00000000..bc3f3c9a
> > --- /dev/null
> > +++ b/mkfs/dax_x86_64.conf
> > @@ -0,0 +1,19 @@
> > +# mkfs.xfs configuration file for persistent memory on x86_64.
> > +# Block size must match page size (4K) and we require V5 for the DAX inode
> > +# flag.  Set extent size hints and stripe units to encourage the filesystem to
> > +# allocate PMD sized (2MB) blocks.
> > +
> > +[block]
> > +size=4096
> > +
> > +[metadata]
> > +crc=1
> > +
> > +[data]
> > +sunit=4096
> > +swidth=4096
> 
> How would you feel about:
> 
> su=2m
> sw=1
> 
> instead, because I think that explicit units are far more obvious than
> "4096 <handwave> 512-byte units" ?

Fine with me. :)

> > +extszinherit=512
> 
> ... though I guess this can only be specified in fsblocks, LOLZ :(
> 
> > +daxinherit=1
> > +
> > +[realtime]
> > +extsize=2097152
> 
> Pretty weird to set this if you don't have a realtime device but I guess
> it works.

Yeah, everything rt is weird. :)

--D

> 
> -Eric

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 12/17] xfs_scrub: report optional features in version string
  2022-02-26  0:04       ` Darrick J. Wong
@ 2022-02-26  2:48         ` Darrick J. Wong
  0 siblings, 0 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-26  2:48 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Theodore Ts'o, linux-xfs, allison.henderson

On Fri, Feb 25, 2022 at 04:04:21PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 25, 2022 at 04:14:13PM -0600, Eric Sandeen wrote:
> > On 1/19/22 7:32 PM, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Ted Ts'o reported brittleness in the fstests logic in generic/45[34] to
> > > detect whether or not xfs_scrub is capable of detecting Unicode mischief
> > > in directory and xattr names.  This is a compile-time feature, since we
> > > do not assume that all distros will want to ship xfsprogs with libicu.
> > > 
> > > Rather than relying on ldd tests (which don't work at all if xfs_scrub
> > > is compiled statically), let's have -V print whether or not the feature
> > > is built into the tool.  Phase 5 still requires the presence of "UTF-8"
> > > in LC_MESSAGES to enable Unicode confusable detection; this merely makes
> > > the feature easier to discover.
> > > 
> > > Reported-by: Theodore Ts'o <tytso@mit.edu>
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > v2: correct the name of the reporter
> > > ---
> > 
> > Hum, every single other utility just does "$progname version $version"
> > and I'm not that keen to tack on something for everyone, if it won't
> > really mean anything to anyone except xfstests scripts ;)
> > 
> > What about adding an "-F" to display features, and xfstests can use that,
> > and xfs_scrub -V will keep acting like every other utility?
> > 
> > Other utilities could use this too if we ever cared (though xfs_db
> > and xfs_io already have an "-F" option ... we could choose -Z for
> > featureZ, which is unused as a primary option anywhere ...)
> > 
> > like so:
> > 
> > ===
> > 
> > diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
> > index e881ae76..65d8f4a2 100644
> > --- a/man/man8/xfs_scrub.8
> > +++ b/man/man8/xfs_scrub.8
> > @@ -8,7 +8,7 @@ xfs_scrub \- check and repair the contents of a mounted XFS filesystem
> >  ]
> >  .I mount-point
> >  .br
> > -.B xfs_scrub \-V
> > +.B xfs_scrub \-V | \-F
> >  .SH DESCRIPTION
> >  .B xfs_scrub
> >  attempts to check and repair all metadata in a mounted XFS filesystem.
> > @@ -76,6 +76,9 @@ If
> >  is given, no action is taken if errors are found; this is the default
> >  behavior.
> >  .TP
> > +.B \-F
> > +Prints the version number along with optional build-time features and exits.
> > +.TP
> >  .B \-k
> >  Do not call TRIM on the free space.
> >  .TP
> > diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
> > index bc2e84a7..9e9a098c 100644
> > --- a/scrub/xfs_scrub.c
> > +++ b/scrub/xfs_scrub.c
> > @@ -582,6 +582,13 @@ report_outcome(
> >  	}
> >  }
> >  
> > +/* Compile-time features discoverable via version strings */
> > +#ifdef HAVE_LIBICU
> > +# define XFS_SCRUB_HAVE_UNICODE	"+"
> > +#else
> > +# define XFS_SCRUB_HAVE_UNICODE	"-"
> > +#endif
> > +
> >  int
> >  main(
> >  	int			argc,
> > @@ -613,7 +620,7 @@ main(
> >  	pthread_mutex_init(&ctx.lock, NULL);
> >  	ctx.mode = SCRUB_MODE_REPAIR;
> >  	ctx.error_action = ERRORS_CONTINUE;
> > -	while ((c = getopt(argc, argv, "a:bC:de:km:nTvxV")) != EOF) {
> > +	while ((c = getopt(argc, argv, "a:bC:de:Fkm:nTvxV")) != EOF) {
> >  		switch (c) {
> >  		case 'a':
> >  			ctx.max_errors = cvt_u64(optarg, 10);
> > @@ -654,6 +661,12 @@ main(
> >  				usage();
> >  			}
> >  			break;
> > +		case 'F':
> > +			fprintf(stdout, _("%s version %s %sUnicode\n"),
> > +					progname, VERSION,
> > +					XFS_SCRUB_HAVE_UNICODE);
> > +			fflush(stdout);
> > +			return SCRUB_RET_SUCCESS;
> 
> Works for me!

Actually, I take it back, let's keep -F unused for now and simply make
'-VV' the magic command that triggers the feature reporting.  I'll send
a v3 with this.

--D

> 
> --D
> 
> >  		case 'k':
> >  			want_fstrim = false;
> >  			break;

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 16/17] mkfs: add a config file for x86_64 pmem filesystems
  2022-01-20  0:23 ` [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems Darrick J. Wong
  2022-02-25 22:21   ` Eric Sandeen
@ 2022-02-26  2:52   ` Darrick J. Wong
  2022-02-28 21:37     ` Eric Sandeen
  1 sibling, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-26  2:52 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

We have a handful of users who continually ping the maintainer with
questions about why pmem and dax don't work quite the way they want
(which is to say 2MB extents and PMD mappings) because they copy-pasted
some garbage from Google that's wrong.  Encode the correct defaults into
a mkfs config file and ship that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v2: s/sunit/su/
---
 mkfs/Makefile        |    1 +
 mkfs/dax_x86_64.conf |   19 +++++++++++++++++++
 2 files changed, 20 insertions(+)
 create mode 100644 mkfs/dax_x86_64.conf

diff --git a/mkfs/Makefile b/mkfs/Makefile
index 0aaf9d06..55d9362f 100644
--- a/mkfs/Makefile
+++ b/mkfs/Makefile
@@ -10,6 +10,7 @@ LTCOMMAND = mkfs.xfs
 HFILES =
 CFILES = proto.c xfs_mkfs.c
 CFGFILES = \
+	dax_x86_64.conf \
 	lts_4.19.conf \
 	lts_5.4.conf \
 	lts_5.10.conf \
diff --git a/mkfs/dax_x86_64.conf b/mkfs/dax_x86_64.conf
new file mode 100644
index 00000000..3a6ae988
--- /dev/null
+++ b/mkfs/dax_x86_64.conf
@@ -0,0 +1,19 @@
+# mkfs.xfs configuration file for persistent memory on x86_64.
+# Block size must match page size (4K) and we require V5 for the DAX inode
+# flag.  Set extent size hints and stripe units to encourage the filesystem to
+# allocate PMD sized (2MB) blocks.
+
+[block]
+size=4096
+
+[metadata]
+crc=1
+
+[data]
+su=2m
+sw=1
+extszinherit=512
+daxinherit=1
+
+[realtime]
+extsize=2097152

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v3 12/17] xfs_scrub: report optional features in version string
  2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
  2022-01-20  1:16   ` Theodore Ts'o
  2022-01-20  1:32   ` [PATCH v2 " Darrick J. Wong
@ 2022-02-26  2:53   ` Darrick J. Wong
  2022-02-28 21:38     ` Eric Sandeen
  2 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-26  2:53 UTC (permalink / raw)
  To: sandeen; +Cc: Theodore Ts'o, linux-xfs, allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Ted Ts'o reported brittleness in the fstests logic in generic/45[34] to
detect whether or not xfs_scrub is capable of detecting Unicode mischief
in directory and xattr names.  This is a compile-time feature, since we
do not assume that all distros will want to ship xfsprogs with libicu.

Rather than relying on ldd tests (which don't work at all if xfs_scrub
is compiled statically), let's have -V print whether or not the feature
is built into the tool.  Phase 5 still requires the presence of "UTF-8"
in LC_MESSAGES to enable Unicode confusable detection; this merely makes
the feature easier to discover.

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v2: correct the name of the reporter
v3: only report if -VV specified
---
 scrub/xfs_scrub.c |   26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index bc2e84a7..41839c26 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -582,6 +582,13 @@ report_outcome(
 	}
 }
 
+/* Compile-time features discoverable via version strings */
+#ifdef HAVE_LIBICU
+# define XFS_SCRUB_HAVE_UNICODE	"+"
+#else
+# define XFS_SCRUB_HAVE_UNICODE	"-"
+#endif
+
 int
 main(
 	int			argc,
@@ -592,6 +599,7 @@ main(
 	char			*mtab = NULL;
 	FILE			*progress_fp = NULL;
 	struct fs_path		*fsp;
+	int			vflag = 0;
 	int			c;
 	int			fd;
 	int			ret = SCRUB_RET_SUCCESS;
@@ -670,10 +678,8 @@ main(
 			verbose = true;
 			break;
 		case 'V':
-			fprintf(stdout, _("%s version %s\n"), progname,
-					VERSION);
-			fflush(stdout);
-			return SCRUB_RET_SUCCESS;
+			vflag++;
+			break;
 		case 'x':
 			scrub_data = true;
 			break;
@@ -682,6 +688,18 @@ main(
 		}
 	}
 
+	if (vflag) {
+		if (vflag == 1)
+			fprintf(stdout, _("%s version %s\n"),
+					progname, VERSION);
+		else
+			fprintf(stdout, _("%s version %s %sUnicode\n"),
+					progname, VERSION,
+					XFS_SCRUB_HAVE_UNICODE);
+		fflush(stdout);
+		return SCRUB_RET_SUCCESS;
+	}
+
 	/* Override thread count if debugger */
 	if (debug_tweak_on("XFS_SCRUB_THREADS")) {
 		unsigned int	x;

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
                   ` (17 preceding siblings ...)
  2022-01-28 22:44 ` [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices Darrick J. Wong
@ 2022-02-26  2:54 ` Darrick J. Wong
  2022-02-26 21:37   ` Dave Chinner
  2022-02-28 21:44   ` Eric Sandeen
  18 siblings, 2 replies; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-26  2:54 UTC (permalink / raw)
  To: sandeen
  Cc: Christoph Hellwig, Dave Chinner, Theodore Ts'o, linux-xfs,
	allison.henderson

From: Darrick J. Wong <djwong@kernel.org>

Recently, the upstream kernel maintainer has been taking a lot of heat on
account of writer threads encountering high latency when asking for log
grant space when the log is small.  The reported use case is a heavily
threaded indexing product logging trace information to a filesystem
ranging in size between 20 and 250GB.  The meetings that result from the
complaints about latency and stall warnings in dmesg both from this use
case and also a large well known cloud product are now consuming 25% of
the maintainer's weekly time and have been for months.

For small filesystems, the log is small by default because we have
defaulted to a ratio of 1:2048 (or even less).  For grown filesystems,
this is even worse, because big filesystems generate big metadata.
However, the log size is still insufficient even if it is formatted at
the larger size.

Therefore, if we're writing a new filesystem format (aka bigtime), bump
the ratio unconditionally from 1:2048 to 1:256.  On a 220GB filesystem,
the 99.95% latencies observed with a 200-writer file synchronous append
workload running on a 44-AG filesystem (with 44 CPUs) spread across 4
hard disks showed:

Log Size (MB)	Latency (ms)	Throughput (MB/s)
10		520		243
20		220		308
40		140		360
80		92		363
160		86		364

For 4 NVME, the results were:

10		201		409
20		177		488
40		122		550
80		120		549
160		121		545

Hence we increase the ratio by 16x because there doesn't seem to be much
improvement beyond that, and we don't want the log to grow /too/ large.
This change does not affect filesystems larger than 4TB, nor does it
affect formatting to older formats.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 mkfs/xfs_mkfs.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 96682f9a..7178d798 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3308,7 +3308,17 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
 
 	/* internal log - if no size specified, calculate automatically */
 	if (!cfg->logblocks) {
-		if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
+		if (cfg->sb_feat.bigtime) {
+			/*
+			 * Starting with bigtime, everybody gets a 256:1 ratio
+			 * of fs size to log size unless they say otherwise.
+			 * Larger logs reduce contention for log grant space,
+			 * which is now a problem with the advent of small
+			 * non-rotational storage devices.
+			 */
+			cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 256;
+			cfg->logblocks = cfg->logblocks >> cfg->blocklog;
+		} else if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
 			/* tiny filesystems get minimum sized logs. */
 			cfg->logblocks = min_logblocks;
 		} else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-02-26  2:54 ` [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems Darrick J. Wong
@ 2022-02-26 21:37   ` Dave Chinner
  2022-02-28 23:22     ` Darrick J. Wong
  2022-02-28 21:44   ` Eric Sandeen
  1 sibling, 1 reply; 60+ messages in thread
From: Dave Chinner @ 2022-02-26 21:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sandeen, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	linux-xfs, allison.henderson

On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Recently, the upstream kernel maintainer has been taking a lot of heat on
> account of writer threads encountering high latency when asking for log
> grant space when the log is small.  The reported use case is a heavily
> threaded indexing product logging trace information to a filesystem
> ranging in size between 20 and 250GB.  The meetings that result from the
> complaints about latency and stall warnings in dmesg both from this use
> case and also a large well known cloud product are now consuming 25% of
> the maintainer's weekly time and have been for months.

Is the transaction reservation space exhaustion caused by, as I
pointed out in another thread yesterday, the unbound concurrency in
IO completion? i.e. we have hundreds of active concurrent
transactions that then block on common objects between them (e.g.
inode locks) and serialise? Hence only handful of completions can
actually run concurrently, depsite every completion holding a full
reservation of log space to allow them to run concurrently?

> For small filesystems, the log is small by default because we have
> defaulted to a ratio of 1:2048 (or even less).  For grown filesystems,
> this is even worse, because big filesystems generate big metadata.
> However, the log size is still insufficient even if it is formatted at
> the larger size.
> 
> Therefore, if we're writing a new filesystem format (aka bigtime), bump
> the ratio unconditionally from 1:2048 to 1:256.  On a 220GB filesystem,
> the 99.95% latencies observed with a 200-writer file synchronous append
> workload running on a 44-AG filesystem (with 44 CPUs) spread across 4
> hard disks showed:
> 
> Log Size (MB)	Latency (ms)	Throughput (MB/s)
> 10		520		243
> 20		220		308
> 40		140		360
> 80		92		363
> 160		86		364
> 
> For 4 NVME, the results were:
> 
> 10		201		409
> 20		177		488
> 40		122		550
> 80		120		549
> 160		121		545
> 
> Hence we increase the ratio by 16x because there doesn't seem to be much
> improvement beyond that, and we don't want the log to grow /too/ large.

1:2048 -> 1:256 is an 8x bump, yes?  Which means we'll get a 2GB log
on a 512GB filesystem, and the 220GB log you tested is getting a
~1GB log?

I also wonder if the right thing to do here is just set a minimum
log size of 32MB? The worst of the long tail latencies are mitigated
by this point, and so even small filesystems grown out to 200GB will
have a log size that results in decent performance for this sort of
workload.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 16/17] mkfs: add a config file for x86_64 pmem filesystems
  2022-02-26  2:52   ` [PATCH v2 " Darrick J. Wong
@ 2022-02-28 21:37     ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-28 21:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, allison.henderson

On 2/25/22 8:52 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> We have a handful of users who continually ping the maintainer with
> questions about why pmem and dax don't work quite the way they want
> (which is to say 2MB extents and PMD mappings) because they copy-pasted
> some garbage from Google that's wrong.  Encode the correct defaults into
> a mkfs config file and ship that.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v3 12/17] xfs_scrub: report optional features in version string
  2022-02-26  2:53   ` [PATCH v3 " Darrick J. Wong
@ 2022-02-28 21:38     ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-02-28 21:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Theodore Ts'o, linux-xfs, allison.henderson

On 2/25/22 8:53 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Ted Ts'o reported brittleness in the fstests logic in generic/45[34] to
> detect whether or not xfs_scrub is capable of detecting Unicode mischief
> in directory and xattr names.  This is a compile-time feature, since we
> do not assume that all distros will want to ship xfsprogs with libicu.
> 
> Rather than relying on ldd tests (which don't work at all if xfs_scrub
> is compiled statically), let's have -V print whether or not the feature
> is built into the tool.  Phase 5 still requires the presence of "UTF-8"
> in LC_MESSAGES to enable Unicode confusable detection; this merely makes
> the feature easier to discover.
> 
> Reported-by: Theodore Ts'o <tytso@mit.edu>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> v2: correct the name of the reporter
> v3: only report if -VV specified

Thanks.

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-02-26  2:54 ` [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems Darrick J. Wong
  2022-02-26 21:37   ` Dave Chinner
@ 2022-02-28 21:44   ` Eric Sandeen
  2022-03-01  2:21     ` Darrick J. Wong
  1 sibling, 1 reply; 60+ messages in thread
From: Eric Sandeen @ 2022-02-28 21:44 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Dave Chinner, Theodore Ts'o, linux-xfs,
	allison.henderson

On 2/25/22 8:54 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>

...

> Hence we increase the ratio by 16x because there doesn't seem to be much
> improvement beyond that, and we don't want the log to grow /too/ large.
> This change does not affect filesystems larger than 4TB, nor does it
> affect formatting to older formats.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  mkfs/xfs_mkfs.c |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 96682f9a..7178d798 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -3308,7 +3308,17 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
>  
>  	/* internal log - if no size specified, calculate automatically */
>  	if (!cfg->logblocks) {
> -		if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
> +		if (cfg->sb_feat.bigtime) {

I'm not very keen on conditioning this on bigtime; it seems very ad-hoc and
unexpected. Future maintainers will look at this and wonder why bigtime is
in any way related to log size...

If we make this change, why not just make it regardless of other features?
Is there some other risk to doing so?

Thanks,
-Eric

> +			/*
> +			 * Starting with bigtime, everybody gets a 256:1 ratio
> +			 * of fs size to log size unless they say otherwise.
> +			 * Larger logs reduce contention for log grant space,
> +			 * which is now a problem with the advent of small
> +			 * non-rotational storage devices.
> +			 */
> +			cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 256;
> +			cfg->logblocks = cfg->logblocks >> cfg->blocklog;
> +		} else if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
>  			/* tiny filesystems get minimum sized logs. */
>  			cfg->logblocks = min_logblocks;
>  		} else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-02-26 21:37   ` Dave Chinner
@ 2022-02-28 23:22     ` Darrick J. Wong
  2022-03-01  0:42       ` Dave Chinner
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-02-28 23:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	linux-xfs, allison.henderson

On Sun, Feb 27, 2022 at 08:37:20AM +1100, Dave Chinner wrote:
> On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Recently, the upstream kernel maintainer has been taking a lot of heat on
> > account of writer threads encountering high latency when asking for log
> > grant space when the log is small.  The reported use case is a heavily
> > threaded indexing product logging trace information to a filesystem
> > ranging in size between 20 and 250GB.  The meetings that result from the
> > complaints about latency and stall warnings in dmesg both from this use
> > case and also a large well known cloud product are now consuming 25% of
> > the maintainer's weekly time and have been for months.
> 
> Is the transaction reservation space exhaustion caused by, as I
> pointed out in another thread yesterday, the unbound concurrency in
> IO completion?

No.  They're using synchronous directio writes to write trace data in 4k
chunks.  The number of files does not exceed the number of writer
threads, and the number of writer threads can be up to 10*NR_CPUS (~400
on the test system).  If I'm reading the iomap directio code correctly,
the writer threads block and do not issue more IO until the first IO
completes...

> i.e. we have hundreds of active concurrent
> transactions that then block on common objects between them (e.g.
> inode locks) and serialise?

...so yes, there are hundreds of active transactions, but (AFAICT) they
mostly don't share objects, other than the log itself.  Once we made the
log bigger, the hotspot moved to the AGF buffers.  I'm not sure what to
do about /that/, since a 5GB AG is pretty small.  That aside...

> Hence only handful of completions can
> actually run concurrently, depsite every completion holding a full
> reservation of log space to allow them to run concurrently?

...this is still an issue for different scenarios.  I would still be
interested in experimenting with constraining the number of writeback
completion workers that get started, even though that isn't at play
here.

> > For small filesystems, the log is small by default because we have
> > defaulted to a ratio of 1:2048 (or even less).  For grown filesystems,
> > this is even worse, because big filesystems generate big metadata.
> > However, the log size is still insufficient even if it is formatted at
> > the larger size.
> > 
> > Therefore, if we're writing a new filesystem format (aka bigtime), bump
> > the ratio unconditionally from 1:2048 to 1:256.  On a 220GB filesystem,
> > the 99.95% latencies observed with a 200-writer file synchronous append
> > workload running on a 44-AG filesystem (with 44 CPUs) spread across 4
> > hard disks showed:
> > 
> > Log Size (MB)	Latency (ms)	Throughput (MB/s)
> > 10		520		243w
> > 20		220		308
> > 40		140		360
> > 80		92		363
> > 160		86		364
> > 
> > For 4 NVME, the results were:
> > 
> > 10		201		409
> > 20		177		488
> > 40		122		550
> > 80		120		549
> > 160		121		545
> > 
> > Hence we increase the ratio by 16x because there doesn't seem to be much
> > improvement beyond that, and we don't want the log to grow /too/ large.
> 
> 1:2048 -> 1:256 is an 8x bump, yes?  Which means we'll get a 2GB log
> on a 512GB filesystem, and the 220GB log you tested is getting a
> ~1GB log?

Right.

> I also wonder if the right thing to do here is just set a minimum
> log size of 32MB? The worst of the long tail latencies are mitigated
> by this point, and so even small filesystems grown out to 200GB will
> have a log size that results in decent performance for this sort of
> workload.

Are you asking for a second patch where mkfs refuses to format a log
smaller than 32MB (e.g. 8GB with the x86 defaults)?  Or a second patch
that cranks the minimum log size up to 32MB, even if that leads to
absurd results (e.g. 66MB filesystems with 2 AGs and a 32MB log)?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-02-28 23:22     ` Darrick J. Wong
@ 2022-03-01  0:42       ` Dave Chinner
  2022-03-01  2:38         ` Darrick J. Wong
  2022-03-01  3:10         ` Dave Chinner
  0 siblings, 2 replies; 60+ messages in thread
From: Dave Chinner @ 2022-03-01  0:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sandeen, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	linux-xfs, allison.henderson

On Mon, Feb 28, 2022 at 03:22:11PM -0800, Darrick J. Wong wrote:
> On Sun, Feb 27, 2022 at 08:37:20AM +1100, Dave Chinner wrote:
> > On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Recently, the upstream kernel maintainer has been taking a lot of heat on
> > > account of writer threads encountering high latency when asking for log
> > > grant space when the log is small.  The reported use case is a heavily
> > > threaded indexing product logging trace information to a filesystem
> > > ranging in size between 20 and 250GB.  The meetings that result from the
> > > complaints about latency and stall warnings in dmesg both from this use
> > > case and also a large well known cloud product are now consuming 25% of
> > > the maintainer's weekly time and have been for months.
> > 
> > Is the transaction reservation space exhaustion caused by, as I
> > pointed out in another thread yesterday, the unbound concurrency in
> > IO completion?
> 
> No.  They're using synchronous directio writes to write trace data in 4k

synchronous as in O_(D)SYNC or as in "not using AIO"? Is is also
append data, and is it one file per logging thread or multiple
threads writing to a single file?

> chunks.  The number of files does not exceed the number of writer
> threads, and the number of writer threads can be up to 10*NR_CPUS (~400
> on the test system).  If I'm reading the iomap directio code correctly,
> the writer threads block and do not issue more IO until the first IO
> completes...

So, up to 400 threads concurrently issuing IO that does block
allocation and performing unwritten extent conversion, so up to ~800
concurrent running allocation related transactions at a time?

> > i.e. we have hundreds of active concurrent
> > transactions that then block on common objects between them (e.g.
> > inode locks) and serialise?
> 
> ...so yes, there are hundreds of active transactions, but (AFAICT) they
> mostly don't share objects, other than the log itself.  Once we made the
> log bigger, the hotspot moved to the AGF buffers.  I'm not sure what to
> do about /that/, since a 5GB AG is pretty small.  That aside...

No surprise, AG selection is based on the is based on trying to get
an adjacent extent for file extension. Hence assuming random
distribution because of contention and skipping done by the search
algorithm, then if we have ~50 AGs and 400 writers trying to
allocate at the same time then you've got, on average, 8 allocations
per AG being attempted roughly concurrently.

Of course, append write workloads tend to respond really well to
extent size hints - make sure you allocate a large chunk that
extents beyond EOF on the first write, then subsequent extending
writes only need unwritten extent conversion which shouldn't need
AGF access because it won't require BMBT block allocation during
conversion because it's just taking away from the unwritten extent
and putting the space into the adjacent written extent.

> > Hence only handful of completions can
> > actually run concurrently, depsite every completion holding a full
> > reservation of log space to allow them to run concurrently?
> 
> ...this is still an issue for different scenarios.  I would still be
> interested in experimenting with constraining the number of writeback
> completion workers that get started, even though that isn't at play
> here.

Well, the "running out of log space" problem is still going to
largely be caused by having up to 400 concurrent unwritten extent
conversion transactions running at any given point in time...

> > I also wonder if the right thing to do here is just set a minimum
> > log size of 32MB? The worst of the long tail latencies are mitigated
> > by this point, and so even small filesystems grown out to 200GB will
> > have a log size that results in decent performance for this sort of
> > workload.
> 
> Are you asking for a second patch where mkfs refuses to format a log
> smaller than 32MB (e.g. 8GB with the x86 defaults)?  Or a second patch
> that cranks the minimum log size up to 32MB, even if that leads to
> absurd results (e.g. 66MB filesystems with 2 AGs and a 32MB log)?

I'm suggesting the latter.

Along with a refusal to make an XFS filesystem smaller than, say,
256MB, because allowing users to make tiny XFS filesystems seems to
always just lead to future troubles.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-02-28 21:44   ` Eric Sandeen
@ 2022-03-01  2:21     ` Darrick J. Wong
  2022-03-01  2:44       ` Eric Sandeen
  0 siblings, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-03-01  2:21 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Christoph Hellwig, Dave Chinner, Theodore Ts'o, linux-xfs,
	allison.henderson

On Mon, Feb 28, 2022 at 03:44:29PM -0600, Eric Sandeen wrote:
> On 2/25/22 8:54 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> 
> ...
> 
> > Hence we increase the ratio by 16x because there doesn't seem to be much
> > improvement beyond that, and we don't want the log to grow /too/ large.
> > This change does not affect filesystems larger than 4TB, nor does it
> > affect formatting to older formats.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  mkfs/xfs_mkfs.c |   12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> > index 96682f9a..7178d798 100644
> > --- a/mkfs/xfs_mkfs.c
> > +++ b/mkfs/xfs_mkfs.c
> > @@ -3308,7 +3308,17 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
> >  
> >  	/* internal log - if no size specified, calculate automatically */
> >  	if (!cfg->logblocks) {
> > -		if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
> > +		if (cfg->sb_feat.bigtime) {
> 
> I'm not very keen on conditioning this on bigtime; it seems very ad-hoc and
> unexpected. Future maintainers will look at this and wonder why bigtime is
> in any way related to log size...
> 
> If we make this change, why not just make it regardless of other features?
> Is there some other risk to doing so?

I wrote it this way to leave the formatting behavior unchanged on older
filesystems, figuring that you'd be wary of anything that could generate
bug reports about old fs formats, e.g. "Why does my cloud deployment
image generator report that the minified filesystem size went up when I
went from X to X+1?"

So now that I've guessed incorrectly, I guess I'll go change this to do
it unconditionally.  Or drop the whole thing entirely.  I don't know.
I'm too burned out to be able to think reasonably anymore.

Frankly, I don't have the time to prove beyond a reasonable doubt that
this the problem is exactly as stated, that the code change is exactly
the correct fix, that no other fix is more appropriate, and that there
are no other possible explanations for the slowness being complained
aobut.  I really just wanted to do this one little thing that we've all
basically agreed is the right thing.

Instead I'll just leave things as they are, and consider whether there
even is a future for me working on XFS.

--D

> Thanks,
> -Eric
> 
> > +			/*
> > +			 * Starting with bigtime, everybody gets a 256:1 ratio
> > +			 * of fs size to log size unless they say otherwise.
> > +			 * Larger logs reduce contention for log grant space,
> > +			 * which is now a problem with the advent of small
> > +			 * non-rotational storage devices.
> > +			 */
> > +			cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 256;
> > +			cfg->logblocks = cfg->logblocks >> cfg->blocklog;
> > +		} else if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
> >  			/* tiny filesystems get minimum sized logs. */
> >  			cfg->logblocks = min_logblocks;
> >  		} else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
> > 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-03-01  0:42       ` Dave Chinner
@ 2022-03-01  2:38         ` Darrick J. Wong
  2022-03-01 15:55           ` Brian Foster
  2022-03-01  3:10         ` Dave Chinner
  1 sibling, 1 reply; 60+ messages in thread
From: Darrick J. Wong @ 2022-03-01  2:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	linux-xfs, allison.henderson

On Tue, Mar 01, 2022 at 11:42:49AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 03:22:11PM -0800, Darrick J. Wong wrote:
> > On Sun, Feb 27, 2022 at 08:37:20AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Recently, the upstream kernel maintainer has been taking a lot of heat on
> > > > account of writer threads encountering high latency when asking for log
> > > > grant space when the log is small.  The reported use case is a heavily
> > > > threaded indexing product logging trace information to a filesystem
> > > > ranging in size between 20 and 250GB.  The meetings that result from the
> > > > complaints about latency and stall warnings in dmesg both from this use
> > > > case and also a large well known cloud product are now consuming 25% of
> > > > the maintainer's weekly time and have been for months.
> > > 
> > > Is the transaction reservation space exhaustion caused by, as I
> > > pointed out in another thread yesterday, the unbound concurrency in
> > > IO completion?
> > 
> > No.  They're using synchronous directio writes to write trace data in 4k
> 
> synchronous as in O_(D)SYNC or as in "not using AIO"? Is is also
> append data, and is it one file per logging thread or multiple
> threads writing to a single file?

One file per <cough> tenant, and multiple threads spread across all the
tenants on a system.  This means that all the threads can end up
pounding on a single file, or each thread can write to a single file.
The scenario that generated all those numbers (I think) is opening the
file O_DIRECT|O_SYNC and pwrite()ing blocks at EOF.

> > chunks.  The number of files does not exceed the number of writer
> > threads, and the number of writer threads can be up to 10*NR_CPUS (~400
> > on the test system).  If I'm reading the iomap directio code correctly,
> > the writer threads block and do not issue more IO until the first IO
> > completes...
> 
> So, up to 400 threads concurrently issuing IO that does block
> allocation and performing unwritten extent conversion, so up to ~800
> concurrent running allocation related transactions at a time?

Assuming you got 800 from the 400 writers + 400 bmbt allocations, yes.
Though I wouldn't count /quite/ that high for bmbt expansions.

> > > i.e. we have hundreds of active concurrent
> > > transactions that then block on common objects between them (e.g.
> > > inode locks) and serialise?
> > 
> > ...so yes, there are hundreds of active transactions, but (AFAICT) they
> > mostly don't share objects, other than the log itself.  Once we made the
> > log bigger, the hotspot moved to the AGF buffers.  I'm not sure what to
> > do about /that/, since a 5GB AG is pretty small.  That aside...
> 
> No surprise, AG selection is based on the is based on trying to get
> an adjacent extent for file extension. Hence assuming random
> distribution because of contention and skipping done by the search
> algorithm, then if we have ~50 AGs and 400 writers trying to
> allocate at the same time then you've got, on average, 8 allocations
> per AG being attempted roughly concurrently.
> 
> Of course, append write workloads tend to respond really well to
> extent size hints - make sure you allocate a large chunk that
> extents beyond EOF on the first write, then subsequent extending
> writes only need unwritten extent conversion which shouldn't need
> AGF access because it won't require BMBT block allocation during
> conversion because it's just taking away from the unwritten extent
> and putting the space into the adjacent written extent.

Yes, I know, I've been battling $internalgroups for over a year now to
get extent size hints turned on for things like append logs and VM
images.  If the IO sizes get larger (or they turn on extent size hints)
then the heat goes down...

> > > Hence only handful of completions can
> > > actually run concurrently, depsite every completion holding a full
> > > reservation of log space to allow them to run concurrently?
> > 
> > ...this is still an issue for different scenarios.  I would still be
> > interested in experimenting with constraining the number of writeback
> > completion workers that get started, even though that isn't at play
> > here.
> 
> Well, the "running out of log space" problem is still going to
> largely be caused by having up to 400 concurrent unwritten extent
> conversion transactions running at any given point in time...

I know!  But right now the default log size you get with a 220G volume
is 110MB.  An 800MB log just pushes the scaling problems around the
system, but on the plus side the latency is no longer so high that the
heartbeat timers trip, which causes node evictions, which leads to even
/worse/ escalations from customers whose systems experience high rates
of failure.  Every single customer who escalates this causes another
round of bug reports and another round of meetings for me, even though
the causes and the bandaids are the same.

> > > I also wonder if the right thing to do here is just set a minimum
> > > log size of 32MB? The worst of the long tail latencies are mitigated
> > > by this point, and so even small filesystems grown out to 200GB will
> > > have a log size that results in decent performance for this sort of
> > > workload.
> > 
> > Are you asking for a second patch where mkfs refuses to format a log
> > smaller than 32MB (e.g. 8GB with the x86 defaults)?  Or a second patch
> > that cranks the minimum log size up to 32MB, even if that leads to
> > absurd results (e.g. 66MB filesystems with 2 AGs and a 32MB log)?
> 
> I'm suggesting the latter.
> 
> Along with a refusal to make an XFS filesystem smaller than, say,
> 256MB, because allowing users to make tiny XFS filesystems seems to
> always just lead to future troubles.

I'm going to drop this patch, because I don't have the strength to push
back against you /and/ Eric.  I'm once again in a crunch because I've
spent so much of the past few weeks in meetings about bugs that I didn't
have time to try out Allison's LARP patches so I basically have nothing
to push for 5.18 because we're past -rc6 and it's too late, too late for
anything.  Hell, I can't even get that capable() thing reviewed, and
(right now) that's causing me even more meetingsgalore pain.

Is it so dangerous to turn this on so that real users can experiment
with the new setting?

Frankly I'm also thinking about throwing away six years of online repair
work because I just don't see myself being able to summon the mental
energy to run the review process when simple things are this hard.
Every week I sit down and ask myself if I really have what it takes to
keep going, and I've reached the point where the answer is NO.  I was
really hoping that some of our discussions about process improvements
would have made things better, but instead I realize that I have failed
as maintainer.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-03-01  2:21     ` Darrick J. Wong
@ 2022-03-01  2:44       ` Eric Sandeen
  0 siblings, 0 replies; 60+ messages in thread
From: Eric Sandeen @ 2022-03-01  2:44 UTC (permalink / raw)
  To: Darrick J. Wong, Eric Sandeen
  Cc: Christoph Hellwig, Dave Chinner, Theodore Ts'o, linux-xfs,
	allison.henderson

On 2/28/22 8:21 PM, Darrick J. Wong wrote:
> On Mon, Feb 28, 2022 at 03:44:29PM -0600, Eric Sandeen wrote:
>> On 2/25/22 8:54 PM, Darrick J. Wong wrote:
>>> From: Darrick J. Wong <djwong@kernel.org>
>>
>> ...
>>
>>> Hence we increase the ratio by 16x because there doesn't seem to be much
>>> improvement beyond that, and we don't want the log to grow /too/ large.
>>> This change does not affect filesystems larger than 4TB, nor does it
>>> affect formatting to older formats.
>>>
>>> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>>> ---
>>>  mkfs/xfs_mkfs.c |   12 +++++++++++-
>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
>>> index 96682f9a..7178d798 100644
>>> --- a/mkfs/xfs_mkfs.c
>>> +++ b/mkfs/xfs_mkfs.c
>>> @@ -3308,7 +3308,17 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
>>>  
>>>  	/* internal log - if no size specified, calculate automatically */
>>>  	if (!cfg->logblocks) {
>>> -		if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
>>> +		if (cfg->sb_feat.bigtime) {
>>
>> I'm not very keen on conditioning this on bigtime; it seems very ad-hoc and
>> unexpected. Future maintainers will look at this and wonder why bigtime is
>> in any way related to log size...
>>
>> If we make this change, why not just make it regardless of other features?
>> Is there some other risk to doing so?
> 
> I wrote it this way to leave the formatting behavior unchanged on older
> filesystems, figuring that you'd be wary of anything that could generate
> bug reports about old fs formats, e.g. "Why does my cloud deployment
> image generator report that the minified filesystem size went up when I
> went from X to X+1?"

We might run into that, but I'm perhaps naively willing to tell people that
if they were hard-coding reverse-engineered filesystem heuristics to the
nearest kilobyte, too bad so sad, they were doing it wrong.

> So now that I've guessed incorrectly, I guess I'll go change this to do
> it unconditionally.  Or drop the whole thing entirely.  I don't know.
> I'm too burned out to be able to think reasonably anymore.

I (maybe also naively) think it's reasonable to increase the log size for
small filesystems, given that they often as not become large filesystems
these days, and the ultimate % increase in size will be negligible.

(It's somewhere on my todo list to figure out how various products and
provisioning mechanisms actually create, transport, and expand these
minimized images, to see what the requirements really are ...)

> Frankly, I don't have the time to prove beyond a reasonable doubt that
> this the problem is exactly as stated, that the code change is exactly
> the correct fix, that no other fix is more appropriate, and that there
> are no other possible explanations for the slowness being complained
> aobut.  I really just wanted to do this one little thing that we've all
> basically agreed is the right thing.

I do agree. I just think that if the concern is a distro noticing the
difference, then the distro can carry the patch to get rid of that difference.
That's something I have had to do as well a times. With my upstream hat on, I
would rather keep distro version concerns out of the upstream patch, is all.

> Instead I'll just leave things as they are, and consider whether there
> even is a future for me working on XFS.

I'd miss you a lot if that happened. :(

-Eric

> --D
> 
>> Thanks,
>> -Eric


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-03-01  0:42       ` Dave Chinner
  2022-03-01  2:38         ` Darrick J. Wong
@ 2022-03-01  3:10         ` Dave Chinner
  1 sibling, 0 replies; 60+ messages in thread
From: Dave Chinner @ 2022-03-01  3:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: sandeen, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
	linux-xfs, allison.henderson

On Tue, Mar 01, 2022 at 11:42:49AM +1100, Dave Chinner wrote:
> On Mon, Feb 28, 2022 at 03:22:11PM -0800, Darrick J. Wong wrote:
> > On Sun, Feb 27, 2022 at 08:37:20AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Recently, the upstream kernel maintainer has been taking a lot of heat on
> > > > account of writer threads encountering high latency when asking for log
> > > > grant space when the log is small.  The reported use case is a heavily
> > > > threaded indexing product logging trace information to a filesystem
> > > > ranging in size between 20 and 250GB.  The meetings that result from the
> > > > complaints about latency and stall warnings in dmesg both from this use
> > > > case and also a large well known cloud product are now consuming 25% of
> > > > the maintainer's weekly time and have been for months.
> > > 
> > > Is the transaction reservation space exhaustion caused by, as I
> > > pointed out in another thread yesterday, the unbound concurrency in
> > > IO completion?
> > 
> > No.  They're using synchronous directio writes to write trace data in 4k
> 
> synchronous as in O_(D)SYNC or as in "not using AIO"? Is is also
> append data, and is it one file per logging thread or multiple
> threads writing to a single file?
> 
> > chunks.  The number of files does not exceed the number of writer
> > threads, and the number of writer threads can be up to 10*NR_CPUS (~400
> > on the test system).  If I'm reading the iomap directio code correctly,
> > the writer threads block and do not issue more IO until the first IO
> > completes...
> 
> So, up to 400 threads concurrently issuing IO that does block
> allocation and performing unwritten extent conversion, so up to ~800
> concurrent running allocation related transactions at a time?

Let me put this a different way: optimal log size is determined by
workload transaction concurrency, not filesystem capacity.

I just did a quick check of the write reservation that unwritten
extent conversion uses, and it was 150kB on my test system. If there
are going to be 400 of these unwritten completions running at once,
then for them to not block on each other, there must be more than
60MB of free log space available at all times.

If we assume the AIL tail pushing is keeping up with incoming
changes, that means we are keeping 25% of the log space is free for
in-memory transaction reservations. That means for this IO
completion transaction workload, we need a log size of at least
240MB to ensure free log space keeps ahead of transaction
concurrency. So even for really small filesysetms, this workload
needs a large log size.

For default sizing, we use filesystem size as a proxy for
concurrency - the larger the filesystem, the more workload
concurrency it can support, so the more AGs and log space is
requires. We can't get this right for every single type of storage
or workload that is out there, and we've always known this.  This is
one of the reasons behind developing mkfs config file support.

That is, if you have an application that doesn't fit the
general case and has specific log size requirements, you can write
a mkfs recipe for it and ship that so that users don't need to know
the details, just mkfs using the specific config file for that
application. In this case, the workload specific mkfs config file
simply needs to have "logsize = 500MB" and it will work for all
installations regardless of the filesystem capacity it is running
on.

Hence I'm not sure that tying the default log size to "we have a
problematic workload where small filesystems need X concurrency" is
exactly the right way to proceed here. It's the messenger, but it
should not dictate a specific solution to a much more general
problem.

Instead, I think we should be looking at things like the rotational
flag on the storage device. If that's not set, we can substantially
increase the AG count and log size scaling rate because those
devices will support much more IO and transaction concurrency.

We should be looking at CPU count, and scaling log size and AG count
based on the number of CPUs. The more CPUs we have, the more likely
we are going to see large amounts of concurrency, and so AG count
and log size should be adjusted to suit.

These are generic scaling factors that will improve the situation
for the workload being described, as well as improve the situation
for any other workload that runs on hardware that has high
concurrency capability. We still use capacity as a proxy for
scalability, we just adjust the ramp up ratio based on the
capabilities of the system presented to us.

I know, this doesn't solve the crazy "start tiny, grow 1000x before
deploy" issues, but that's a separate problem that really needs to
be addressed with independent changes such as updating minimum log
and AG sizes to mitigate those sorts of growth issues.

> > > I also wonder if the right thing to do here is just set a minimum
> > > log size of 32MB? The worst of the long tail latencies are mitigated
> > > by this point, and so even small filesystems grown out to 200GB will
> > > have a log size that results in decent performance for this sort of
> > > workload.
> > 
> > Are you asking for a second patch where mkfs refuses to format a log
> > smaller than 32MB (e.g. 8GB with the x86 defaults)?  Or a second patch
> > that cranks the minimum log size up to 32MB, even if that leads to
> > absurd results (e.g. 66MB filesystems with 2 AGs and a 32MB log)?
> 
> I'm suggesting the latter.
> 
> Along with a refusal to make an XFS filesystem smaller than, say,
> 256MB, because allowing users to make tiny XFS filesystems seems to
> always just lead to future troubles.

When I say stuff like this, keep in mind that I'm trying to suggest
things that will stand for years, not just alleviate the immediate
concerns. Really small XFS filesystems are at the end of the scale
we really don't care about, so let's just stop with the make-beleive
that we'll support and optimise for them. Let's just set some new
limits that are appropriate for the sort of storage and minimum
sizes we actually see XFS used for.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems
  2022-03-01  2:38         ` Darrick J. Wong
@ 2022-03-01 15:55           ` Brian Foster
  0 siblings, 0 replies; 60+ messages in thread
From: Brian Foster @ 2022-03-01 15:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, sandeen, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, linux-xfs, allison.henderson

On Mon, Feb 28, 2022 at 06:38:22PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 01, 2022 at 11:42:49AM +1100, Dave Chinner wrote:
> > On Mon, Feb 28, 2022 at 03:22:11PM -0800, Darrick J. Wong wrote:
> > > On Sun, Feb 27, 2022 at 08:37:20AM +1100, Dave Chinner wrote:
> > > > On Fri, Feb 25, 2022 at 06:54:50PM -0800, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > Recently, the upstream kernel maintainer has been taking a lot of heat on
> > > > > account of writer threads encountering high latency when asking for log
> > > > > grant space when the log is small.  The reported use case is a heavily
> > > > > threaded indexing product logging trace information to a filesystem
> > > > > ranging in size between 20 and 250GB.  The meetings that result from the
> > > > > complaints about latency and stall warnings in dmesg both from this use
> > > > > case and also a large well known cloud product are now consuming 25% of
> > > > > the maintainer's weekly time and have been for months.
> > > > 
> > > > Is the transaction reservation space exhaustion caused by, as I
> > > > pointed out in another thread yesterday, the unbound concurrency in
> > > > IO completion?
> > > 
> > > No.  They're using synchronous directio writes to write trace data in 4k
> > 
> > synchronous as in O_(D)SYNC or as in "not using AIO"? Is is also
> > append data, and is it one file per logging thread or multiple
> > threads writing to a single file?
> 
> One file per <cough> tenant, and multiple threads spread across all the
> tenants on a system.  This means that all the threads can end up
> pounding on a single file, or each thread can write to a single file.
> The scenario that generated all those numbers (I think) is opening the
> file O_DIRECT|O_SYNC and pwrite()ing blocks at EOF.
> 
> > > chunks.  The number of files does not exceed the number of writer
> > > threads, and the number of writer threads can be up to 10*NR_CPUS (~400
> > > on the test system).  If I'm reading the iomap directio code correctly,
> > > the writer threads block and do not issue more IO until the first IO
> > > completes...
> > 
> > So, up to 400 threads concurrently issuing IO that does block
> > allocation and performing unwritten extent conversion, so up to ~800
> > concurrent running allocation related transactions at a time?
> 
> Assuming you got 800 from the 400 writers + 400 bmbt allocations, yes.
> Though I wouldn't count /quite/ that high for bmbt expansions.
> 
> > > > i.e. we have hundreds of active concurrent
> > > > transactions that then block on common objects between them (e.g.
> > > > inode locks) and serialise?
> > > 
> > > ...so yes, there are hundreds of active transactions, but (AFAICT) they
> > > mostly don't share objects, other than the log itself.  Once we made the
> > > log bigger, the hotspot moved to the AGF buffers.  I'm not sure what to
> > > do about /that/, since a 5GB AG is pretty small.  That aside...
> > 
> > No surprise, AG selection is based on the is based on trying to get
> > an adjacent extent for file extension. Hence assuming random
> > distribution because of contention and skipping done by the search
> > algorithm, then if we have ~50 AGs and 400 writers trying to
> > allocate at the same time then you've got, on average, 8 allocations
> > per AG being attempted roughly concurrently.
> > 
> > Of course, append write workloads tend to respond really well to
> > extent size hints - make sure you allocate a large chunk that
> > extents beyond EOF on the first write, then subsequent extending
> > writes only need unwritten extent conversion which shouldn't need
> > AGF access because it won't require BMBT block allocation during
> > conversion because it's just taking away from the unwritten extent
> > and putting the space into the adjacent written extent.
> 
> Yes, I know, I've been battling $internalgroups for over a year now to
> get extent size hints turned on for things like append logs and VM
> images.  If the IO sizes get larger (or they turn on extent size hints)
> then the heat goes down...
> 
> > > > Hence only handful of completions can
> > > > actually run concurrently, depsite every completion holding a full
> > > > reservation of log space to allow them to run concurrently?
> > > 
> > > ...this is still an issue for different scenarios.  I would still be
> > > interested in experimenting with constraining the number of writeback
> > > completion workers that get started, even though that isn't at play
> > > here.
> > 
> > Well, the "running out of log space" problem is still going to
> > largely be caused by having up to 400 concurrent unwritten extent
> > conversion transactions running at any given point in time...
> 
> I know!  But right now the default log size you get with a 220G volume
> is 110MB.  An 800MB log just pushes the scaling problems around the
> system, but on the plus side the latency is no longer so high that the
> heartbeat timers trip, which causes node evictions, which leads to even
> /worse/ escalations from customers whose systems experience high rates
> of failure.  Every single customer who escalates this causes another
> round of bug reports and another round of meetings for me, even though
> the causes and the bandaids are the same.
> 
> > > > I also wonder if the right thing to do here is just set a minimum
> > > > log size of 32MB? The worst of the long tail latencies are mitigated
> > > > by this point, and so even small filesystems grown out to 200GB will
> > > > have a log size that results in decent performance for this sort of
> > > > workload.
> > > 
> > > Are you asking for a second patch where mkfs refuses to format a log
> > > smaller than 32MB (e.g. 8GB with the x86 defaults)?  Or a second patch
> > > that cranks the minimum log size up to 32MB, even if that leads to
> > > absurd results (e.g. 66MB filesystems with 2 AGs and a 32MB log)?
> > 
> > I'm suggesting the latter.
> > 
> > Along with a refusal to make an XFS filesystem smaller than, say,
> > 256MB, because allowing users to make tiny XFS filesystems seems to
> > always just lead to future troubles.
> 
> I'm going to drop this patch, because I don't have the strength to push
> back against you /and/ Eric.  I'm once again in a crunch because I've
> spent so much of the past few weeks in meetings about bugs that I didn't
> have time to try out Allison's LARP patches so I basically have nothing
> to push for 5.18 because we're past -rc6 and it's too late, too late for
> anything.  Hell, I can't even get that capable() thing reviewed, and
> (right now) that's causing me even more meetingsgalore pain.
> 
> Is it so dangerous to turn this on so that real users can experiment
> with the new setting?
> 
> Frankly I'm also thinking about throwing away six years of online repair
> work because I just don't see myself being able to summon the mental
> energy to run the review process when simple things are this hard.
> Every week I sit down and ask myself if I really have what it takes to
> keep going, and I've reached the point where the answer is NO.  I was
> really hoping that some of our discussions about process improvements
> would have made things better, but instead I realize that I have failed
> as maintainer.
> 

What process improvement discussions? I vaguely recall seeing something
around per-release planning or some such (??) in the not so distant
past, but not necessarily dealing with our ad hoc and wildly
inconsistent development process overall. For whatever it's worth, I've
lost a fair amount of patience with upstream XFS over probably the past
year or so, to the point where I basically consider it a dysfunctional
process. My only recourse to preserve sanity has been to limit the
amount of time and energy I contribute and refocus it elsewhere, so I
certainly sympathize with your sentiment.

However, I also think that an open development process is independent of
individual maintainership and much more of a function of what each
regular developer contributes to the process (in terms of code, review,
tone and quality of discussion, willingness to compromise at a level
commensurate with expectations of others, etc. etc.). IOW, we all share
responsibility in this regard and probably need to consider some
adjustments in behaviors and expectations if things are to improve.

I don't want to get too deep into the weeds here; I'm not cognizant of
all the various challenges that might impact folks differently, whether
it be maintainer, architect or developers of varying levels of
experience, etc. Suffice it to say that I'm open to any discussion aimed
toward trying to make the broader development experience smoother and
more consistent going forward.

Brian

> --D
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2022-03-01 15:55 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-20  0:21 [PATCHSET 00/17] xfsprogs: various 5.15 fixes Darrick J. Wong
2022-01-20  0:21 ` [PATCH 01/17] libxcmd: use emacs mode for command history editing Darrick J. Wong
2022-01-20  0:21 ` [PATCH 02/17] libxfs: shut down filesystem if we xfs_trans_cancel with deferred work items Darrick J. Wong
2022-02-04 21:36   ` Eric Sandeen
2022-02-04 21:47     ` Darrick J. Wong
2022-01-20  0:21 ` [PATCH 03/17] libxfs: don't leave dangling perag references from xfs_buf Darrick J. Wong
2022-02-04 22:05   ` Eric Sandeen
2022-01-20  0:21 ` [PATCH 04/17] libfrog: move the GETFSMAP definitions into libfrog Darrick J. Wong
2022-02-04 23:18   ` Eric Sandeen
2022-02-05  0:36     ` Darrick J. Wong
2022-02-07  1:05       ` Dave Chinner
2022-02-07 17:09         ` Darrick J. Wong
2022-02-07 21:32           ` Eric Sandeen
2022-02-10  3:33             ` Dave Chinner
2022-02-08 16:46   ` [PATCH v1.1 04/17] libfrog: always use the kernel GETFSMAP definitions Darrick J. Wong
2022-02-25 22:35     ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 05/17] misc: add a crc32c self test to mkfs and repair Darrick J. Wong
2022-02-04 23:23   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 06/17] libxfs-apply: support filterdiff >= 0.4.2 only Darrick J. Wong
2022-01-20  0:22 ` [PATCH 07/17] xfs_db: fix nbits parameter in fa_ino[48] functions Darrick J. Wong
2022-02-25 21:45   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 08/17] xfs_repair: explicitly cast resource usage counts in do_warn Darrick J. Wong
2022-02-25 21:46   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 09/17] xfs_repair: explicitly cast directory inode numbers " Darrick J. Wong
2022-02-25 21:48   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 10/17] xfs_repair: fix indentation problems in upgrade_filesystem Darrick J. Wong
2022-02-25 21:53   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 11/17] xfs_repair: update secondary superblocks after changing features Darrick J. Wong
2022-02-25 21:57   ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 12/17] xfs_scrub: report optional features in version string Darrick J. Wong
2022-01-20  1:16   ` Theodore Ts'o
2022-01-20  1:28     ` Darrick J. Wong
2022-01-20  1:32   ` [PATCH v2 " Darrick J. Wong
2022-02-25 22:14     ` Eric Sandeen
2022-02-26  0:04       ` Darrick J. Wong
2022-02-26  2:48         ` Darrick J. Wong
2022-02-26  2:53   ` [PATCH v3 " Darrick J. Wong
2022-02-28 21:38     ` Eric Sandeen
2022-01-20  0:22 ` [PATCH 13/17] mkfs: prevent corruption of passed-in suboption string values Darrick J. Wong
2022-01-20  0:22 ` [PATCH 14/17] mkfs: add configuration files for the last few LTS kernels Darrick J. Wong
2022-01-20  0:22 ` [PATCH 15/17] mkfs: document sample configuration file location Darrick J. Wong
2022-01-20  0:23 ` [PATCH 16/17] mkfs: add a config file for x86_64 pmem filesystems Darrick J. Wong
2022-02-25 22:21   ` Eric Sandeen
2022-02-26  2:38     ` Darrick J. Wong
2022-02-26  2:52   ` [PATCH v2 " Darrick J. Wong
2022-02-28 21:37     ` Eric Sandeen
2022-01-20  0:23 ` [PATCH 17/17] mkfs: enable inobtcount and bigtime by default Darrick J. Wong
2022-02-25 22:22   ` Eric Sandeen
2022-01-28 22:44 ` [PATCH 18/17] xfs_scrub: fix reporting if we can't open raw block devices Darrick J. Wong
2022-01-31 12:28   ` Christoph Hellwig
2022-02-26  2:54 ` [PATCH 19/17] mkfs: increase default log size for new (aka bigtime) filesystems Darrick J. Wong
2022-02-26 21:37   ` Dave Chinner
2022-02-28 23:22     ` Darrick J. Wong
2022-03-01  0:42       ` Dave Chinner
2022-03-01  2:38         ` Darrick J. Wong
2022-03-01 15:55           ` Brian Foster
2022-03-01  3:10         ` Dave Chinner
2022-02-28 21:44   ` Eric Sandeen
2022-03-01  2:21     ` Darrick J. Wong
2022-03-01  2:44       ` Eric Sandeen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.