* [PATCH] xfs_db: add extent count and file size histograms
@ 2019-05-14 18:50 Jorge Guerra
  2019-05-14 19:52 ` Eric Sandeen
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jorge Guerra @ 2019-05-14 18:50 UTC (permalink / raw)
  To: linux-xfs; +Cc: osandov, Jorge Guerra

From: Jorge Guerra <jorgeguerra@fb.com>

In this change we add two features to the xfs_db 'frag' command:

1) Extent count histogram [-e]: This option enables tracking the
   number of extents per inode (file) as we traverse the file
   system.  The end result is a histogram of the number of extents per
   file in power-of-2 buckets.

2) File size histogram and file system internal fragmentation stats
   [-s]: This option enables tracking file sizes both in terms of what
   has been physically allocated and how much has been written to the
   file.  In addition, we track the amount of internal fragmentation
   seen per file.  This is particularly useful in the case of
   realtime devices where space is allocated in units of fixed-size
   extents.

The man page for xfs_db has been updated to reflect these new command
line arguments.
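
For reference, both histograms use power-of-2 buckets: a value lands in
the smallest bucket "<= 2^i" that can hold it.  A minimal standalone
sketch of that mapping (illustrative only; it mirrors the get_bucket()
helper added by this patch and assumes val > 0):

static unsigned int bucket_of(uint64_t val)
{
	unsigned int msb = 63 - __builtin_clzl(val);	/* highest set bit */
	unsigned int lsb = __builtin_ctzl(val);		/* lowest set bit */

	/* round up unless val is already a power of two */
	return msb + (msb != lsb);
}

/* bucket_of(1) == 0 ("<= 1"), bucket_of(3) == 2 ("<= 4"),
 * bucket_of(17) == 5 ("<= 32") */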

Tests:

We tested this change on several XFS file systems with different
configurations:

1) regular XFS:

[root@m1 ~]# xfs_info /mnt/d0
meta-data=/dev/sdb1              isize=256    agcount=10, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2441608704, imaxpct=100
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@m1 ~]# echo "frag -e -s" | xfs_db -r /dev/sdb1
xfs_db> actual 494393, ideal 489246, fragmentation factor 1.04%
Note, this number is largely meaningless.
Files on this filesystem average 1.01 extents per file
Maximum extents in a file 17
Histogram of number of extents per file:
    bucket =       count        % of total
<=       1 =      486157        99.573 %
<=       2 =         768        0.157 %
<=       4 =         371        0.076 %
<=       8 =         947        0.194 %
<=      16 =           0        0.000 %
<=      32 =           1        0.000 %
Maximum file size 64.512 MB
Histogram of file size:
    bucket =        used        overhead(bytes)
<=    4 KB =      180515                   0 0.00%
<=    8 KB =       23604          4666970112 44.31%
<=   16 KB =        2712          1961668608 18.62%
<=   32 KB =        1695           612319232 5.81%
<=   64 KB =         290           473210880 4.49%
<=  128 KB =         214           270184448 2.56%
<=  256 KB =         186           269856768 2.56%
<=  512 KB =         201            67203072 0.64%
<=    1 MB =         325           267558912 2.54%
<=    2 MB =         419           596860928 5.67%
<=    4 MB =         436           454148096 4.31%
<=    8 MB =        1864           184532992 1.75%
<=   16 MB =       16084           111964160 1.06%
<=   32 MB =      258910           395116544 3.75%
<=   64 MB =          61           202104832 1.92%
<=  128 MB =         728                   0 0.00%
capacity used (bytes): 7210847514624 (6.558 TB)
block overhead (bytes): 10533699584 (0.146 %)
xfs_db>

2) XFS with a realtime device configured with 256 KiB extents:

[root@m2 ~]# xfs_info /mnt/d0
meta-data=/dev/nvme0n1p1         isize=2048   agcount=15, agsize=434112 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=6104576, imaxpct=100
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =/dev/sdaa1             extsz=262144 blocks=2439872256, rtextents=38123004

[root@m2 ~]# echo "frag -s -e" | xfs_db -r /dev/nvme0n1p1
xfs_db> actual 11851552, ideal 1264416, fragmentation factor 89.33%
Note, this number is largely meaningless.
Files on this filesystem average 9.37 extents per file
Maximum extents in a file 129956
Histogram of number of extents per file:
    bucket =       count        % of total
<=       1 =      331951        26.295 %
<=       2 =       82720        6.553 %
<=       4 =      160041        12.677 %
<=       8 =      205312        16.263 %
<=      16 =      267145        21.161 %
<=      32 =      197625        15.655 %
<=      64 =       17610        1.395 %
<=     128 =           8        0.001 %
<=     256 =           1        0.000 %
<=     512 =           0        0.000 %
<=    1024 =           0        0.000 %
<=    2048 =           0        0.000 %
<=    4096 =           0        0.000 %
<=    8192 =           0        0.000 %
<=   16384 =           0        0.000 %
<=   32768 =           0        0.000 %
<=   65536 =           0        0.000 %
<=  131072 =           1        0.000 %
Maximum file size 15.522 GB
Histogram of file size:
    bucket =    allocated           used        overhead(bytes)
<=    4 KB =           0            2054          8924143616 3.80%
<=    8 KB =           0           57684         14648967168 6.23%
<=   16 KB =           0           24280          6032441344 2.57%
<=   32 KB =           0           18351          4340473856 1.85%
<=   64 KB =           0           20064          4280770560 1.82%
<=  128 KB =        1002           25287          4138127360 1.76%
<=  256 KB =      163110           17548          1264742400 0.54%
<=  512 KB =       19898           19863          2843152384 1.21%
<=    1 MB =       32687           32617          4361404416 1.86%
<=    2 MB =       38395           38324          5388206080 2.29%
<=    4 MB =       82700           82633         10549821440 4.49%
<=    8 MB =      208576          208477         34238386176 14.57%
<=   16 MB =      715937          715092        134046113792 57.02%
<=   32 MB =         107             107             6332416 0.00%
<=   64 MB =           0               0                   0 0.00%
<=  128 MB =           1               1              157611 0.00%
<=  256 MB =           0               0                   0 0.00%
<=  512 MB =           0               0                   0 0.00%
<=    1 GB =           0               0                   0 0.00%
<=    2 GB =           0               0                   0 0.00%
<=    4 GB =           0               0                   0 0.00%
<=    8 GB =           0               0                   0 0.00%
<=   16 GB =           1               1                   0 0.00%
capacity used (bytes): 7679537216535 (6.984 TB)
capacity allocated (bytes): 7914608582656 (7.198 TB)
block overhead (bytes): 235071366121 (3.061 %)
xfs_db>

3) XFS with a realtime device configured with 1044 KiB extents:

[root@m3 ~]# xfs_info /mnt/d0
meta-data=/dev/sdb1              isize=2048   agcount=4, agsize=1041728 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=4166912, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =/dev/sdc1              extsz=1069056 blocks=1949338880, rtextents=7468731
[root@m3 ~]# echo "frag -s -e" | /tmp/xfs_db -r /dev/sdc1
xfs_db: /dev/sdc1 is not a valid XFS filesystem (unexpected SB magic number 0x68656164)
Use -F to force a read attempt.
[root@m3 ~]# echo "frag -s -e" | /tmp/xfs_db -r /dev/sdb1
xfs_db> actual 732480, ideal 360707, fragmentation factor 50.76%
Note, this number is largely meaningless.
Files on this filesystem average 2.03 extents per file
Maximum extents in a file 14
Histogram of number of extents per file:
    bucket =       count        % of total
<=       1 =      350934        97.696 %
<=       2 =        6231        1.735 %
<=       4 =        1001        0.279 %
<=       8 =         953        0.265 %
<=      16 =          92        0.026 %
Maximum file size 26.508 MB
Histogram of file size:
    bucket =    allocated           used        overhead(bytes)
<=    4 KB =           0              62           314048512 0.13%
<=    8 KB =           0          119911        127209263104 53.28%
<=   16 KB =           0           14543         15350194176 6.43%
<=   32 KB =         909           12330         11851161600 4.96%
<=   64 KB =          92            6704          6828642304 2.86%
<=  128 KB =           1            7132          6933372928 2.90%
<=  256 KB =           0           10013          8753799168 3.67%
<=  512 KB =           0           13616          9049227264 3.79%
<=    1 MB =           1           15056          4774912000 2.00%
<=    2 MB =      198662           17168          9690226688 4.06%
<=    4 MB =       28639           21073         11806654464 4.94%
<=    8 MB =       35169           29878         14200553472 5.95%
<=   16 MB =       95667           91633         11939287040 5.00%
<=   32 MB =          71              62            28471742 0.01%
capacity used (bytes): 1097735533058 (1022.346 GB)
capacity allocated (bytes): 1336497410048 (1.216 TB)
block overhead (bytes): 238761885182 (21.750 %)
xfs_db>

Signed-off-by: Jorge Guerra <jorgeguerra@fb.com>
---
 db/frag.c         | 210 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 man/man8/xfs_db.8 |   8 ++-
 2 files changed, 211 insertions(+), 7 deletions(-)

diff --git a/db/frag.c b/db/frag.c
index 91395234..5d569325 100644
--- a/db/frag.c
+++ b/db/frag.c
@@ -15,6 +15,31 @@
 #include "init.h"
 #include "malloc.h"
 
+#define	PERCENT(x, y)	(((double)(x) * 100)/(y))
+//#define	ARRAY_SIZE(a)	(sizeof((a))/sizeof((a)[0]))
+#define	BLOCKS_2_BYTES(b)	((b) << 12)
+#define	CLZ(n)	(__builtin_clzl(n))
+#define	CTZ(n)	(__builtin_ctzl(n))
+
+#define	N_BUCKETS	64
+
+typedef struct extentstats {
+	uint64_t	allocsize[N_BUCKETS + 1];
+	uint64_t	usedsize[N_BUCKETS + 1];
+	uint64_t	wastedsize[N_BUCKETS + 1];
+	uint64_t	maxfilesize;
+	uint64_t	logicalused;
+	uint64_t	physicalused;
+	uint64_t	wastedspace;
+	bool		realtime;
+} extentstats_t;
+
+typedef struct fileextstats {
+	uint64_t	extsbuckets[N_BUCKETS + 1];
+	uint64_t	maxexts;
+	uint64_t	numfiles;
+} fileextstats_t;
+
 typedef struct extent {
 	xfs_fileoff_t	startoff;
 	xfs_filblks_t	blockcount;
@@ -38,6 +63,10 @@ static int		qflag;
 static int		Rflag;
 static int		rflag;
 static int		vflag;
+static int		eflag;
+static extentstats_t	extstats;
+static int		sflag;
+static fileextstats_t	festats;
 
 typedef void	(*scan_lbtree_f_t)(struct xfs_btree_block *block,
 				   int			level,
@@ -49,7 +78,7 @@ typedef void	(*scan_sbtree_f_t)(struct xfs_btree_block *block,
 				   xfs_agf_t		*agf);
 
 static extmap_t		*extmap_alloc(xfs_extnum_t nex);
-static xfs_extnum_t	extmap_ideal(extmap_t *extmap);
+static xfs_extnum_t	extmap_ideal(extmap_t *extmap, uint64_t *fallocsize);
 static void		extmap_set_ext(extmap_t **extmapp, xfs_fileoff_t o,
 				       xfs_extlen_t c);
 static int		frag_f(int argc, char **argv);
@@ -77,9 +106,46 @@ static void		scanfunc_ino(struct xfs_btree_block *block, int level,
 
 static const cmdinfo_t	frag_cmd =
 	{ "frag", NULL, frag_f, 0, -1, 0,
-	  "[-a] [-d] [-f] [-l] [-q] [-R] [-r] [-v]",
+	  "[-a] [-d] [-e] [-f] [-l] [-q] [-R] [-r] [-s] [-v]",
 	  "get file fragmentation data", NULL };
 
+// IEC 2^10 standard prefixes
+static const char	iec_prefixes[] =
+	{ ' ', 'K', 'M', 'G', 'T', 'P', 'E', 'Z'};
+
+static double
+bytes_2_human(
+	uint64_t bytes,
+	int *iecprefix)
+{
+	double answer;
+	int i;
+
+	for (i = 0, answer = (double)bytes;
+		answer > 1024 && i < ARRAY_SIZE(iec_prefixes);
+		i++, answer /= 1024);
+	*iecprefix = i;
+
+	return answer;
+}
+
+static uint8_t
+get_bucket(
+	uint64_t	val)
+{
+	uint8_t	bucket;
+	uint8_t	msbidx = 63 - CLZ(val);
+	uint8_t	lsbidx = CTZ(val);
+
+	/*
+	 * The bucket is computed as ceiling(s, 2^CLZ(s)), but this method is
+	 * faster.
+	 */
+	bucket = msbidx + (msbidx != lsbidx ? 1 : 0);
+
+	return MIN(bucket, N_BUCKETS);
+}
+
 static extmap_t *
 extmap_alloc(
 	xfs_extnum_t	nex)
@@ -96,18 +162,23 @@ extmap_alloc(
 
 static xfs_extnum_t
 extmap_ideal(
-	extmap_t	*extmap)
+	extmap_t	*extmap,
+	uint64_t	*fallocsize)
 {
 	extent_t	*ep;
 	xfs_extnum_t	rval;
+	uint64_t	fsize = 0;
 
 	for (ep = &extmap->ents[0], rval = 0;
 	     ep < &extmap->ents[extmap->nents];
 	     ep++) {
+		fsize += BLOCKS_2_BYTES(ep->blockcount);
 		if (ep == &extmap->ents[0] ||
 		    ep->startoff != ep[-1].startoff + ep[-1].blockcount)
 			rval++;
 	}
+	*fallocsize = fsize;
+
 	return rval;
 }
 
@@ -133,6 +204,80 @@ extmap_set_ext(
 }
 
 void
+print_extents_histo(void)
+{
+	int		i;
+	int		nfiles = 0;
+
+	dbprintf(_("Maximum extents in a file %lu\n"), festats.maxexts);
+	dbprintf(_("Histogram of number of extents per file:\n"));
+	dbprintf(_("   %7s =\t%8s\t%s\n"), "bucket", "count", "\% of total");
+	for (i = 0;
+		i <= N_BUCKETS && nfiles < festats.numfiles; i++) {
+		nfiles += festats.extsbuckets[i];
+		if (nfiles == 0)
+			continue;
+		dbprintf(_("<= %7u = \t%8u\t%.3f \%\n"), 1 << i, festats.extsbuckets[i],
+			PERCENT(festats.extsbuckets[i], festats.numfiles));
+	}
+}
+
+void
+print_file_size_histo(void)
+{
+	double		answer;
+	int		i;
+	int		nfiles = 0;
+	int		ufiles = 0;
+
+	answer = bytes_2_human(extstats.maxfilesize, &i);
+	dbprintf(_("Maximum file size %.3f %cB\n"), answer, iec_prefixes[i]);
+	dbprintf(_("Histogram of file size:\n"));
+	if (extstats.realtime) {
+		dbprintf(_("   %7s =\t%8s\t%8s\t%12s\n"),
+		 "bucket", "allocated", "used", "overhead(bytes)");
+		for (i = 10; i <= N_BUCKETS && nfiles < festats.numfiles; i++) {
+			nfiles += extstats.allocsize[i];
+			ufiles += extstats.usedsize[i];
+			if (ufiles == 0)
+				continue;
+		dbprintf(_("<= %4u %cB =\t%8lu\t%8lu\t%12lu %.2f\%\n"), 1 << (i % 10),
+			iec_prefixes[i/10],
+			extstats.allocsize[i], extstats.usedsize[i],
+			extstats.wastedsize[i],
+			PERCENT(extstats.wastedsize[i], extstats.wastedspace));
+		}
+		answer = bytes_2_human(extstats.logicalused, &i);
+		dbprintf(_("capacity used (bytes): %llu (%.3f %cB)\n"),
+		extstats.logicalused, answer, iec_prefixes[i]);
+		answer = bytes_2_human(extstats.physicalused, &i);
+		dbprintf(_("capacity allocated (bytes): %llu (%.3f %cB)\n"),
+			extstats.physicalused, answer, iec_prefixes[i]);
+		answer = PERCENT(extstats.wastedspace, extstats.logicalused);
+	} else {
+		dbprintf(_("   %7s =\t%8s\t%12s\n"),
+		 "bucket", "used", "overhead(bytes)");
+		for (i = 10; i <= N_BUCKETS && nfiles < festats.numfiles; i++) {
+			nfiles += extstats.allocsize[i];
+			ufiles += extstats.usedsize[i];
+			if (ufiles == 0)
+				continue;
+		dbprintf(_("<= %4u %cB =\t%8lu\t%12lu %.2f\%\n"), 1 << (i % 10),
+			iec_prefixes[i/10],
+			extstats.allocsize[i],
+			extstats.wastedsize[i],
+			PERCENT(extstats.wastedsize[i], extstats.wastedspace));
+		}
+		answer = bytes_2_human(extstats.physicalused, &i);
+		dbprintf(_("capacity used (bytes): %llu (%.3f %cB)\n"),
+			extstats.physicalused, answer, iec_prefixes[i]);
+		answer = PERCENT(extstats.wastedspace, extstats.physicalused);
+	}
+	dbprintf(_("block overhead (bytes): %llu (%.3f \%)\n"),
+		extstats.wastedspace, answer);
+}
+
+void
 frag_init(void)
 {
 	add_command(&frag_cmd);
@@ -164,6 +309,12 @@ frag_f(
 	answer = (double)extcount_actual / (double)extcount_ideal;
 	dbprintf(_("Files on this filesystem average %.2f extents per file\n"),
 		answer);
+	if (eflag) {
+		print_extents_histo();
+	}
+	if (sflag) {
+		print_file_size_histo();
+	}
 	return 0;
 }
 
@@ -174,9 +325,10 @@ init(
 {
 	int		c;
 
-	aflag = dflag = fflag = lflag = qflag = Rflag = rflag = vflag = 0;
+	aflag = dflag = eflag = fflag = lflag = qflag = Rflag =
+		rflag = sflag = vflag = 0;
 	optind = 0;
-	while ((c = getopt(argc, argv, "adflqRrv")) != EOF) {
+	while ((c = getopt(argc, argv, "adeflqRrsv")) != EOF) {
 		switch (c) {
 		case 'a':
 			aflag = 1;
@@ -184,6 +336,9 @@ init(
 		case 'd':
 			dflag = 1;
 			break;
+		case 'e':
+			eflag = 1;
+			break;
 		case 'f':
 			fflag = 1;
 			break;
@@ -199,6 +354,9 @@ init(
 		case 'r':
 			rflag = 1;
 			break;
+		case 's':
+			sflag = 1;
+			break;
 		case 'v':
 			vflag = 1;
 			break;
@@ -210,6 +368,8 @@ init(
 	if (!aflag && !dflag && !fflag && !lflag && !qflag && !Rflag && !rflag)
 		aflag = dflag = fflag = lflag = qflag = Rflag = rflag = 1;
 	extcount_actual = extcount_ideal = 0;
+	memset(&extstats, 0 , sizeof(extstats));
+	memset(&festats, 0 , sizeof(festats));
 	return 1;
 }
 
@@ -274,6 +434,10 @@ process_fork(
 {
 	extmap_t	*extmap;
 	int		nex;
+	int	bucket;
+	uint64_t	fallocsize;
+	uint64_t	fusedsize;
+	uint64_t	fwastedsize;
 
 	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
 	if (!nex)
@@ -288,7 +452,41 @@ process_fork(
 		break;
 	}
 	extcount_actual += extmap->nents;
-	extcount_ideal += extmap_ideal(extmap);
+	extcount_ideal += extmap_ideal(extmap, &fallocsize);
+
+	if (sflag) {
+		// Record file size stats
+		fusedsize = be64_to_cpu(dip->di_size);
+		bucket = get_bucket(fallocsize);
+		extstats.allocsize[bucket]++;
+		bucket = get_bucket(fusedsize);
+		extstats.usedsize[bucket]++;
+
+		if (fallocsize > fusedsize) {
+			fwastedsize = fallocsize - fusedsize;
+			extstats.wastedspace += fwastedsize;
+			extstats.wastedsize[bucket] += fwastedsize;
+		}
+		extstats.logicalused += fusedsize;
+		extstats.physicalused += fallocsize;
+		extstats.maxfilesize = MAX(extstats.maxfilesize, fallocsize);
+		if (be16_to_cpu(dip->di_flags) & XFS_DIFLAG_REALTIME) {
+			extstats.realtime = true;
+		}
+	}
+
+	if (eflag) {
+		// Record file extent stats
+		bucket = get_bucket(extmap->nents);
+		if (be16_to_cpu(dip->di_flags) & XFS_DIFLAG_REALTIME) {
+			// Realtime inodes have an additional extent
+			bucket = get_bucket(MAX(extmap->nents - 1, 1));
+		}
+		festats.extsbuckets[bucket]++;
+		festats.maxexts = MAX(festats.maxexts, extmap->nents);
+	}
+	festats.numfiles++;
+
 	xfree(extmap);
 }
 
diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
index a1ee3514..52d5f18a 100644
--- a/man/man8/xfs_db.8
+++ b/man/man8/xfs_db.8
@@ -489,7 +489,7 @@ command.
 .B forward
 Move forward to the next entry in the position ring.
 .TP
-.B frag [\-adflqRrv]
+.B frag [\-adeflqRrsv]
 Get file fragmentation data. This prints information about fragmentation
 of file data in the filesystem (as opposed to fragmentation of freespace,
 for which see the
@@ -510,6 +510,9 @@ enables processing of attribute data.
 .B \-d
 enables processing of directory data.
 .TP
+.B \-e
+enables computing extent count per inode histogram.
+.TP
 .B \-f
 enables processing of regular file data.
 .TP
@@ -524,6 +527,9 @@ enables processing of realtime control file data.
 .TP
 .B \-r
 enables processing of realtime file data.
+.TP
+.B \-s
+enables computing file size histogram and file system overheads.
 .RE
 .TP
 .BI "freesp [\-bcds] [\-A " alignment "] [\-a " ag "] ... [\-e " i "] [\-h " h1 "] ... [\-m " m ]
-- 
2.13.5

* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-14 18:50 [PATCH] xfs_db: add extent count and file size histograms Jorge Guerra
@ 2019-05-14 19:52 ` Eric Sandeen
  2019-05-14 20:02 ` Eric Sandeen
  2019-05-14 23:31 ` Dave Chinner
  2 siblings, 0 replies; 14+ messages in thread
From: Eric Sandeen @ 2019-05-14 19:52 UTC (permalink / raw)
  To: Jorge Guerra, linux-xfs; +Cc: osandov, Jorge Guerra

On 5/14/19 1:50 PM, Jorge Guerra wrote:
> From: Jorge Guerra <jorgeguerra@fb.com>
> 
> In this change we add two feature to the xfs_db 'frag' command:
> 
> 1) Extent count histogram [-e]: This option enables tracking the
>    number of extents per inode (file) as the we traverse the file
>    system.  The end result is a histogram of the number of extents per
>    file in power of 2 buckets.
> 
> 2) File size histogram and file system internal fragmentation stats
>    [-s]: This option enables tracking file sizes both in terms of what
>    has been physically allocated and how much has been written to the
>    file.  In addition, we track the amount of internal fragmentation
>    seen per file.  This is particularly useful in the case of real
>    time devices where space is allocated in units of fixed sized
>    extents.
> 
> The man page for xfs_db has been updated to reflect these new command
> line arguments.
> 
> Tests:
> 
> We tested this change on several XFS file systems with different
> configurations:
> 
> 1) regular XFS:
> 
> [root@m1 ~]# xfs_info /mnt/d0
> meta-data=/dev/sdb1              isize=256    agcount=10, agsize=268435455 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2441608704, imaxpct=100
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> [root@m1 ~]# echo "frag -e -s" | xfs_db -r /dev/sdb1
> xfs_db> actual 494393, ideal 489246, fragmentation factor 1.04%
> Note, this number is largely meaningless.
> Files on this filesystem average 1.01 extents per file
> Maximum extents in a file 17
> Histogram of number of extents per file:
>     bucket =       count        % of total
> <=       1 =      486157        99.573 %
> <=       2 =         768        0.157 %
> <=       4 =         371        0.076 %
> <=       8 =         947        0.194 %
> <=      16 =           0        0.000 %
> <=      32 =           1        0.000 %
> Maximum file size 64.512 MB

One thing to note here is that by default, frag is collecting stats on everything -
files, dirs, symlinks, and even attributes.  That may not be obvious, and it
may do interesting things to your stats.  You can always pick "only file data"
with the -f argument.
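
For instance (illustrative invocation only, reusing the device from the
first test above), restricting the scan to regular file data while still
collecting both histograms:

  # echo "frag -f -e -s" | xfs_db -r /dev/sdb1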


mostly cosmetic nitpicks below, though one technical point is that it's wrong
to assume 4k blocks as you seem to have done.

> Histogram of file size:
>     bucket =        used        overhead(bytes)
> <=    4 KB =      180515                   0 0.00%
> <=    8 KB =       23604          4666970112 44.31%
> <=   16 KB =        2712          1961668608 18.62%
> <=   32 KB =        1695           612319232 5.81%
> <=   64 KB =         290           473210880 4.49%
> <=  128 KB =         214           270184448 2.56%
> <=  256 KB =         186           269856768 2.56%
> <=  512 KB =         201            67203072 0.64%
> <=    1 MB =         325           267558912 2.54%
> <=    2 MB =         419           596860928 5.67%
> <=    4 MB =         436           454148096 4.31%
> <=    8 MB =        1864           184532992 1.75%
> <=   16 MB =       16084           111964160 1.06%
> <=   32 MB =      258910           395116544 3.75%
> <=   64 MB =          61           202104832 1.92%
> <=  128 MB =         728                   0 0.00%
> capacity used (bytes): 7210847514624 (6.558 TB)
> block overhead (bytes): 10533699584 (0.146 %)
> xfs_db>
> 
> 2) XFS with a realtime device configured with 256 KiB extents:
> 
> [root@m2 ~]# xfs_info /mnt/d0
> meta-data=/dev/nvme0n1p1         isize=2048   agcount=15, agsize=434112 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=6104576, imaxpct=100
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =/dev/sdaa1             extsz=262144 blocks=2439872256, rtextents=38123004
> 
> [root@m2 ~]# echo "frag -s -e" | xfs_db -r /dev/nvme0n1p1
> xfs_db> actual 11851552, ideal 1264416, fragmentation factor 89.33%
> Note, this number is largely meaningless.
> Files on this filesystem average 9.37 extents per file
> Maximum extents in a file 129956
> Histogram of number of extents per file:
>     bucket =       count        % of total
> <=       1 =      331951        26.295 %
> <=       2 =       82720        6.553 %
> <=       4 =      160041        12.677 %
> <=       8 =      205312        16.263 %
> <=      16 =      267145        21.161 %
> <=      32 =      197625        15.655 %
> <=      64 =       17610        1.395 %
> <=     128 =           8        0.001 %
> <=     256 =           1        0.000 %
> <=     512 =           0        0.000 %
> <=    1024 =           0        0.000 %
> <=    2048 =           0        0.000 %
> <=    4096 =           0        0.000 %
> <=    8192 =           0        0.000 %
> <=   16384 =           0        0.000 %
> <=   32768 =           0        0.000 %
> <=   65536 =           0        0.000 %
> <=  131072 =           1        0.000 %
> Maximum file size 15.522 GB
> Histogram of file size:
>     bucket =    allocated           used        overhead(bytes)
> <=    4 KB =           0            2054          8924143616 3.80%
> <=    8 KB =           0           57684         14648967168 6.23%
> <=   16 KB =           0           24280          6032441344 2.57%
> <=   32 KB =           0           18351          4340473856 1.85%
> <=   64 KB =           0           20064          4280770560 1.82%
> <=  128 KB =        1002           25287          4138127360 1.76%
> <=  256 KB =      163110           17548          1264742400 0.54%
> <=  512 KB =       19898           19863          2843152384 1.21%
> <=    1 MB =       32687           32617          4361404416 1.86%
> <=    2 MB =       38395           38324          5388206080 2.29%
> <=    4 MB =       82700           82633         10549821440 4.49%
> <=    8 MB =      208576          208477         34238386176 14.57%
> <=   16 MB =      715937          715092        134046113792 57.02%
> <=   32 MB =         107             107             6332416 0.00%
> <=   64 MB =           0               0                   0 0.00%
> <=  128 MB =           1               1              157611 0.00%
> <=  256 MB =           0               0                   0 0.00%
> <=  512 MB =           0               0                   0 0.00%
> <=    1 GB =           0               0                   0 0.00%
> <=    2 GB =           0               0                   0 0.00%
> <=    4 GB =           0               0                   0 0.00%
> <=    8 GB =           0               0                   0 0.00%
> <=   16 GB =           1               1                   0 0.00%
> capacity used (bytes): 7679537216535 (6.984 TB)
> capacity allocated (bytes): 7914608582656 (7.198 TB)
> block overhead (bytes): 235071366121 (3.061 %)
> xfs_db>
> 
> 3) XFS with a realtime device configured with 1044 KiB extents:
> 
> [root@m3 ~]# xfs_info /mnt/d0
> meta-data=/dev/sdb1              isize=2048   agcount=4, agsize=1041728 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=4166912, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =/dev/sdc1              extsz=1069056 blocks=1949338880, rtextents=7468731
> [root@m3 ~]# echo "frag -s -e" | /tmp/xfs_db -r /dev/sdc1
> xfs_db: /dev/sdc1 is not a valid XFS filesystem (unexpected SB magic number 0x68656164)
> Use -F to force a read attempt.
> [root@m3 ~]# echo "frag -s -e" | /tmp/xfs_db -r /dev/sdb1
> xfs_db> actual 732480, ideal 360707, fragmentation factor 50.76%
> Note, this number is largely meaningless.
> Files on this filesystem average 2.03 extents per file
> Maximum extents in a file 14
> Histogram of number of extents per file:
>     bucket =       count        % of total
> <=       1 =      350934        97.696 %
> <=       2 =        6231        1.735 %
> <=       4 =        1001        0.279 %
> <=       8 =         953        0.265 %
> <=      16 =          92        0.026 %
> Maximum file size 26.508 MB
> Histogram of file size:
>     bucket =    allocated           used        overhead(bytes)
> <=    4 KB =           0              62           314048512 0.13%
> <=    8 KB =           0          119911        127209263104 53.28%
> <=   16 KB =           0           14543         15350194176 6.43%
> <=   32 KB =         909           12330         11851161600 4.96%
> <=   64 KB =          92            6704          6828642304 2.86%
> <=  128 KB =           1            7132          6933372928 2.90%
> <=  256 KB =           0           10013          8753799168 3.67%
> <=  512 KB =           0           13616          9049227264 3.79%
> <=    1 MB =           1           15056          4774912000 2.00%
> <=    2 MB =      198662           17168          9690226688 4.06%
> <=    4 MB =       28639           21073         11806654464 4.94%
> <=    8 MB =       35169           29878         14200553472 5.95%
> <=   16 MB =       95667           91633         11939287040 5.00%
> <=   32 MB =          71              62            28471742 0.01%
> capacity used (bytes): 1097735533058 (1022.346 GB)
> capacity allocated (bytes): 1336497410048 (1.216 TB)
> block overhead (bytes): 238761885182 (21.750 %)

ok if you do this you have to handle the RFE for file tail packing too ;)

> xfs_db>
> 
> Signed-off-by: Jorge Guerra <jorgeguerra@fb.com>
> ---
>  db/frag.c         | 210 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  man/man8/xfs_db.8 |   8 ++-
>  2 files changed, 211 insertions(+), 7 deletions(-)
> 
> diff --git a/db/frag.c b/db/frag.c
> index 91395234..5d569325 100644
> --- a/db/frag.c
> +++ b/db/frag.c
> @@ -15,6 +15,31 @@
>  #include "init.h"
>  #include "malloc.h"
>  
> +#define	PERCENT(x, y)	(((double)(x) * 100)/(y))
> +//#define	ARRAY_SIZE(a)	(sizeof((a))/sizeof((a)[0]))

no need to add commented-out new #defines

> +#define	BLOCKS_2_BYTES(b)	((b) << 12)

only for 4k blocks, right?

I think you want to use

XFS_FSB_TO_B(mp,fsbno) though getting mp might be fun
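
A minimal sketch of what that substitution might look like (assuming the
global xfs_mount_t *mp that xfs_db declares in init.h; illustrative only,
not tested):

/* convert fs blocks to bytes using the on-disk block size,
 * rather than hardcoding 4 KiB blocks */
#define BLOCKS_2_BYTES(b)	((uint64_t)(b) << mp->m_sb.sb_blocklog)
/* or, equivalently, XFS_FSB_TO_B(mp, (b)) */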

> +#define	CLZ(n)	(__builtin_clzl(n))
> +#define	CTZ(n)	(__builtin_ctzl(n))
> +
> +#define	N_BUCKETS	64
> +
> +typedef struct extentstats {
> +	uint64_t	allocsize[N_BUCKETS + 1];
> +	uint64_t	usedsize[N_BUCKETS + 1];
> +	uint64_t	wastedsize[N_BUCKETS + 1];
> +	uint64_t	maxfilesize;
> +	uint64_t	logicalused;
> +	uint64_t	physicalused;
> +	uint64_t	wastedspace;
> +	bool		realtime;
> +} extentstats_t;
> +
> +typedef struct fileextstats {
> +	uint64_t	extsbuckets[N_BUCKETS + 1];
> +	uint64_t	maxexts;
> +	uint64_t	numfiles;
> +} fileextstats_t;
> +
>  typedef struct extent {
>  	xfs_fileoff_t	startoff;
>  	xfs_filblks_t	blockcount;
> @@ -38,6 +63,10 @@ static int		qflag;
>  static int		Rflag;
>  static int		rflag;
>  static int		vflag;

up until here it seems like declarations were ~alphabetical; I'd keep
it that way, or group all the flags together, vs. the randomness
you've introduced below.  </nitpick>

> +static int		eflag;
> +static extentstats_t	extstats;
> +static int		sflag;
> +static fileextstats_t	festats;

We've been trying to avoid typedefs where we don't need them,
I /think/ just "struct extentstats	extstats;" would
be preferred here.  (yes there are typedefs in the code but we've
been trying to move the other way)
 
>  typedef void	(*scan_lbtree_f_t)(struct xfs_btree_block *block,
>  				   int			level,
> @@ -49,7 +78,7 @@ typedef void	(*scan_sbtree_f_t)(struct xfs_btree_block *block,
>  				   xfs_agf_t		*agf);
>  
>  static extmap_t		*extmap_alloc(xfs_extnum_t nex);
> -static xfs_extnum_t	extmap_ideal(extmap_t *extmap);
> +static xfs_extnum_t	extmap_ideal(extmap_t *extmap, uint64_t *fallocsize);
>  static void		extmap_set_ext(extmap_t **extmapp, xfs_fileoff_t o,
>  				       xfs_extlen_t c);
>  static int		frag_f(int argc, char **argv);
> @@ -77,9 +106,46 @@ static void		scanfunc_ino(struct xfs_btree_block *block, int level,
>  
>  static const cmdinfo_t	frag_cmd =
>  	{ "frag", NULL, frag_f, 0, -1, 0,
> -	  "[-a] [-d] [-f] [-l] [-q] [-R] [-r] [-v]",
> +	  "[-a] [-d] [-e] [-f] [-l] [-q] [-R] [-r] [-s] [-v]",

Heh, it might be time for

> +	  "[-adeflqRrsv]",

:)

>  	  "get file fragmentation data", NULL };
>  
> +// IEC 2^10 standard prefixes

/* C comments please */

> +static const char	iec_prefixes[] =
> +	{ ' ', 'K', 'M', 'G', 'T', 'P', 'E', 'Z'};
> +
> +static double
> +bytes_2_human(
> +	uint64_t bytes,
> +	int *iecprefix)
> +{
> +	double answer;
> +	int i;

+bytes_2_human(
+	uint64_t	bytes,
+	int		*iecprefix)
+{
+	double		answer;
+	int		i;

> +
> +	for (i = 0, answer = (double)bytes;
> +		answer > 1024 && i < ARRAY_SIZE(iec_prefixes);
> +		i++, answer /= 1024);
> +	*iecprefix = i;
> +
> +	return answer;
> +}
> +
> +static uint8_t
> +get_bucket(
> +	uint64_t	val)
> +{
> +	uint8_t	bucket;
> +	uint8_t	msbidx = 63 - CLZ(val);
> +	uint8_t	lsbidx = CTZ(val);

+	uint8_t		bucket;
+	uint8_t		msbidx = 63 - CLZ(val);
+	uint8_t		lsbidx = CTZ(val);

> +
> +	/*
> +	 * The bucket is computed as ceiling(s, 2^CLZ(s)), but this method is
> +	 * faster.
> +	 */
> +	bucket = msbidx + (msbidx != lsbidx ? 1 : 0);
> +
> +	return MIN(bucket, N_BUCKETS);
> +}
> +
>  static extmap_t *
>  extmap_alloc(
>  	xfs_extnum_t	nex)
> @@ -96,18 +162,23 @@ extmap_alloc(
>  
>  static xfs_extnum_t
>  extmap_ideal(
> -	extmap_t	*extmap)
> +	extmap_t	*extmap,
> +	uint64_t	*fallocsize)

"fallocsize" is a little bit of an odd choice given the existence
of "falloc" and "fallocate" - which are unrelated here.

maybe f_allocsize / f_usedsize / f_wastedsize?  Not sure.

>  {
>  	extent_t	*ep;
>  	xfs_extnum_t	rval;
> +	uint64_t	fsize = 0;
>  
>  	for (ep = &extmap->ents[0], rval = 0;
>  	     ep < &extmap->ents[extmap->nents];
>  	     ep++) {
> +		fsize += BLOCKS_2_BYTES(ep->blockcount);

XFS_FSB_TO_B(mp, ep->blockcount) except of course you don't have mp...
Could also use a file stat to get allocated blocks all at once, but
it's otherwise convenient here I suppose...

>  		if (ep == &extmap->ents[0] ||
>  		    ep->startoff != ep[-1].startoff + ep[-1].blockcount)
>  			rval++;
>  	}
> +	*fallocsize = fsize;
> +
>  	return rval;
>  }
>  
> @@ -133,6 +204,80 @@ extmap_set_ext(
>  }
>  
>  void
> +print_extents_histo(void)
> +{
> +	int		i;
> +	int		nfiles = 0;
> +
> +	dbprintf(_("Maximum extents in a file %lu\n"), festats.maxexts);
> +	dbprintf(_("Histogram of number of extents per file:\n"));
> +	dbprintf(_("   %7s =\t%8s\t%s\n"), "bucket", "count", "\% of total");
> +	for (i = 0;
> +		i <= N_BUCKETS && nfiles < festats.numfiles; i++) {
> +		nfiles += festats.extsbuckets[i];
> +		if (nfiles == 0)
> +			continue;
> +		dbprintf(_("<= %7u = \t%8u\t%.3f \%\n"), 1 << i, festats.extsbuckets[i],

<= 80 cols please

> +			PERCENT(festats.extsbuckets[i], festats.numfiles));
> +	}
> +}
> +
> +void
> +print_file_size_histo(void)
> +{
> +	double		answer;
> +	int		i;
> +	int		nfiles = 0;
> +	int		ufiles = 0;
> +
> +	answer = bytes_2_human(extstats.maxfilesize, &i);
> +	dbprintf(_("Maximum file size %.3f %cB\n"), answer, iec_prefixes[i]);
> +	dbprintf(_("Histogram of file size:\n"));
> +	if (extstats.realtime) {
> +		dbprintf(_("   %7s =\t%8s\t%8s\t%12s\n"),
> +		 "bucket", "allocated", "used", "overhead(bytes)");

+		dbprintf(_("   %7s =\t%8s\t%8s\t%12s\n"),
+			 "bucket", "allocated", "used", "overhead(bytes)");

> +		for (i = 10; i <= N_BUCKETS && nfiles < festats.numfiles; i++) {
> +			nfiles += extstats.allocsize[i];
> +			ufiles += extstats.usedsize[i];
> +			if (ufiles == 0)
> +				continue;
> +		dbprintf(_("<= %4u %cB =\t%8lu\t%8lu\t%12lu %.2f\%\n"), 1 << (i % 10),

Please do your best to keep lines <= 80 cols

> +			iec_prefixes[i/10],
> +			extstats.allocsize[i], extstats.usedsize[i],
> +			extstats.wastedsize[i],
> +			PERCENT(extstats.wastedsize[i], extstats.wastedspace));
> +		}
> +		answer = bytes_2_human(extstats.logicalused, &i);
> +		dbprintf(_("capacity used (bytes): %llu (%.3f %cB)\n"),
> +		extstats.logicalused, answer, iec_prefixes[i]);
> +		answer = bytes_2_human(extstats.physicalused, &i);
> +		dbprintf(_("capacity allocated (bytes): %llu (%.3f %cB)\n"),
> +			extstats.physicalused, answer, iec_prefixes[i]);
> +		answer = PERCENT(extstats.wastedspace, extstats.logicalused);
> +	} else {
> +		dbprintf(_("   %7s =\t%8s\t%12s\n"),
> +		 "bucket", "used", "overhead(bytes)");

+		dbprintf(_("   %7s =\t%8s\t%12s\n"),
+			 "bucket", "used", "overhead(bytes)");

(keep the continued printf lines indented enough to make it obvious)

> +		for (i = 10; i <= N_BUCKETS && nfiles < festats.numfiles; i++) {
> +			nfiles += extstats.allocsize[i];
> +			ufiles += extstats.usedsize[i];
> +			if (ufiles == 0)
> +				continue;
> +		dbprintf(_("<= %4u %cB =\t%8lu\t%12lu %.2f\%\n"), 1 << (i % 10),
> +			iec_prefixes[i/10],
> +			extstats.allocsize[i],
> +			extstats.wastedsize[i],
> +			PERCENT(extstats.wastedsize[i], extstats.wastedspace));
> +		}
> +		answer = bytes_2_human(extstats.physicalused, &i);
> +		dbprintf(_("capacity used (bytes): %llu (%.3f %cB)\n"),
> +			extstats.physicalused, answer, iec_prefixes[i]);
> +		answer = PERCENT(extstats.wastedspace, extstats.physicalused);
> +	}
> +	dbprintf(_("block overhead (bytes): %llu (%.3f \%)\n"),
> +		extstats.wastedspace, answer);
> +}
> +
> +void
>  frag_init(void)
>  {
>  	add_command(&frag_cmd);
> @@ -164,6 +309,12 @@ frag_f(
>  	answer = (double)extcount_actual / (double)extcount_ideal;
>  	dbprintf(_("Files on this filesystem average %.2f extents per file\n"),
>  		answer);
> +	if (eflag) {
> +		print_extents_histo();
> +	}
> +	if (sflag) {
> +		print_file_size_histo();
> +	}

+	if (eflag)
+		print_extents_histo();
+	if (sflag)
+		print_file_size_histo();

is fine, we don't generally curly-brace single lines (when in Rome...)

>  	return 0;
>  }
>  
> @@ -174,9 +325,10 @@ init(
>  {
>  	int		c;
>  
> -	aflag = dflag = fflag = lflag = qflag = Rflag = rflag = vflag = 0;
> +	aflag = dflag = eflag = fflag = lflag = qflag = Rflag =
> +		rflag = sflag = vflag = 0;

I'd prefer to not split the line:

+	aflag = dflag = eflag = fflag = lflag = qflag = Rflag = 0;
+	rflag = sflag = vflag = 0;

>  	optind = 0;
> -	while ((c = getopt(argc, argv, "adflqRrv")) != EOF) {
> +	while ((c = getopt(argc, argv, "adeflqRrsv")) != EOF) {
>  		switch (c) {
>  		case 'a':
>  			aflag = 1;
> @@ -184,6 +336,9 @@ init(
>  		case 'd':
>  			dflag = 1;
>  			break;
> +		case 'e':
> +			eflag = 1;
> +			break;
>  		case 'f':
>  			fflag = 1;
>  			break;
> @@ -199,6 +354,9 @@ init(
>  		case 'r':
>  			rflag = 1;
>  			break;
> +		case 's':
> +			sflag = 1;
> +			break;
>  		case 'v':
>  			vflag = 1;
>  			break;
> @@ -210,6 +368,8 @@ init(
>  	if (!aflag && !dflag && !fflag && !lflag && !qflag && !Rflag && !rflag)
>  		aflag = dflag = fflag = lflag = qflag = Rflag = rflag = 1;
>  	extcount_actual = extcount_ideal = 0;
> +	memset(&extstats, 0 , sizeof(extstats));
> +	memset(&festats, 0 , sizeof(festats));

No space before , :

+	memset(&extstats, 0, sizeof(extstats));
+	memset(&festats, 0, sizeof(festats));

>  	return 1;
>  }
>  
> @@ -274,6 +434,10 @@ process_fork(
>  {
>  	extmap_t	*extmap;
>  	int		nex;
> +	int	bucket;
> +	uint64_t	fallocsize;
> +	uint64_t	fusedsize;
> +	uint64_t	fwastedsize;

+	int		bucket;
+	uint64_t	fallocsize;
+	uint64_t	fusedsize;
+	uint64_t	fwastedsize;

>  
>  	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
>  	if (!nex)
> @@ -288,7 +452,41 @@ process_fork(
>  		break;
>  	}
>  	extcount_actual += extmap->nents;
> -	extcount_ideal += extmap_ideal(extmap);
> +	extcount_ideal += extmap_ideal(extmap, &fallocsize);
> +
> +	if (sflag) {
> +		// Record file size stats

/* C comments please */

> +		fusedsize = be64_to_cpu(dip->di_size);
> +		bucket = get_bucket(fallocsize);
> +		extstats.allocsize[bucket]++;
> +		bucket = get_bucket(fusedsize);
> +		extstats.usedsize[bucket]++;
> +
> +		if (fallocsize > fusedsize) {
> +			fwastedsize = fallocsize - fusedsize;
> +			extstats.wastedspace += fwastedsize;
> +			extstats.wastedsize[bucket] += fwastedsize;
> +		}
> +		extstats.logicalused += fusedsize;
> +		extstats.physicalused += fallocsize;
> +		extstats.maxfilesize = MAX(extstats.maxfilesize, fallocsize);
> +		if (be16_to_cpu(dip->di_flags) & XFS_DIFLAG_REALTIME) {
> +			extstats.realtime = true;
> +		}
> +	}
> +
> +	if (eflag) {
> +		// Record file extent stats

/* C comments ... */

> +		bucket = get_bucket(extmap->nents);
> +		if (be16_to_cpu(dip->di_flags) & XFS_DIFLAG_REALTIME) {
> +			// Realtime inodes have an additional extent
> +			bucket = get_bucket(MAX(extmap->nents - 1, 1));
> +		}
> +		festats.extsbuckets[bucket]++;
> +		festats.maxexts = MAX(festats.maxexts, extmap->nents);
> +	}
> +	festats.numfiles++;
> +
>  	xfree(extmap);
>  }
>  
> diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
> index a1ee3514..52d5f18a 100644
> --- a/man/man8/xfs_db.8
> +++ b/man/man8/xfs_db.8
> @@ -489,7 +489,7 @@ command.
>  .B forward
>  Move forward to the next entry in the position ring.
>  .TP
> -.B frag [\-adflqRrv]
> +.B frag [\-adeflqRrsv]
>  Get file fragmentation data. This prints information about fragmentation
>  of file data in the filesystem (as opposed to fragmentation of freespace,
>  for which see the
> @@ -510,6 +510,9 @@ enables processing of attribute data.
>  .B \-d
>  enables processing of directory data.
>  .TP
> +.B \-e
> +enables computing extent count per inode histogram.
> +.TP
>  .B \-f
>  enables processing of regular file data.
>  .TP
> @@ -524,6 +527,9 @@ enables processing of realtime control file data.
>  .TP
>  .B \-r
>  enables processing of realtime file data.
> +.TP
> +.B \-s
> +enables computing file size histogram and file system overheads.
>  .RE
>  .TP
>  .BI "freesp [\-bcds] [\-A " alignment "] [\-a " ag "] ... [\-e " i "] [\-h " h1 "] ... [\-m " m ]
> 

* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-14 18:50 [PATCH] xfs_db: add extent count and file size histograms Jorge Guerra
  2019-05-14 19:52 ` Eric Sandeen
@ 2019-05-14 20:02 ` Eric Sandeen
  2019-05-15 15:57   ` Jorge Guerra
  2019-05-14 23:31 ` Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2019-05-14 20:02 UTC (permalink / raw)
  To: Jorge Guerra, linux-xfs; +Cc: osandov, Jorge Guerra

On 5/14/19 1:50 PM, Jorge Guerra wrote:
> +		dbprintf(_("capacity used (bytes): %llu (%.3f %cB)\n"),
> +		extstats.logicalused, answer, iec_prefixes[i]);

I think I missed this instance of "indent please" and probably others...

(I'm kind of wondering about carrying 'used' in bytes, but I suppose we're
ok until we really get zettabyte filesystems in the wild) ;)

-Eric

* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-14 18:50 [PATCH] xfs_db: add extent count and file size histograms Jorge Guerra
  2019-05-14 19:52 ` Eric Sandeen
  2019-05-14 20:02 ` Eric Sandeen
@ 2019-05-14 23:31 ` Dave Chinner
  2019-05-15  0:06   ` Eric Sandeen
  2019-05-15 16:15   ` Jorge Guerra
  2 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2019-05-14 23:31 UTC (permalink / raw)
  To: Jorge Guerra; +Cc: linux-xfs, osandov, Jorge Guerra

On Tue, May 14, 2019 at 11:50:26AM -0700, Jorge Guerra wrote:
> From: Jorge Guerra <jorgeguerra@fb.com>
> 
> In this change we add two feature to the xfs_db 'frag' command:
> 
> 1) Extent count histogram [-e]: This option enables tracking the
>    number of extents per inode (file) as the we traverse the file
>    system.  The end result is a histogram of the number of extents per
>    file in power of 2 buckets.
> 
> 2) File size histogram and file system internal fragmentation stats
>    [-s]: This option enables tracking file sizes both in terms of what
>    has been physically allocated and how much has been written to the
>    file.  In addition, we track the amount of internal fragmentation
>    seen per file.  This is particularly useful in the case of real
>    time devices where space is allocated in units of fixed sized
>    extents.

I can see the usefulness of having such information, but xfs_db is
the wrong tool/interface for generating such usage reports.

> The man page for xfs_db has been updated to reflect these new command
> line arguments.
> 
> Tests:
> 
> We tested this change on several XFS file systems with different
> configurations:
> 
> 1) regular XFS:
> 
> [root@m1 ~]# xfs_info /mnt/d0
> meta-data=/dev/sdb1              isize=256    agcount=10, agsize=268435455 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=2441608704, imaxpct=100
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> [root@m1 ~]# echo "frag -e -s" | xfs_db -r /dev/sdb1
> xfs_db> actual 494393, ideal 489246, fragmentation factor 1.04%

For example, xfs_db is not the right tool for probing online, active
filesystems. It is not coherent with the active kernel filesystem,
and is quite capable of walking off into la-la land as a result of
mis-parsing the inconsistent filesystem that is on disk underneath
active mounted filesystems. This does not make for a robust, usable
tool, let alone one that can make use of things like rmap for
querying usage and ownership information really quickly.

To solve this problem, we now have the xfs_spaceman tool and the
GETFSMAP ioctl for running usage queries on mounted filesystems.
That avoids all the coherency and crash problems, and for rmap
enabled filesystems it does not require scanning the entire
filesystem to work out this information (i.e. it can all be derived
from the contents of the rmap tree).

So I'd much prefer that new online filesystem queries go into
xfs_spaceman and use GETFSMAP so they can be accelerated on rmap
configured filesystems rather than hoping xfs_db will parse the
entire mounted filesystem correctly while it is being actively
changed...
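
(For illustration only, not part of the patch under review: a minimal
sketch of driving the GETFSMAP ioctl that xfs_spaceman builds on,
assuming a kernel and headers that provide linux/fsmap.h.  It walks the
mapping records of a mounted filesystem and totals the mapped bytes,
skipping free-space records.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>

int main(int argc, char **argv)
{
	struct fsmap_head *head;
	unsigned long long used = 0;
	unsigned int i, nr = 128;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	head = calloc(1, fsmap_sizeof(nr));
	if (!head) {
		perror("calloc");
		return 1;
	}
	head->fmh_count = nr;
	/* zeroed low key plus all-ones high key == whole filesystem */
	memset(&head->fmh_keys[1], 0xff, sizeof(struct fsmap));

	do {
		if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
			perror("FS_IOC_GETFSMAP");
			return 1;
		}
		for (i = 0; i < head->fmh_entries; i++) {
			struct fsmap *rec = &head->fmh_recs[i];

			/* skip free space, count mapped bytes */
			if ((rec->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
			    rec->fmr_owner == FMR_OWN_FREE)
				continue;
			used += rec->fmr_length;
		}
		if (head->fmh_entries)
			fsmap_advance(head);	/* resume after last record */
	} while (head->fmh_entries == head->fmh_count);

	printf("mapped bytes: %llu\n", used);
	free(head);
	close(fd);
	return 0;
}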

> Maximum extents in a file 14
> Histogram of number of extents per file:
>     bucket =       count        % of total
> <=       1 =      350934        97.696 %
> <=       2 =        6231        1.735 %
> <=       4 =        1001        0.279 %
> <=       8 =         953        0.265 %
> <=      16 =          92        0.026 %
> Maximum file size 26.508 MB
> Histogram of file size:
>     bucket =    allocated           used        overhead(bytes)
> <=    4 KB =           0              62           314048512 0.13%
> <=    8 KB =           0          119911        127209263104 53.28%
> <=   16 KB =           0           14543         15350194176 6.43%
> <=   32 KB =         909           12330         11851161600 4.96%
> <=   64 KB =          92            6704          6828642304 2.86%
> <=  128 KB =           1            7132          6933372928 2.90%
> <=  256 KB =           0           10013          8753799168 3.67%
> <=  512 KB =           0           13616          9049227264 3.79%
> <=    1 MB =           1           15056          4774912000 2.00%
> <=    2 MB =      198662           17168          9690226688 4.06%
> <=    4 MB =       28639           21073         11806654464 4.94%
> <=    8 MB =       35169           29878         14200553472 5.95%
> <=   16 MB =       95667           91633         11939287040 5.00%
> <=   32 MB =          71              62            28471742 0.01%
> capacity used (bytes): 1097735533058 (1022.346 GB)
> capacity allocated (bytes): 1336497410048 (1.216 TB)
> block overhead (bytes): 238761885182 (21.750 %)

BTW, "bytes" as a display unit is stupidly verbose and largely
unnecessary. The byte count is /always/ going to be a multiple of
the filesystem block size, and the first thing anyone who wants to
use this for diagnosis is going to have to do is return the byte
count to filesystem blocks (which is what the filesystem itself
tracks everything in. ANd then when you have PB scale filesystems,
anything more than 3 significant digits is just impossible to read
and compare - that "overhead" column (what the "overhead" even
mean?) is largely impossible to read and determine what the actual
capacity used is without counting individual digits in each number.

FWIW, we already have extent histogram code in xfs_spaceman
(in spaceman/freesp.c) and in xfs_db (db/freesp.c) so we really
don't need re-implementation of the same functionality we already
have duplicate copies of. I'd suggest that the histogram code should
be factored and moved to libfrog/ and then enhanced if new histogram
functionality is required...
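
(Purely illustrative, with hypothetical names: a factored-out helper in
libfrog/ for the power-of-2 style buckets used here might look roughly
like the sketch below, so both xfs_db and xfs_spaceman could feed it.)

/* e.g. in a hypothetical libfrog/pow2hist.h */
#include <stdint.h>

struct pow2_hist {
	uint64_t	buckets[65];
	uint64_t	total;
};

static inline void
pow2_hist_add(
	struct pow2_hist	*h,
	uint64_t		val)
{
	unsigned int		b = 0;

	/* bucket b counts values <= 2^b */
	while (b < 64 && val > (1ULL << b))
		b++;
	h->buckets[b]++;
	h->total++;
}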

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-14 23:31 ` Dave Chinner
@ 2019-05-15  0:06   ` Eric Sandeen
  2019-05-15  2:05     ` Dave Chinner
  2019-05-15 16:15   ` Jorge Guerra
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2019-05-15  0:06 UTC (permalink / raw)
  To: Dave Chinner, Jorge Guerra; +Cc: linux-xfs, osandov, Jorge Guerra

On 5/14/19 6:31 PM, Dave Chinner wrote:
> On Tue, May 14, 2019 at 11:50:26AM -0700, Jorge Guerra wrote:
>> From: Jorge Guerra <jorgeguerra@fb.com>
>>
>> In this change we add two feature to the xfs_db 'frag' command:
>>
>> 1) Extent count histogram [-e]: This option enables tracking the
>>    number of extents per inode (file) as the we traverse the file
>>    system.  The end result is a histogram of the number of extents per
>>    file in power of 2 buckets.
>>
>> 2) File size histogram and file system internal fragmentation stats
>>    [-s]: This option enables tracking file sizes both in terms of what
>>    has been physically allocated and how much has been written to the
>>    file.  In addition, we track the amount of internal fragmentation
>>    seen per file.  This is particularly useful in the case of real
>>    time devices where space is allocated in units of fixed sized
>>    extents.
> 
> I can see the usefulness of having such information, but xfs_db is
> the wrong tool/interface for generating such usage reports.
> 
>> The man page for xfs_db has been updated to reflect these new command
>> line arguments.
>>
>> Tests:
>>
>> We tested this change on several XFS file systems with different
>> configurations:
>>
>> 1) regular XFS:
>>
>> [root@m1 ~]# xfs_info /mnt/d0
>> meta-data=/dev/sdb1              isize=256    agcount=10, agsize=268435455 blks
>>          =                       sectsz=4096  attr=2, projid32bit=1
>>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>>          =                       reflink=0
>> data     =                       bsize=4096   blocks=2441608704, imaxpct=100
>>          =                       sunit=0      swidth=0 blks
>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>> log      =internal log           bsize=4096   blocks=521728, version=2
>>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>> [root@m1 ~]# echo "frag -e -s" | xfs_db -r /dev/sdb1
>> xfs_db> actual 494393, ideal 489246, fragmentation factor 1.04%
> 
> For example, xfs_db is not the right tool for probing online, active
> filesystems.

yes, the usage example is poor.  (I almost wonder if we should disallow
certain operations with -r ...)

> It is not coherent with the active kernel filesystem,
> and is quite capable of walking off into la-la land as a result of
> mis-parsing the inconsistent filesystem that is on disk underneath
> active mounted filesystems. This does not make for a robust, usable
> tool, let alone one that can make use of things like rmap for
> querying usage and ownership information really quickly.
> 
> To solve this problem, we now have the xfs_spaceman tool and the
> GETFSMAP ioctl for running usage queries on mounted filesystems.
> That avoids all the coherency and crash problems, and for rmap
> enabled filesystems it does not require scanning the entire
> filesystem to work out this information (i.e. it can all be derived
> from the contents of the rmap tree).
> 
> So I'd much prefer that new online filesystem queries go into
> xfs-spaceman and use GETFSMAP so they can be accelerated on rmap
> configured filesystems rather than hoping xfs_db will parse the
> entire mounted filesystem correctly while it is being actively
> changed...

Yeah fair point.

>> Maximum extents in a file 14
>> Histogram of number of extents per file:
>>     bucket =       count        % of total
>> <=       1 =      350934        97.696 %
>> <=       2 =        6231        1.735 %
>> <=       4 =        1001        0.279 %
>> <=       8 =         953        0.265 %
>> <=      16 =          92        0.026 %
>> Maximum file size 26.508 MB
>> Histogram of file size:
>>     bucket =    allocated           used        overhead(bytes)
>> <=    4 KB =           0              62           314048512 0.13%
>> <=    8 KB =           0          119911        127209263104 53.28%
>> <=   16 KB =           0           14543         15350194176 6.43%
>> <=   32 KB =         909           12330         11851161600 4.96%
>> <=   64 KB =          92            6704          6828642304 2.86%
>> <=  128 KB =           1            7132          6933372928 2.90%
>> <=  256 KB =           0           10013          8753799168 3.67%
>> <=  512 KB =           0           13616          9049227264 3.79%
>> <=    1 MB =           1           15056          4774912000 2.00%
>> <=    2 MB =      198662           17168          9690226688 4.06%
>> <=    4 MB =       28639           21073         11806654464 4.94%
>> <=    8 MB =       35169           29878         14200553472 5.95%
>> <=   16 MB =       95667           91633         11939287040 5.00%
>> <=   32 MB =          71              62            28471742 0.01%
>> capacity used (bytes): 1097735533058 (1022.346 GB)
>> capacity allocated (bytes): 1336497410048 (1.216 TB)
>> block overhead (bytes): 238761885182 (21.750 %)
> 
> BTW, "bytes" as a display unit is stupidly verbose and largely
> unnecessary. The byte count is /always/ going to be a multiple of
> the filesystem block size, and the first thing anyone who wants to
> use this for diagnosis is going to have to do is return the byte
> count to filesystem blocks (which is what the filesystem itself
> tracks everything in. ANd then when you have PB scale filesystems,
> anything more than 3 significant digits is just impossible to read
> and compare - that "overhead" column (what the "overhead" even
> mean?) is largely impossible to read and determine what the actual
> capacity used is without counting individual digits in each number.

But if the whole point is trying to figure out "internal fragmentation"
then it's the only unit that makes sense, right?  This is the "15 bytes"
of a 15 byte file (or extent) allocated into a 4k block.

OTOH, for any random file distribution it's going to trend towards half
a block, so I'm not sure how useful this is in the end.

(however your example seems to show roughly 200x the waste expected,
so I kind of wonder if that points to a bug somewhere in your patch...)
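
(A rough sanity check of that expectation, using approximate numbers from
the quoted histogram above and assuming the default 4 KiB block size; the
exact multiple depends on what is counted, but it lands in the low
hundreds either way:)

/* Back-of-envelope: expected waste at ~half a block per file vs. the
 * "block overhead" figure reported above.  Approximate numbers only. */
#include <stdio.h>

int main(void)
{
        long long nfiles   = 350934 + 6231 + 1001 + 953 + 92;  /* 359,211 */
        long long expected = nfiles * (4096 / 2);               /* ~0.7 GiB */
        long long reported = 238761885182LL;                    /* ~222 GiB */

        printf("expected ~%lld MiB, reported ~%lld MiB, ratio ~%lldx\n",
               expected >> 20, reported >> 20, reported / expected);
        return 0;
}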

> FWIW, we already have extent histogram code in xfs_spaceman
> (in spaceman/freesp.c) and in xfs_db (db/freesp.c) so we really
> don't need a re-implementation of the same functionality we already
> have duplicate copies of. I'd suggest that the histogram code should
> be factored and moved to libfrog/ and then enhanced if new histogram
> functionality is required...

Also a fair point, I had forgotten about that.

Thanks,
-Eric

> Cheers,
> 
> Dave.
> 


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-15  0:06   ` Eric Sandeen
@ 2019-05-15  2:05     ` Dave Chinner
  2019-05-15 16:39       ` Jorge Guerra
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2019-05-15  2:05 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Jorge Guerra, linux-xfs, osandov, Jorge Guerra

On Tue, May 14, 2019 at 07:06:52PM -0500, Eric Sandeen wrote:
> On 5/14/19 6:31 PM, Dave Chinner wrote:
> > On Tue, May 14, 2019 at 11:50:26AM -0700, Jorge Guerra wrote:
> >> Maximum extents in a file 14
> >> Histogram of number of extents per file:
> >>     bucket =       count        % of total
> >> <=       1 =      350934        97.696 %
> >> <=       2 =        6231        1.735 %
> >> <=       4 =        1001        0.279 %
> >> <=       8 =         953        0.265 %
> >> <=      16 =          92        0.026 %
> >> Maximum file size 26.508 MB
> >> Histogram of file size:
> >>     bucket =    allocated           used        overhead(bytes)
> >> <=    4 KB =           0              62           314048512 0.13%
> >> <=    8 KB =           0          119911        127209263104 53.28%
> >> <=   16 KB =           0           14543         15350194176 6.43%
> >> <=   32 KB =         909           12330         11851161600 4.96%
> >> <=   64 KB =          92            6704          6828642304 2.86%
> >> <=  128 KB =           1            7132          6933372928 2.90%
> >> <=  256 KB =           0           10013          8753799168 3.67%
> >> <=  512 KB =           0           13616          9049227264 3.79%
> >> <=    1 MB =           1           15056          4774912000 2.00%
> >> <=    2 MB =      198662           17168          9690226688 4.06%
> >> <=    4 MB =       28639           21073         11806654464 4.94%
> >> <=    8 MB =       35169           29878         14200553472 5.95%
> >> <=   16 MB =       95667           91633         11939287040 5.00%
> >> <=   32 MB =          71              62            28471742 0.01%
> >> capacity used (bytes): 1097735533058 (1022.346 GB)
> >> capacity allocated (bytes): 1336497410048 (1.216 TB)
> >> block overhead (bytes): 238761885182 (21.750 %)
> > 
> > BTW, "bytes" as a display unit is stupidly verbose and largely
> > unnecessary. The byte count is /always/ going to be a multiple of
> > the filesystem block size, and the first thing anyone who wants to
> > use this for diagnosis is going to have to do is return the byte
> > count to filesystem blocks (which is what the filesystem itself
> > tracks everything in. ANd then when you have PB scale filesystems,
> > anything more than 3 significant digits is just impossible to read
> > and compare - that "overhead" column (what the "overhead" even
> > mean?) is largely impossible to read and determine what the actual
> > capacity used is without counting individual digits in each number.
> 
> But if the whole point is trying to figure out "internal fragmentation"
> then it's the only unit that makes sense, right?  This is the "15 bytes"
> of a 15 byte file (or extent) allocated into a 4k block.

Urk. I missed that - I saw "-s" and assumed that, like the other
extent histogram printing commands we have, it meant "print summary
information". i.e. the last 3 lines in the above output.

But the rest of it? It comes back to my comment "what does overhead
even mean"?  All it is a measure of how many bytes are allocated in
extents vs the file size. It assumes that if there is more bytes
allocated in extents than the file size, then the excess is "wasted
space".

This is not a measure of "internal fragmentation". It doesn't take
into account the fact we can (and do) allocate extents beyond EOF
that are there (temporarily or permanently) for the file to be
extended into without physically fragmenting the file. These can go
away at any time, so one scan might show massive "internal
fragmentation" and then a minute later after the EOF block scanner
runs there is none. i.e. without changing the file data, the layout
of the file within EOF, or file size, "internal fragmentation" can
just magically disappear.

It doesn't take into account sparse files. Well, it does by
ignoring them which is another flag that this isn't measuring
internal fragmentation because even sparse files can be internally
fragmented.

Which is another thing this doesn't take into account: the amount of
data actually written to the files. e.g. a preallocated, zero length
file is "internally fragmented" by this criteria, but the same empty
file with a file size that matches the preallocation is not
"internally fragmented". Yet an actual internally fragmented file
(e.g. preallocate 1MB, set size to 1MB, write 4k at 256k) will not
actually be noticed by this code....

IOWs, what is being reported here is exactly the same information
that "stat(blocks) vs stat(size)" will tell you, which makes me
wonder why the method of gathering it (full fs scan via xfs_db) is
being used when this could be done with a simple script based around
this:

$ find /mntpt -type f -exec stat -c "%s %b" {} \; | histogram_script
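
A minimal sketch of what such a histogram_script could look like (a
hypothetical helper, assuming "%s %b" pairs -- file size plus 512-byte
block count -- on stdin and power-of-2 size buckets starting at 4 KiB):

/* histogram_script.c: bucket "size blocks" pairs from stdin into
 * power-of-2 file size buckets and total up allocated-vs-size bytes. */
#include <stdio.h>

int main(void)
{
        unsigned long long size, blocks, alloc, overhead = 0;
        unsigned long long count[64] = { 0 };
        int b, maxb = 0;

        while (scanf("%llu %llu", &size, &blocks) == 2) {
                alloc = blocks * 512ULL;   /* st_blocks is in 512-byte units */
                if (alloc > size)
                        overhead += alloc - size;
                for (b = 0; b < 63 && size > (4096ULL << b); b++)
                        ;                  /* bucket 0 is <= 4 KiB */
                count[b]++;
                if (b > maxb)
                        maxb = b;
        }
        for (b = 0; b <= maxb; b++)
                printf("<= %10llu KB = %llu\n", (4096ULL << b) >> 10, count[b]);
        printf("allocated minus size: %llu bytes\n", overhead);
        return 0;
}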

I have no problems with adding analysis and reporting functionality
to the filesystem tools, but they have to be done the right way, and
not duplicate functionality and information that can be trivially
obtained from userspace with a script and basic utilities. IMO,
there has to be some substantial benefit from implementing the
functionality using deep, dark filesystem gubbins that can't be
achieved in any other way for it to be worth the additional code
maintenance burden....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-14 20:02 ` Eric Sandeen
@ 2019-05-15 15:57   ` Jorge Guerra
  2019-05-15 16:02     ` Eric Sandeen
  0 siblings, 1 reply; 14+ messages in thread
From: Jorge Guerra @ 2019-05-15 15:57 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs, Omar Sandoval, Jorge Guerra

Thanks Eric,

I'm addressing these comments.  Will send an update once we have an
agreement with Dave on how and where to implement this.

On Tue, May 14, 2019 at 1:02 PM Eric Sandeen <sandeen@sandeen.net> wrote:
>
> On 5/14/19 1:50 PM, Jorge Guerra wrote:
> > +             dbprintf(_("capacity used (bytes): %llu (%.3f %cB)\n"),
> > +             extstats.logicalused, answer, iec_prefixes[i]);
>
> I think I missed this instance of "indent please" and probably others...
>
> (I'm kind of wondering about carrying 'used' in bytes, but I suppose we're
> ok until we really get zettabyte filesystems in the wild) ;)
>
> -Eric



-- 
Jorge E Guerra D


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-15 15:57   ` Jorge Guerra
@ 2019-05-15 16:02     ` Eric Sandeen
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Sandeen @ 2019-05-15 16:02 UTC (permalink / raw)
  To: Jorge Guerra; +Cc: linux-xfs, Omar Sandoval, Jorge Guerra

On 5/15/19 10:57 AM, Jorge Guerra wrote:
> Thanks Eric,
> 
> I'm addressing these comments.  Will send an update once we have an
> agreement with Dave on how and where to implement this.

Might want to give Dave's concerns thought before doing too much
editing of this patch, but it's up to you.  :)

(same style comments will apply to any solution, though)

Thanks,
-Eric

> On Tue, May 14, 2019 at 1:02 PM Eric Sandeen <sandeen@sandeen.net> wrote:
>>
>> On 5/14/19 1:50 PM, Jorge Guerra wrote:
>>> +             dbprintf(_("capacity used (bytes): %llu (%.3f %cB)\n"),
>>> +             extstats.logicalused, answer, iec_prefixes[i]);
>>
>> I think I missed this instance of "indent please" and probably others...
>>
>> (I'm kind of wondering about carrying 'used' in bytes, but I suppose we're
>> ok until we really get zettabyte filesystems in the wild) ;)
>>
>> -Eric
> 
> 
> 


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-14 23:31 ` Dave Chinner
  2019-05-15  0:06   ` Eric Sandeen
@ 2019-05-15 16:15   ` Jorge Guerra
  2019-05-15 16:24     ` Eric Sandeen
  1 sibling, 1 reply; 14+ messages in thread
From: Jorge Guerra @ 2019-05-15 16:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Omar Sandoval, Jorge Guerra

Thanks Dave,

I appreciate you taking the time to review and comment.

On Tue, May 14, 2019 at 4:31 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, May 14, 2019 at 11:50:26AM -0700, Jorge Guerra wrote:
> > From: Jorge Guerra <jorgeguerra@fb.com>
> >
> > In this change we add two feature to the xfs_db 'frag' command:
> >
> > 1) Extent count histogram [-e]: This option enables tracking the
> >    number of extents per inode (file) as the we traverse the file
> >    system.  The end result is a histogram of the number of extents per
> >    file in power of 2 buckets.
> >
> > 2) File size histogram and file system internal fragmentation stats
> >    [-s]: This option enables tracking file sizes both in terms of what
> >    has been physically allocated and how much has been written to the
> >    file.  In addition, we track the amount of internal fragmentation
> >    seen per file.  This is particularly useful in the case of real
> >    time devices where space is allocated in units of fixed sized
> >    extents.
>
> I can see the usefulness of having such information, but xfs_db is
> the wrong tool/interface for generating such usage reports.
>
> > The man page for xfs_db has been updated to reflect these new command
> > line arguments.
> >
> > Tests:
> >
> > We tested this change on several XFS file systems with different
> > configurations:
> >
> > 1) regular XFS:
> >
> > [root@m1 ~]# xfs_info /mnt/d0
> > meta-data=/dev/sdb1              isize=256    agcount=10, agsize=268435455 blks
> >          =                       sectsz=4096  attr=2, projid32bit=1
> >          =                       crc=0        finobt=0, sparse=0, rmapbt=0
> >          =                       reflink=0
> > data     =                       bsize=4096   blocks=2441608704, imaxpct=100
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > [root@m1 ~]# echo "frag -e -s" | xfs_db -r /dev/sdb1
> > xfs_db> actual 494393, ideal 489246, fragmentation factor 1.04%
>
> For example, xfs_db is not the right tool for probing online, active
> filesystems. It is not coherent with the active kernel filesystem,
> and is quite capable of walking off into la-la land as a result of
> mis-parsing the inconsistent filesystem that is on disk underneath
> active mounted filesystems. This does not make for a robust, usable
> tool, let alone one that can make use of things like rmap for
> querying usage and ownership information really quickly.

I see your point, that the FS is constantly changing and that we might
see an inconsistent view.  But if we are generating bucketed
histograms we are approximating the stats anyway.

> To solve this problem, we now have the xfs_spaceman tool and the
> GETFSMAP ioctl for running usage queries on mounted filesystems.
> That avoids all the coherency and crash problems, and for rmap
> enabled filesystems it does not require scanning the entire
> filesystem to work out this information (i.e. it can all be derived
> from the contents of the rmap tree).
>
> So I'd much prefer that new online filesystem queries go into
> xfs-spaceman and use GETFSMAP so they can be accelerated on rmap
> configured filesystems rather than hoping xfs_db will parse the
> entire mounted filesystem correctly while it is being actively
> changed...

Good to know, I wasn't aware of this tool.  However, it seems like I
don't have that ioctl in my systems yet :(

# xfs_spaceman /mnt/d0
xfs_spaceman> frespc
command "frespc" not found
xfs_spaceman> fresp
command "fresp" not found
xfs_spaceman> freesp
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
xfs_spaceman: FS_IOC_GETFSMAP ["/mnt/d0"]: Inappropriate ioctl for device
   from      to extents  blocks    pct
xfs_spaceman>

One other thing: if we go this route, then we would need to issue an
ioctl for every file, right?  Wouldn't this be much slower?

>
> > Maximum extents in a file 14
> > Histogram of number of extents per file:
> >     bucket =       count        % of total
> > <=       1 =      350934        97.696 %
> > <=       2 =        6231        1.735 %
> > <=       4 =        1001        0.279 %
> > <=       8 =         953        0.265 %
> > <=      16 =          92        0.026 %
> > Maximum file size 26.508 MB
> > Histogram of file size:
> >     bucket =    allocated           used        overhead(bytes)
> > <=    4 KB =           0              62           314048512 0.13%
> > <=    8 KB =           0          119911        127209263104 53.28%
> > <=   16 KB =           0           14543         15350194176 6.43%
> > <=   32 KB =         909           12330         11851161600 4.96%
> > <=   64 KB =          92            6704          6828642304 2.86%
> > <=  128 KB =           1            7132          6933372928 2.90%
> > <=  256 KB =           0           10013          8753799168 3.67%
> > <=  512 KB =           0           13616          9049227264 3.79%
> > <=    1 MB =           1           15056          4774912000 2.00%
> > <=    2 MB =      198662           17168          9690226688 4.06%
> > <=    4 MB =       28639           21073         11806654464 4.94%
> > <=    8 MB =       35169           29878         14200553472 5.95%
> > <=   16 MB =       95667           91633         11939287040 5.00%
> > <=   32 MB =          71              62            28471742 0.01%
> > capacity used (bytes): 1097735533058 (1022.346 GB)
> > capacity allocated (bytes): 1336497410048 (1.216 TB)
> > block overhead (bytes): 238761885182 (21.750 %)
>
> BTW, "bytes" as a display unit is stupidly verbose and largely
> unnecessary. The byte count is /always/ going to be a multiple of
> the filesystem block size, and the first thing anyone who wants to
> use this for diagnosis is going to have to do is return the byte
> count to filesystem blocks (which is what the filesystem itself
> tracks everything in. ANd then when you have PB scale filesystems,
> anything more than 3 significant digits is just impossible to read
> and compare - that "overhead" column (what the "overhead" even
> mean?) is largely impossible to read and determine what the actual
> capacity used is without counting individual digits in each number.

Sure, I'll remove the bytes and display them in human-readable units.

>
> FWIW, we already have extent histogram code in xfs_spaceman
> (in spaceman/freesp.c) and in xfs_db (db/freesp.c) so we really
> don't need a re-implementation of the same functionality we already

Both of these tools query the free space; the tool in this patch queries
the opposite: the size of the allocated extents and the count of extents
per file.

> have duplicate copies of. I'd suggest that the histogram code should
> be factored and moved to libfrog/ and then enhanced if new histogram
> functionality is required...

Makes sense, will do!

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Jorge E Guerra D


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-15 16:15   ` Jorge Guerra
@ 2019-05-15 16:24     ` Eric Sandeen
  2019-05-15 16:47       ` Jorge Guerra
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2019-05-15 16:24 UTC (permalink / raw)
  To: Jorge Guerra, Dave Chinner; +Cc: linux-xfs, Omar Sandoval, Jorge Guerra

On 5/15/19 11:15 AM, Jorge Guerra wrote:
> Thanks Dave,
> 
> I appreciate you taking the time to review and comment.
> 
> On Tue, May 14, 2019 at 4:31 PM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Tue, May 14, 2019 at 11:50:26AM -0700, Jorge Guerra wrote:
>>> From: Jorge Guerra <jorgeguerra@fb.com>
>>>
>>> In this change we add two feature to the xfs_db 'frag' command:
>>>
>>> 1) Extent count histogram [-e]: This option enables tracking the
>>>    number of extents per inode (file) as the we traverse the file
>>>    system.  The end result is a histogram of the number of extents per
>>>    file in power of 2 buckets.
>>>
>>> 2) File size histogram and file system internal fragmentation stats
>>>    [-s]: This option enables tracking file sizes both in terms of what
>>>    has been physically allocated and how much has been written to the
>>>    file.  In addition, we track the amount of internal fragmentation
>>>    seen per file.  This is particularly useful in the case of real
>>>    time devices where space is allocated in units of fixed sized
>>>    extents.
>>
>> I can see the usefulness of having such information, but xfs_db is
>> the wrong tool/interface for generating such usage reports.
>>
>>> The man page for xfs_db has been updated to reflect these new command
>>> line arguments.
>>>
>>> Tests:
>>>
>>> We tested this change on several XFS file systems with different
>>> configurations:
>>>
>>> 1) regular XFS:
>>>
>>> [root@m1 ~]# xfs_info /mnt/d0
>>> meta-data=/dev/sdb1              isize=256    agcount=10, agsize=268435455 blks
>>>          =                       sectsz=4096  attr=2, projid32bit=1
>>>          =                       crc=0        finobt=0, sparse=0, rmapbt=0
>>>          =                       reflink=0
>>> data     =                       bsize=4096   blocks=2441608704, imaxpct=100
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>> [root@m1 ~]# echo "frag -e -s" | xfs_db -r /dev/sdb1
>>> xfs_db> actual 494393, ideal 489246, fragmentation factor 1.04%
>>
>> For example, xfs_db is not the right tool for probing online, active
>> filesystems. It is not coherent with the active kernel filesystem,
>> and is quite capable of walking off into la-la land as a result of
>> mis-parsing the inconsistent filesystem that is on disk underneath
>> active mounted filesystems. This does not make for a robust, usable
>> tool, let alone one that can make use of things like rmap for
>> querying usage and ownership information really quickly.
> 
> I see your point, that the FS is constantly changing and that we might
> see an inconsistent view.  But if we are generating bucketed
> histograms we are approximating the stats anyway.

I think that Dave's "inconsistency" concern is literal - if the on-disk
metadata is not consistent, you may wander into what looks like corruption
if you try to traverse every inode while mounted.
 
It's pretty much never valid for userspace to try to traverse or read
the filesystem while mounted.

>> To solve this problem, we now have the xfs_spaceman tool and the
>> GETFSMAP ioctl for running usage queries on mounted filesystems.
>> That avoids all the coherency and crash problems, and for rmap
>> enabled filesystems it does not require scanning the entire
>> filesystem to work out this information (i.e. it can all be derived
>> from the contents of the rmap tree).
>>
>> So I'd much prefer that new online filesystem queries go into
>> xfs-spaceman and use GETFSMAP so they can be accelerated on rmap
>> configured filesystems rather than hoping xfs_db will parse the
>> entire mounted filesystem correctly while it is being actively
>> changed...
> 
> Good to know, I wasn't aware of this tool.  However, it seems like I
> don't have that ioctl in my systems yet :(

It was added in 2017, in kernel-4.12 I believe.
What kernel did you test?

-Eric


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-15  2:05     ` Dave Chinner
@ 2019-05-15 16:39       ` Jorge Guerra
  2019-05-15 22:55         ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Jorge Guerra @ 2019-05-15 16:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs, Omar Sandoval, Jorge Guerra

On Tue, May 14, 2019 at 7:05 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, May 14, 2019 at 07:06:52PM -0500, Eric Sandeen wrote:
> > On 5/14/19 6:31 PM, Dave Chinner wrote:
> > > On Tue, May 14, 2019 at 11:50:26AM -0700, Jorge Guerra wrote:
> > >> Maximum extents in a file 14
> > >> Histogram of number of extents per file:
> > >>     bucket =       count        % of total
> > >> <=       1 =      350934        97.696 %
> > >> <=       2 =        6231        1.735 %
> > >> <=       4 =        1001        0.279 %
> > >> <=       8 =         953        0.265 %
> > >> <=      16 =          92        0.026 %
> > >> Maximum file size 26.508 MB
> > >> Histogram of file size:
> > >>     bucket =    allocated           used        overhead(bytes)
> > >> <=    4 KB =           0              62           314048512 0.13%
> > >> <=    8 KB =           0          119911        127209263104 53.28%
> > >> <=   16 KB =           0           14543         15350194176 6.43%
> > >> <=   32 KB =         909           12330         11851161600 4.96%
> > >> <=   64 KB =          92            6704          6828642304 2.86%
> > >> <=  128 KB =           1            7132          6933372928 2.90%
> > >> <=  256 KB =           0           10013          8753799168 3.67%
> > >> <=  512 KB =           0           13616          9049227264 3.79%
> > >> <=    1 MB =           1           15056          4774912000 2.00%
> > >> <=    2 MB =      198662           17168          9690226688 4.06%
> > >> <=    4 MB =       28639           21073         11806654464 4.94%
> > >> <=    8 MB =       35169           29878         14200553472 5.95%
> > >> <=   16 MB =       95667           91633         11939287040 5.00%
> > >> <=   32 MB =          71              62            28471742 0.01%
> > >> capacity used (bytes): 1097735533058 (1022.346 GB)
> > >> capacity allocated (bytes): 1336497410048 (1.216 TB)
> > >> block overhead (bytes): 238761885182 (21.750 %)
> > >
> > > BTW, "bytes" as a display unit is stupidly verbose and largely
> > > unnecessary. The byte count is /always/ going to be a multiple of
> > > the filesystem block size, and the first thing anyone who wants to
> > > use this for diagnosis is going to have to do is convert the byte
> > > count back to filesystem blocks (which is what the filesystem itself
> > > tracks everything in). And then when you have PB scale filesystems,
> > > anything more than 3 significant digits is just impossible to read
> > > and compare - that "overhead" column (what does "overhead" even
> > > mean?) makes it largely impossible to read and determine what the
> > > actual capacity used is without counting individual digits in each
> > > number.
> >
> > But if the whole point is trying to figure out "internal fragmentation"
> > then it's the only unit that makes sense, right?  This is the "15 bytes"
> > of a 15 byte file (or extent) allocated into a 4k block.
>
> Urk. I missed that - I saw "-s" and assumed that, like the other
> extent histogram printing commands we have, it meant "print summary
> information". i.e. the last 3 lines in the above output.
>
> But the rest of it? It comes back to my comment "what does overhead
> even mean"?  All it is a measure of how many bytes are allocated in
> extents vs the file size. It assumes that if there is more bytes
> allocated in extents than the file size, then the excess is "wasted
> space".

Yes, the way I interpret "wasted space" is that if we allocate space
to an inode and the space is not used, then it's labeled as wasted since
at that point we are consuming it and it's not available for immediate
use.

>
> This is not a measure of "internal fragmentation". It doesn't take
> into account the fact we can (and do) allocate extents beyond EOF
> that are there (temporarily or permanently) for the file to be
> extended into without physically fragmenting the file. These can go
> away at any time, so one scan might show massive "internal
> fragmentation" and then a minute later after the EOF block scanner
> runs there is none. i.e. without changing the file data, the layout
> of the file within EOF, or file size, "internal fragmentation" can
> just magically disappear.

I see; how much do we expect this to be (i.e. 1%, 10% of the file
size)?  In other words, what's the order of magnitude of the
"preemptive" allocation compared to the total space in the file system?

>
> It doesn't take into account sparse files. Well, it does by
> ignoring them which is another flag that this isn't measuring
> internal fragmentation because even sparse files can be internally
> fragmented.
>
> Which is another thing this doesn't take into account: the amount of
> data actually written to the files. e.g. a preallocated, zero length
> file is "internally fragmented" by this criteria, but the same empty
> file with a file size that matches the preallocation is not
> "internally fragmented". Yet an actual internally fragmented file
> (e.g. preallocate 1MB, set size to 1MB, write 4k at 256k) will not
> actually be noticed by this code....

Interesting, how can we better account for these?

>
> IOWs, what is being reported here is exactly the same information
> that "stat(blocks) vs stat(size)" will tell you, which makes me
> wonder why the method of gathering it (full fs scan via xfs_db) is
> being used when this could be done with a simple script based around
> this:
>
> $ find /mntpt -type f -exec stat -c "%s %b" {} \; | histogram_script

While it is true that this can be measured via a simple script, I'd
like to point out that it would be significantly less efficient; for
instance:

# time find /mnt/pt -type f -exec stat -c "%s %b" {} \; > /tmp/file-sizes

real    27m38.885s
user    3m29.774s
sys     17m9.272s

# echo "frag -s -e" | time /tmp/xfs_db -r /dev/sdb1
[...]
0.44user 2.48system 0:05.42elapsed 53%CPU (0avgtext+0avgdata 996000maxresident)k
2079416inputs+0outputs (0major+248446minor)pagefaults 0swaps

That's 5.4s vs +27 minutes without considering the time to build the histogram.

>
> I have no problems with adding analysis and reporting functionality
> to the filesystem tools, but they have to be done the right way, and
> not duplicate functionality and information that can be trivially
> obtained from userspace with a script and basic utilities. IMO,
> there has to be some substantial benefit from implementing the
> functionality using deep, dark filesystem gubbins that can't be
> achieved in any other way for it to be worth the additional code
> maintenance burden....

In my view, the efficiency gain should justify the need for this
tool.  And in fact this was our main motivation: we were using "du -s
--apparent-size" and comparing that to the result of "df" to estimate
FS overhead, but this method was consuming a lot more IO than we had
budgeted for.  With the proposed tool we reduced IO 15x compared to
the "du vs df" method and collected more information along the way.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Jorge E Guerra D


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-15 16:24     ` Eric Sandeen
@ 2019-05-15 16:47       ` Jorge Guerra
  2019-05-15 16:51         ` Eric Sandeen
  0 siblings, 1 reply; 14+ messages in thread
From: Jorge Guerra @ 2019-05-15 16:47 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Dave Chinner, linux-xfs, Omar Sandoval, Jorge Guerra

On Wed, May 15, 2019 at 9:24 AM Eric Sandeen <sandeen@sandeen.net> wrote:
> >> For example, xfs_db is not the right tool for probing online, active
> >> filesystems. It is not coherent with the active kernel filesystem,
> >> and is quite capable of walking off into la-la land as a result of
> >> mis-parsing the inconsistent filesystem that is on disk underneath
> >> active mounted filesystems. This does not make for a robust, usable
> >> tool, let alone one that can make use of things like rmap for
> >> querying usage and ownership information really quickly.
> >
> > I see your point, that the FS is constantly changing and that we might
> > see an inconsistent view.  But if we are generating bucketed
> > histograms we are approximating the stats anyway.
>
> I think that Dave's "inconsistency" concern is literal - if the on-disk
> metadata is not consistent, you may wander into what looks like corruption
> if you try to traverse every inode while mounted.
>
> It's pretty much never valid for userspace to try to traverse or read
> the filesystem while mounted.

Sure, I understand this point.  Then can we:

1) Abort the scan if we detect "corrupt" metadata; the user would then
either restart the scan or decide not to.
2) Have a mechanism which detects if the FS changed while the scan was
in progress and tells the user the results might be stale?

>
> >> To solve this problem, we now have the xfs_spaceman tool and the
> >> GETFSMAP ioctl for running usage queries on mounted filesystems.
> >> That avoids all the coherency and crash problems, and for rmap
> >> enabled filesystems it does not require scanning the entire
> >> filesystem to work out this information (i.e. it can all be derived
> >> from the contents of the rmap tree).
> >>
> >> So I'd much prefer that new online filesystem queries go into
> >> xfs-spaceman and use GETFSMAP so they can be accelerated on rmap
> >> configured filesystems rather than hoping xfs_db will parse the
> >> entire mounted filesystem correctly while it is being actively
> >> changed...
> >
> > Good to know, I wasn't aware of this tool.  However, it seems like I
> > don't have that ioctl in my systems yet :(
>
> It was added in 2017, in kernel-4.12 I believe.
> What kernel did you test?

Yep, that's it; we tested on 4.11.


-- 
Jorge E Guerra D


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-15 16:47       ` Jorge Guerra
@ 2019-05-15 16:51         ` Eric Sandeen
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Sandeen @ 2019-05-15 16:51 UTC (permalink / raw)
  To: Jorge Guerra; +Cc: Dave Chinner, linux-xfs, Omar Sandoval, Jorge Guerra

On 5/15/19 11:47 AM, Jorge Guerra wrote:
> On Wed, May 15, 2019 at 9:24 AM Eric Sandeen <sandeen@sandeen.net> wrote:
>>>> For example, xfs_db is not the right tool for probing online, active
>>>> filesystems. It is not coherent with the active kernel filesystem,
>>>> and is quite capable of walking off into la-la land as a result of
>>>> mis-parsing the inconsistent filesystem that is on disk underneath
>>>> active mounted filesystems. This does not make for a robust, usable
>>>> tool, let alone one that can make use of things like rmap for
>>>> querying usage and ownership information really quickly.
>>>
>>> I see your point, that the FS is constantly changing and that we might
>>> see an inconsistent view.  But if we are generating bucketed
>>> histograms we are approximating the stats anyway.
>>
>> I think that Dave's "inconsistency" concern is literal - if the on-disk
>> metadata is not consistent, you may wander into what looks like corruption
>> if you try to traverse every inode while mounted.
>>
>> It's pretty much never valid for userspace to try to traverse or read
>> the filesystem while mounted.
> 
> Sure, I understand this point.  Then can we:
> 
> 1) Abort the scan if we detect "corrupt" metadata; the user would then
> either restart the scan or decide not to.
> 2) Have a mechanism which detects if the FS changed while the scan was
> in progress and tells the user the results might be stale?

none of that should be shoehorned into xfs_db, tbh.  It's fine to use it
while unmounted.  If you want to gather these stats on a mounted filesystem,
xfs_db is the wrong tool for the job.  It's an offline inspection tool.
The fact that "-r" exists is because developers may need
it, but normal admin-facing tools should not be designed around it.

>>
>>>> To solve this problem, we now have the xfs_spaceman tool and the
>>>> GETFSMAP ioctl for running usage queries on mounted filesystems.
>>>> That avoids all the coherency and crash problems, and for rmap
>>>> enabled filesystems it does not require scanning the entire
>>>> filesystem to work out this information (i.e. it can all be derived
>>>> from the contents of the rmap tree).
>>>>
>>>> So I'd much prefer that new online filesystem queries go into
>>>> xfs-spaceman and use GETFSMAP so they can be accelerated on rmap
>>>> configured filesystems rather than hoping xfs_db will parse the
>>>> entire mounted filesystem correctly while it is being actively
>>>> changed...
>>>
>>> Good to know, I wasn't aware of this tool.  However, it seems like I
>>> don't have that ioctl in my systems yet :(
>>
>> It was added in 2017, in kernel-4.12 I believe.
>> What kernel did you test?
> 
> Yep, that's it; we tested on 4.11.

Catch up!  *grin*

-Eric


* Re: [PATCH] xfs_db: add extent count and file size histograms
  2019-05-15 16:39       ` Jorge Guerra
@ 2019-05-15 22:55         ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2019-05-15 22:55 UTC (permalink / raw)
  To: Jorge Guerra; +Cc: Eric Sandeen, linux-xfs, Omar Sandoval, Jorge Guerra

On Wed, May 15, 2019 at 09:39:01AM -0700, Jorge Guerra wrote:
> On Tue, May 14, 2019 at 7:05 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, May 14, 2019 at 07:06:52PM -0500, Eric Sandeen wrote:
> > > On 5/14/19 6:31 PM, Dave Chinner wrote:
> > This is not a measure of "internal fragmentation". It doesn't take
> > into account the fact we can (and do) allocate extents beyond EOF
> > that are there (temporarily or permanently) for the file to be
> > extended into without physically fragmenting the file. These can go
> > away at any time, so one scan might show massive "internal
> > fragmentation" and then a minute later after the EOF block scanner
> > runs there is none. i.e. without changing the file data, the layout
> > of the file within EOF, or file size, "internal fragmentation" can
> > just magically disappear.
> 
> I see; how much do we expect this to be (i.e. 1%, 10% of the file
> size)?  In other words, what's the order of magnitude of the
> "preemptive" allocation compared to the total space in the file system?

Speculative delalloc can be up to MAXEXTLEN on large files. It is
typically the size of the file again as the file is growing, i.e. if
the file is 64k, we'll preallocate 64k; if it's 1GB, we'll prealloc
1GB; if it's over 8GB (MAXEXTLEN on a 4k block size filesystem),
then we'll prealloc 8GB.
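
(A simplified model of that sizing rule, for illustration only -- the
actual kernel logic in xfs_iomap_prealloc_size() also considers free
space, quota and the throttling described below:)

#include <stdint.h>

#define MODEL_MAXEXTLEN_BYTES   (8ULL << 30)    /* 8 GiB at 4 KiB blocks */

/* speculative EOF prealloc ~= current file size, capped at MAXEXTLEN */
static uint64_t model_eof_prealloc(uint64_t isize)
{
        return isize < MODEL_MAXEXTLEN_BYTES ? isize : MODEL_MAXEXTLEN_BYTES;
}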

This typically is not removed when the file is closed - it is
typically removed when the file has not been modified for a few
minutes and the EOF block scanner runs over it, the inode is cycled
out of cache or we hit an ENOSPC condition, in which case the EOF
block scanner is run to clean up such prealloc before we attempt
allocation again. The amount of speculative prealloc is dialled back
as the filesystem gets nearer to ENOSPC (>95% capacity) or the user
starts to run out of quota space.

So, yes, it can be a large amount of space that is consumed
temporarily, but the amount is workload dependent. The
reality is that almost no-one notices that XFS does this or the
extent to which XFS makes liberal use of free space for
fragmentation avoidance...

Of course, the filesystem has no real control over user directed
preallocation beyond EOF (i.e. fallocate()) and we do not ever
remove that unless the user runs ftruncate(). Hence the space
beyond EOF might be a direct result of the applications that are
running and not filesystem behaviour related at all...

> > It doesn't take into account sparse files. Well, it does by
> > ignoring them which is another flag that this isn't measuring
> > internal fragmentation because even sparse files can be internally
> > fragmented.
> >
> > Which is another thing this doesn't take into account: the amount of
> > data actually written to the files. e.g. a preallocated, zero length
> > file is "internally fragmented" by this criteria, but the same empty
> > file with a file size that matches the preallocation is not
> > "internally fragmented". Yet an actual internally fragmented file
> > (e.g. preallocate 1MB, set size to 1MB, write 4k at 256k) will not
> > actually be noticed by this code....
> 
> Interesting, how can we better account for these?

If it is preallocated space, then you need to scan each file to
determine the ratio of written to unwritten extents in the file.
(i.e. allocated space that contains data vs allocated space that
does not contain data). Basically, you need something similar to
what xfs_fsr is doing to determine if files need defragmentation or
not...
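
One way to get at that ratio from userspace is to walk each file's extent
map and separate unwritten (preallocated) bytes from written ones.  A
sketch using the generic FIEMAP ioctl rather than the XFS-specific bmap
calls xfs_fsr itself uses:

/* unwritten_ratio.c: report written vs unwritten bytes for one file.
 * Sketch only: a single FIEMAP batch of 512 extents, no looping. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        unsigned long long written = 0, unwritten = 0;
        struct fiemap *fm;
        unsigned int i;
        int fd = open(argv[1], O_RDONLY);

        fm = calloc(1, sizeof(*fm) + 512 * sizeof(struct fiemap_extent));
        fm->fm_length = FIEMAP_MAX_OFFSET;
        fm->fm_extent_count = 512;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FS_IOC_FIEMAP");
                return 1;
        }
        for (i = 0; i < fm->fm_mapped_extents; i++) {
                if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_UNWRITTEN)
                        unwritten += fm->fm_extents[i].fe_length;
                else
                        written += fm->fm_extents[i].fe_length;
        }
        printf("%s: written %llu, unwritten %llu bytes\n",
               argv[1], written, unwritten);
        return 0;
}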

> > IOWs, what is being reported here is exactly the same information
> > that "stat(blocks) vs stat(size)" will tell you, which makes me
> > wonder why the method of gathering it (full fs scan via xfs_db) is
> > being used when this could be done with a simple script based around
> > this:
> >
> > $ find /mntpt -type f -exec stat -c "%s %b" {} \; | histogram_script
> 
> While it is true that this can be measured via a simple script, I'd
> like to point out that it would be significantly less efficient; for
> instance:
> 
> # time find /mnt/pt -type f -exec stat -c "%s %b" {} \; > /tmp/file-sizes
> 
> real    27m38.885s
> user    3m29.774s
> sys     17m9.272s

That's close on CPU bound, so I'm betting most of that time is in
fork/exec for the stat binary for each file that find pipes out.

e.g on my workstation, which has ~50,000 read iops @ QD=1 capability
on the root filesystem:

$  time sudo find / -type f > /tmp/files

real	0m8.707s
user	0m1.006s
sys	0m2.571s

$ wc -l /tmp/files
1832634 /tmp/files

So, 1.8m files traversed in under 9 seconds - this didn't stat the
inodes because of ftype. How long does the example script take?

$ time sudo find / -type f -exec stat -c "%s %b" {} \; > /tmp/file-sizes
<still waiting after 10 minutes>
.....

While we are waiting, lets just get rid of the fork/exec overhead,
eh?

$ time sudo sh -c 'find / -type f |xargs -d "\n" stat -c "%s %b" > /tmp/files-size-2'

real	0m4.712s
user	0m2.732s
sys	0m5.073s

Ok, a fair bunch of the directory hierarchy and inodes were cached,
but the actual stat takes almost no time at all and almost no CPU
usage.

Back to the fork-exec script, still waiting for it to finish,
despite the entire file set now residing in kernel memory.

Fmeh, I'm just going to kill it.

....
real	26m34.542s
user	18m58.183s
sys	7m38.125s
$ wc -l /tmp/file-sizes
1824062 /tmp/file-sizes

Oh, it was almost done. IOWs, the /implementation/ was the problem.

I think by now you understand that I gave an example of how the
information you are querying is already available through normal
POSIX APIs, not that it was the most optimal way of running that
query through those APIs.

It is, however, /trivial/ to do this traversal query at max IO
speed even using scripts - using xargs to batch arguments to
utilities to avoid fork/exec overhead is sysadmin 101 stuff...

> # echo "frag -s -e" | time /tmp/xfs_db -r /dev/sdb1
> [...]
> 0.44user 2.48system 0:05.42elapsed 53%CPU (0avgtext+0avgdata 996000maxresident)k
> 2079416inputs+0outputs (0major+248446minor)pagefaults 0swaps
> 
> That's 5.4s vs +27 minutes without considering the time to build the histogram.

Yup, but now you've explained why you are trying to use xfs_db in
inappropriate ways: performance.

Despite the fact that directory traversal can be fast, it is still not
the most IO efficient way to iterate inodes. xfs_db does that by
reading the inodes in ascending order from the AGI btrees, meaning
it's a single sequential pass across the filesystem to parse inodes
a chunk at a time.

The sad fact about all this is that we've been able to do this from
userspace with XFS since .... 1994.  It's called bulkstat. I say
this is sad because I've lost count of the number of times people
have wasted time trying to re-invent the wheel rather than just
asking the experts a simple question and being told about bulkstat
or GETFSMAP....

Yup, there's even a basic test program in fstests that outputs stat
information from bulkstat that you can filter to report the info you
are generating histograms from:

$ time sudo src/bstat / | grep -A1 "mode 01" | awk -e '/blksize/ { print $4, $6 }' > /tmp/bstat-sizes

real	0m11.317s
user	0m8.686s
sys	0m5.909s

Again, this is /not an optimal implementation/ but just an example
that this functionality is available to userspace. Targeted
implementations can be found in tools like xfs_fsr and xfsdump which
use bulkstat to find the inodes they need to operate on much faster
than a directory walk ever will achieve....
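
To make the bulkstat suggestion concrete, a minimal sketch using the v1
interface from the xfsprogs headers (XFS_IOC_FSBULKSTAT) that dumps the
per-inode size, block and extent counts a histogram would need:

/* bulkstat_sizes.c: walk every inode via bulkstat and print the raw
 * numbers for a size/extent histogram.  Sketch only; needs root and
 * has no error handling. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <xfs/xfs.h>            /* XFS_IOC_FSBULKSTAT, struct xfs_bstat */

int main(int argc, char **argv)
{
        struct xfs_fsop_bulkreq breq;
        struct xfs_bstat buf[1024];
        __u64 lastino = 0;
        __s32 ocount = 0;
        int fd = open(argv[1], O_RDONLY);       /* any fd on the filesystem */
        int i;

        memset(&breq, 0, sizeof(breq));
        breq.lastip = &lastino;
        breq.icount = 1024;
        breq.ubuffer = buf;
        breq.ocount = &ocount;

        while (ioctl(fd, XFS_IOC_FSBULKSTAT, &breq) == 0 && ocount > 0) {
                for (i = 0; i < ocount; i++) {
                        if (!S_ISREG(buf[i].bs_mode))
                                continue;
                        /* file size, allocated blocks, extent count */
                        printf("%lld %lld %u\n",
                               (long long)buf[i].bs_size,
                               (long long)buf[i].bs_blocks,
                               (unsigned int)buf[i].bs_extents);
                }
        }
        return 0;
}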

> > I have no problems with adding analysis and reporting functionality
> > to the filesystem tools, but they have to be done the right way, and
> > not duplicate functionality and information that can be trivially
> > obtained from userspace with a script and basic utilities. IMO,
> > there has to be some substantial benefit from implementing the
> > functionality using deep, dark filesystem gubbins that can't be
> > achieved in any other way for it to be worth the additional code
> > maintenance burden....
> 
> In my view, the efficiency gain should justify the need for this
> tool.

As I said, I have no objections to such functionality if it is /done
the right way/. I'm not arguing against providing such functionality
to users, I'm pointing out that the implementation has issues that
will cause problems for users that try to use this functionality
and trying to let you know how to implement it safely without giving
up any of the efficiency gains.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2019-05-15 22:55 UTC | newest]

Thread overview: 14+ messages
2019-05-14 18:50 [PATCH] xfs_db: add extent count and file size histograms Jorge Guerra
2019-05-14 19:52 ` Eric Sandeen
2019-05-14 20:02 ` Eric Sandeen
2019-05-15 15:57   ` Jorge Guerra
2019-05-15 16:02     ` Eric Sandeen
2019-05-14 23:31 ` Dave Chinner
2019-05-15  0:06   ` Eric Sandeen
2019-05-15  2:05     ` Dave Chinner
2019-05-15 16:39       ` Jorge Guerra
2019-05-15 22:55         ` Dave Chinner
2019-05-15 16:15   ` Jorge Guerra
2019-05-15 16:24     ` Eric Sandeen
2019-05-15 16:47       ` Jorge Guerra
2019-05-15 16:51         ` Eric Sandeen
