* [PATCH] [RFC] xfs: wire up aio_fsync method
@ 2014-06-12  8:34 Dave Chinner
  2014-06-12  8:56 ` [PATCH] [RFC] fs_mark: add asynchronous fsync Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-12  8:34 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We've had plenty of requests for an asynchronous fsync over the past
few years, and we've got the infrastructure there to do it. But
nobody has wired it up to test it. The common request we get from
userspace storage applications is to do a post-write pass over a set
of files that were just written (i.e. bulk background fsync) for
point-in-time checkpointing or flushing purposes.

So, just to see if I could brute force an effective implementation,
wire up aio_fsync, add a workqueue and push all the fsync calls off
to the workqueue. The workqueue allows parallel dispatch, switches
execution if an fsync blocks for any reason, and so on. Brute force
and very effective....

So, I hacked up fs_mark to enable fsync via the libaio io_fsync()
interface to run some tests. The quick test is:

	- write 10000 4k files into the cache
	- run a post write open-fsync-close pass (sync mode 5)
	- run 5 iterations
	- run a single thread, then 4 threads.
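
For reference, that corresponds to an fs_mark invocation roughly like
the following (the -A switch comes from the follow-up fs_mark patch,
and the target directory is simply wherever the test filesystem is
mounted):

	fs_mark -S 5 -A -L 5 -n 10000 -s 4096 -t 1 -d /mnt/scratch

with -t 4 for the four thread runs.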

First I ran it on a 500TB sparse filesystem on a SSD.

FSUse%        Count         Size    Files/sec     App Overhead
     0        10000         4096        599.1           153855
     0        20000         4096        739.2           151228
     0        30000         4096        672.2           152937
     0        40000         4096        719.9           150615
     0        50000         4096        708.4           154889

real    1m13.121s
user    0m0.825s
sys     0m11.024s

Runs at around 500 log forces a second and 1500 IOPS.

Using io_fsync():

FSUse%        Count         Size    Files/sec     App Overhead
     0        10000         4096       2700.5           130313
     0        20000         4096       3938.8           133602
     0        30000         4096       4608.7           107871
     0        40000         4096       4768.4            82965
     0        50000         4096       4615.0            89220

real    0m12.691s
user    0m0.460s
sys     0m7.389s

Runs at around 4,000 log forces a second and 4500 IOPS. Massive
reduction in runtime through parallel dispatch of the fsync calls.

Run the same workload, 4 threads at a time. Normal fsync:

FSUse%        Count         Size    Files/sec     App Overhead
     0        40000         4096       2151.5           617010
     0        80000         4096       1953.0           613470
     0       120000         4096       1874.4           625027
     0       160000         4096       1907.4           624319
     0       200000         4096       1924.3           627567

real    1m42.243s
user    0m3.552s
sys     0m49.118s

Runs at ~2000 log forces/s and 3,500 IOPS.

Using io_fsync():

FSUse%        Count         Size    Files/sec     App Overhead
     0        40000         4096      11518.9           427666
     0        80000         4096      15668.8           401661
     0       120000         4096      15607.0           382279
     0       160000         4096      14935.0           399097
     0       200000         4096      15198.6           413965

real    0m14.192s
user    0m1.891s
sys     0m30.136s

Almost perfect scaling! ~15,000 log forces a second and ~20,000 IOPS.

Now run the tests on a HW RAID0 of spinning disk:

Threads		files/s	   run time	log force/s	IOPS
 1, fsync	  800	    1m 5.1s	   800		 1500
 1, io_fsync	 6000	       8.4s	  5000		 5500
 4, fsync	 1800	    1m47.1s	  2200		 3500
 4, io_fsync	19000	      10.3s	 21000		26000

Pretty much the same results. Spinning disks don't scale much
further. The SSD can go a bit higher, with 8 threads generating
a consistent 24,000 files/s, but at that point we're starting to see
non-linear system CPU usage (probably lock contention in the log).

But, regardless, there's a massive potential for speed gains for
applications that need to do bulk fsync operations and don't need to
care about the IO latency of individual fsync operations....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 41 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h |  2 ++
 fs/xfs/xfs_super.c |  9 +++++++++
 3 files changed, 52 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 077bcc8..9cdecee 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -45,6 +45,7 @@
 #include <linux/pagevec.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+struct workqueue_struct *xfs_aio_fsync_wq;
 
 /*
  * Locking primitives for read and write IO paths to ensure we consistently use
@@ -228,6 +229,45 @@ xfs_file_fsync(
 	return error;
 }
 
+struct xfs_afsync_args {
+	struct work_struct	work;
+	struct kiocb		*iocb;
+	struct file		*file;
+	int			datasync;
+};
+
+STATIC void
+xfs_file_aio_fsync_work(
+	struct work_struct	*work)
+{
+	struct xfs_afsync_args	*args = container_of(work,
+						struct xfs_afsync_args, work);
+	int			error;
+
+	error = xfs_file_fsync(args->file, 0, -1LL, args->datasync);
+	aio_complete(args->iocb, error, 0);
+	kmem_free(args);
+}
+
+STATIC int
+xfs_file_aio_fsync(
+	struct kiocb		*iocb,
+	int			datasync)
+{
+	struct xfs_afsync_args	*args;
+
+	args = kmem_zalloc(sizeof(struct xfs_afsync_args), KM_SLEEP|KM_MAYFAIL);
+	if (!args)
+		return -ENOMEM;
+
+	INIT_WORK(&args->work, xfs_file_aio_fsync_work);
+	args->iocb = iocb;
+	args->file = iocb->ki_filp;
+	args->datasync = datasync;
+	queue_work(xfs_aio_fsync_wq, &args->work);
+	return -EIOCBQUEUED;
+}
+
 STATIC ssize_t
 xfs_file_aio_read(
 	struct kiocb		*iocb,
@@ -1475,6 +1515,7 @@ const struct file_operations xfs_file_operations = {
 	.open		= xfs_file_open,
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
+	.aio_fsync	= xfs_file_aio_fsync,
 	.fallocate	= xfs_file_fallocate,
 };
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7295a0b..dfcf37b 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -390,6 +390,8 @@ extern int	xfs_dev_is_read_only(struct xfs_mount *, char *);
 
 extern void	xfs_set_low_space_thresholds(struct xfs_mount *);
 
+extern struct workqueue_struct *xfs_aio_fsync_wq;
+
 #endif	/* __KERNEL__ */
 
 #endif	/* __XFS_MOUNT_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f2e5f8a..86d4923 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1718,12 +1718,21 @@ xfs_init_workqueues(void)
 	if (!xfs_alloc_wq)
 		return -ENOMEM;
 
+	xfs_aio_fsync_wq = alloc_workqueue("xfsfsync", 0, 0);
+	if (!xfs_aio_fsync_wq)
+		goto destroy_alloc_wq;
+
 	return 0;
+
+destroy_alloc_wq:
+	destroy_workqueue(xfs_alloc_wq);
+	return -ENOMEM;
 }
 
 STATIC void
 xfs_destroy_workqueues(void)
 {
+	destroy_workqueue(xfs_aio_fsync_wq);
 	destroy_workqueue(xfs_alloc_wq);
 }
 
-- 
2.0.0

* [PATCH] [RFC] fs_mark: add asynchronous fsync
  2014-06-12  8:34 [PATCH] [RFC] xfs: wire up aio_fsync method Dave Chinner
@ 2014-06-12  8:56 ` Dave Chinner
  2014-06-12 14:13 ` [PATCH] [RFC] xfs: wire up aio_fsync method Christoph Hellwig
  2014-06-12 15:24 ` Brian Foster
  2 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-12  8:56 UTC (permalink / raw)
  To: xfs; +Cc: rwheeler

From: Dave Chinner <dchinner@redhat.com>

To be able to test the performance impact of asynchronous bulk
fsync of newly created files, add an option to fs_mark to use
libaio for dispatch and collection of fsync operations. This
requires a filesystem that has wired up IO_CMD_FSYNC in the kernel.

This is all a bit hacky, but it does wait for all fsyncs to
complete. aio contexts have to be set up after the runner threads
have been forked, so it's not really using a global iocb and event
array as the code might suggest - there's one per child process
doing work, and hence there's no need for locking or other fancy
stuff.

No attempt has been made to tune the number of AIOs allowed to be
in flight, nor to optimise the collection of completion events, nor
to return errors if an fsync fails. It works well enough for testing
the new XFS code and that's all I need from it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

--- a/fs_mark.c	2011-07-30 10:09:43.000000000 +1000
+++ b/fs_mark.c	2014-06-12 18:01:43.000000000 +1000
@@ -41,6 +41,7 @@
 #include <dirent.h>
 #include <ctype.h>
 #include <time.h>
+#include <libaio.h>
 
 #include <linux/types.h>
 #include <linux/limits.h>
@@ -68,6 +69,7 @@
 		"\t-S Sync Method (0:No Sync, 1:fsyncBeforeClose, "
 		"2:sync/1_fsync, 3:PostReverseFsync, "
 		"4:syncPostReverseFsync, 5:PostFsync, 6:syncPostFsync)\n",
+		"\t-A <use aio_fsync>\n"
 		"\t[-D number (of subdirectories)]\n",
 		"\t[-N number (of files in each subdirectory in Round Robin mode)]\n",
 		"\t[-d dir1 ... -d dirN]\n", "\t[-l log_file_name]\n",
@@ -84,18 +86,101 @@
 }
 
 /*
+ * aio for fsync
+ */
+#define MAX_AIO_EVENTS 1024
+io_context_t ctxp;
+struct iocb *iocbs[MAX_AIO_EVENTS];
+struct io_event ioevents[MAX_AIO_EVENTS];
+int aio_next;
+int aio_flight;
+
+void
+init_aio_fsync(void)
+{
+	int i, r;
+
+        memset(&ctxp, 0, sizeof(ctxp));
+	r = io_setup(MAX_AIO_EVENTS, &ctxp);
+	if (r) {
+		printf("FAIL! io_setup returned %d\n", r);
+		cleanup_exit();
+	}
+	for (i = 0; i < MAX_AIO_EVENTS; ++i) {
+		iocbs[i] = calloc(1, sizeof(struct iocb));
+		if (iocbs[i] == NULL) {
+			printf("failed to allocate an iocb\n");
+			cleanup_exit();
+		}
+	}
+
+}
+
+void
+get_fsync_completions(int threshold)
+{
+	int i, r;
+
+	if (!(sync_method & FSYNC_ASYNC))
+		return;
+
+	while (aio_flight > threshold) {
+		/* gather up some completions */
+		r = io_getevents(ctxp, 1, MAX_AIO_EVENTS, ioevents, NULL);
+		if (r < 0)  {
+			printf("FAIL! io_getevents returned %d\n", r);
+			cleanup_exit();
+		}
+
+		aio_flight -= r;
+		for (i = 0; i < r; ++i) {
+			if (ioevents[i].res)
+				printf("FAIL! aio_fsync returned %d\n",
+					ioevents[i].res);
+		}
+		usleep(1000);
+	}
+}
+
+int
+do_fsync(int fd)
+{
+	int r, i;
+
+	if (!(sync_method & FSYNC_ASYNC))
+		return fsync(fd);
+
+
+	get_fsync_completions(MAX_AIO_EVENTS / 2);
+
+	/* submit the fsync */
+	i = aio_next++ % MAX_AIO_EVENTS;
+	aio_flight++;
+	r = io_fsync(ctxp, iocbs[i], NULL, fd);
+	if (r != 1) {
+		printf("FAIL! io_submit returned %d\n", r);
+		cleanup_exit();
+	}
+	return 0;
+}
+
+/*
  * Run through the specified arguments and make sure that they make sense.
  */
 void process_args(int argc, char **argv, char **envp)
 {
 	int ret;
+	int fsync_async = 0;
 
 	/*
 	 * Parse all of the options that the user specified.
 	 */
 	while ((ret =
-		getopt(argc, argv, "vhkFr:S:N:D:d:l:L:n:p:s:t:w:")) != EOF) {
+		getopt(argc, argv, "AvhkFr:S:N:D:d:l:L:n:p:s:t:w:")) != EOF) {
 		switch (ret) {
+		case 'A':
+			fsync_async++;
+			break;
 		case 'v':	/* verbose stats */
 			verbose_stats = 1;
 			break;
@@ -250,6 +335,8 @@
 			" for -N num_per_subdir to make sense\n");
 		usage();
 	}
+	if (fsync_async)
+		sync_method |= FSYNC_ASYNC;
 
 	/*
 	 * We need at least one thread per specified directory.
@@ -650,6 +737,8 @@
 	close_usec = max_close_usec = min_close_usec = 0ULL;
 	unlink_usec = max_unlink_usec = min_unlink_usec = 0ULL;
 
+	init_aio_fsync();
+
 	/*
 	 * MAIN FILE WRITE LOOP:
 	 * This loop measures the specific steps in creating files:
@@ -714,7 +803,7 @@
 		if (sync_method & FSYNC_BEFORE_CLOSE) {
 			start(0);
 
-			if (fsync(fd) == -1) {
+			if (do_fsync(fd) == -1) {
 				fprintf(stderr, "fs_mark: fsync failed %s\n",
 					strerror(errno));
 				cleanup_exit();
@@ -775,7 +864,7 @@
 				cleanup_exit();
 			}
 
-			if (fsync(fd) == -1) {
+			if (do_fsync(fd) == -1) {
 				fprintf(stderr, "fs_mark: fsync failed %s\n",
 					strerror(errno));
 				cleanup_exit();
@@ -813,7 +902,7 @@
 				cleanup_exit();
 			}
 
-			if (fsync(fd) == -1) {
+			if (do_fsync(fd) == -1) {
 				fprintf(stderr, "fs_mark: fsync failed %s\n",
 					strerror(errno));
 				cleanup_exit();
@@ -849,7 +938,7 @@
 			cleanup_exit();
 		}
 
-		if (fsync(fd) == -1) {
+		if (do_fsync(fd) == -1) {
 			fprintf(stderr, "fs_mark: fsync failed %s\n",
 				strerror(errno));
 			cleanup_exit();
@@ -859,6 +948,8 @@
 		fsync_usec += stop(0, 0);
 	}
 
+	get_fsync_completions(0);
+
 	/*
 	 * Record the total time spent in the file writing loop - we ignore the time spent unlinking files
 	 */
@@ -1209,6 +1300,8 @@
 		fs_mark_version, num_threads, ctime(&time_run));
 	fprintf(log_fp, "#\tSync method: %s\n",
 		sync_policy_string[sync_method_type]);
+	if (sync_method & FSYNC_ASYNC)
+		fprintf(log_fp, "#\tUsing aio_fsync\n");
 	if (num_subdirs > 1) {
 		fprintf(log_fp,
 			"#\tDirectories:  %s across %d subdirectories with %d %s.\n",
--- a/fs_mark.h	2011-07-30 10:09:43.000000000 +1000
+++ b/fs_mark.h	2014-06-12 18:01:43.000000000 +1000
@@ -70,6 +70,7 @@
 #define FSYNC_FIRST_FILE	(0x4)
 #define FSYNC_POST_REVERSE	(0x8)
 #define FSYNC_POST_IN_ORDER	(0x10)
+#define FSYNC_ASYNC		(0x20)
 
 
 #define SYNC_TEST_NONE		(0)	    					/* -S 0 */

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-12  8:34 [PATCH] [RFC] xfs: wire up aio_fsync method Dave Chinner
  2014-06-12  8:56 ` [PATCH] [RFC] fs_mark: add asynchronous fsync Dave Chinner
@ 2014-06-12 14:13 ` Christoph Hellwig
  2014-06-12 23:44   ` Dave Chinner
  2014-06-12 15:24 ` Brian Foster
  2 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2014-06-12 14:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

There doesn't really seem to be anything XFS specific here, so
instead of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in
fs/aio.c based on the workqueue and ->fsync.
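
For concreteness, a rough sketch of what that could look like - it is
essentially the XFS patch above with vfs_fsync() substituted for
xfs_file_fsync(); the function names and the workqueue here are
illustrative placeholders, not existing fs/aio.c code:

struct aio_fsync_args {
	struct work_struct	work;
	struct kiocb		*iocb;
	int			datasync;
};

static void aio_fsync_work(struct work_struct *work)
{
	struct aio_fsync_args *args =
			container_of(work, struct aio_fsync_args, work);
	int error;

	/* vfs_fsync() is the generic wrapper around ->fsync */
	error = vfs_fsync(args->iocb->ki_filp, args->datasync);
	aio_complete(args->iocb, error, 0);
	kfree(args);
}

static int aio_queue_fsync(struct kiocb *iocb, int datasync)
{
	struct aio_fsync_args *args;

	args = kzalloc(sizeof(*args), GFP_KERNEL);
	if (!args)
		return -ENOMEM;

	INIT_WORK(&args->work, aio_fsync_work);
	args->iocb = iocb;
	args->datasync = datasync;

	/* aio_fsync_wq would be a workqueue allocated at aio init time */
	queue_work(aio_fsync_wq, &args->work);
	return -EIOCBQUEUED;
}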

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-12  8:34 [PATCH] [RFC] xfs: wire up aio_fsync method Dave Chinner
  2014-06-12  8:56 ` [PATCH] [RFC] fs_mark: add asynchronous fsync Dave Chinner
  2014-06-12 14:13 ` [PATCH] [RFC] xfs: wire up aio_fsync method Christoph Hellwig
@ 2014-06-12 15:24 ` Brian Foster
  2014-06-12 23:57   ` Dave Chinner
  2 siblings, 1 reply; 39+ messages in thread
From: Brian Foster @ 2014-06-12 15:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jun 12, 2014 at 06:34:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We've had plenty of requests for an asynchronous fsync over the past
> few years, and we've got the infrastructure there to do it. But
> nobody has wired it up to test it. The common request we get from
> userspace storage applications is to do a post-write pass over a set
> of files that were just written (i.e. bulk background fsync) for
> point-in-time checkpointing or flushing purposes.
> 
> So, just to see if I could brute force an effective implementation,
> wire up aio_fsync, add a workqueue and push all the fsync calls off
> to the workqueue. The workqueue will allow parallel dispatch, switch
> execution if a fsync blocks for any reason, etc. Brute force and
> very effective....
> 
> So, I hacked up fs_mark to enable fsync via the libaio io_fsync()
> interface to run some tests. The quick test is:
> 
> 	- write 10000 4k files into the cache
> 	- run a post write open-fsync-close pass (sync mode 5)
> 	- run 5 iterations
> 	- run a single thread, then 4 threads.
> 
> First I ran it on a 500TB sparse filesystem on a SSD.
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        10000         4096        599.1           153855
>      0        20000         4096        739.2           151228
>      0        30000         4096        672.2           152937
>      0        40000         4096        719.9           150615
>      0        50000         4096        708.4           154889
> 
> real    1m13.121s
> user    0m0.825s
> sys     0m11.024s
> 
> Runs at around 500 log forces a second and 1500 IOPS.
> 
> Using io_fsync():
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        10000         4096       2700.5           130313
>      0        20000         4096       3938.8           133602
>      0        30000         4096       4608.7           107871
>      0        40000         4096       4768.4            82965
>      0        50000         4096       4615.0            89220
> 
> real    0m12.691s
> user    0m0.460s
> sys     0m7.389s
> 
> Runs at around 4,000 log forces a second and 4500 IOPS. Massive
> reduction in runtime through parallel dispatch of the fsync calls.
> 
> Run the same workload, 4 threads at a time. Normal fsync:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        40000         4096       2151.5           617010
>      0        80000         4096       1953.0           613470
>      0       120000         4096       1874.4           625027
>      0       160000         4096       1907.4           624319
>      0       200000         4096       1924.3           627567
> 
> real    1m42.243s
> user    0m3.552s
> sys     0m49.118s
> 
> Runs at ~2000 log forces/s and 3,500 IOPS.
> 
> Using io_fsync():
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        40000         4096      11518.9           427666
>      0        80000         4096      15668.8           401661
>      0       120000         4096      15607.0           382279
>      0       160000         4096      14935.0           399097
>      0       200000         4096      15198.6           413965
> 
> real    0m14.192s
> user    0m1.891s
> sys     0m30.136s
> 
> Almost perfect scaling! ~15,000 log forces a second and ~20,000 IOPS.
> 
> Now run the tests on a HW RAID0 of spinning disk:
> 
> Threads		files/s	   run time	log force/s	IOPS
>  1, fsync	  800	    1m 5.1s	   800		 1500
>  1, io_fsync	 6000	       8.4s	  5000		 5500
>  4, fsync	 1800	    1m47.1s	  2200		 3500
>  4, io_fsync	19000	      10.3s	 21000		26000
> 
> Pretty much the same results. Spinning disks don't scale much
> further. The SSD can go a bit higher, with 8 threads generating
> a consistent 24,000 files/s, but at that point we're starting to see
> non-linear system CPU usage (probably lock contention in the log).
> 
> But, regardless, there's a massive potential for speed gains for
> applications that need to do bulk fsync operations and don't need to
> care about the IO latency of individual fsync operations....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

That looks great. This is something that could be quite beneficial to
glusterfs, as a real world example. The replication mechanism does an
xattr dance across servers and required the addition of fsyncs into
the algorithm to ensure correctness in the case of failures. This had
a notable impact on performance.

We thought a bit about hooking up aio_fsync(), more along the lines
of waiting for the log to force rather than forcing it explicitly,
but didn't really go anywhere with it. I didn't consider we'd get
such a benefit from simply dropping it into a workqueue. :)

I do like Christoph's idea... perhaps create a generic_file_aio_fsync()
or some such?

Brian

>  fs/xfs/xfs_file.c  | 41 +++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h |  2 ++
>  fs/xfs/xfs_super.c |  9 +++++++++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 077bcc8..9cdecee 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -45,6 +45,7 @@
>  #include <linux/pagevec.h>
>  
>  static const struct vm_operations_struct xfs_file_vm_ops;
> +struct workqueue_struct *xfs_aio_fsync_wq;
>  
>  /*
>   * Locking primitives for read and write IO paths to ensure we consistently use
> @@ -228,6 +229,45 @@ xfs_file_fsync(
>  	return error;
>  }
>  
> +struct xfs_afsync_args {
> +	struct work_struct	work;
> +	struct kiocb		*iocb;
> +	struct file		*file;
> +	int			datasync;
> +};
> +
> +STATIC void
> +xfs_file_aio_fsync_work(
> +	struct work_struct	*work)
> +{
> +	struct xfs_afsync_args	*args = container_of(work,
> +						struct xfs_afsync_args, work);
> +	int			error;
> +
> +	error = xfs_file_fsync(args->file, 0, -1LL, args->datasync);
> +	aio_complete(args->iocb, error, 0);
> +	kmem_free(args);
> +}
> +
> +STATIC int
> +xfs_file_aio_fsync(
> +	struct kiocb		*iocb,
> +	int			datasync)
> +{
> +	struct xfs_afsync_args	*args;
> +
> +	args = kmem_zalloc(sizeof(struct xfs_afsync_args), KM_SLEEP|KM_MAYFAIL);
> +	if (!args)
> +		return -ENOMEM;
> +
> +	INIT_WORK(&args->work, xfs_file_aio_fsync_work);
> +	args->iocb = iocb;
> +	args->file = iocb->ki_filp;
> +	args->datasync = datasync;
> +	queue_work(xfs_aio_fsync_wq, &args->work);
> +	return -EIOCBQUEUED;
> +}
> +
>  STATIC ssize_t
>  xfs_file_aio_read(
>  	struct kiocb		*iocb,
> @@ -1475,6 +1515,7 @@ const struct file_operations xfs_file_operations = {
>  	.open		= xfs_file_open,
>  	.release	= xfs_file_release,
>  	.fsync		= xfs_file_fsync,
> +	.aio_fsync	= xfs_file_aio_fsync,
>  	.fallocate	= xfs_file_fallocate,
>  };
>  
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 7295a0b..dfcf37b 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -390,6 +390,8 @@ extern int	xfs_dev_is_read_only(struct xfs_mount *, char *);
>  
>  extern void	xfs_set_low_space_thresholds(struct xfs_mount *);
>  
> +extern struct workqueue_struct *xfs_aio_fsync_wq;
> +
>  #endif	/* __KERNEL__ */
>  
>  #endif	/* __XFS_MOUNT_H__ */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index f2e5f8a..86d4923 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1718,12 +1718,21 @@ xfs_init_workqueues(void)
>  	if (!xfs_alloc_wq)
>  		return -ENOMEM;
>  
> +	xfs_aio_fsync_wq = alloc_workqueue("xfsfsync", 0, 0);
> +	if (!xfs_aio_fsync_wq)
> +		goto destroy_alloc_wq;
> +
>  	return 0;
> +
> +destroy_alloc_wq:
> +	destroy_workqueue(xfs_alloc_wq);
> +	return -ENOMEM;
>  }
>  
>  STATIC void
>  xfs_destroy_workqueues(void)
>  {
> +	destroy_workqueue(xfs_aio_fsync_wq);
>  	destroy_workqueue(xfs_alloc_wq);
>  }
>  
> -- 
> 2.0.0
> 

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-12 14:13 ` [PATCH] [RFC] xfs: wire up aio_fsync method Christoph Hellwig
@ 2014-06-12 23:44   ` Dave Chinner
  2014-06-13 16:23     ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2014-06-12 23:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> There doesn't really seem anything XFS specific here, so instead
> of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
> based on the workqueue and ->fsync.

I really don't know whether the other ->fsync methods in other
filesystems can stand alone like that. I also don't have the
time to test that it works properly on all filesystems right now.

Also, doing this implementation in fs/aio.c would mean we can't
optimise it to reduce things like log forces by splitting up the
work of concurrent fsyncs into a single log force of the highest
LSN of the batch of fsyncs being run. We also want to be able to do
"background fsync" where latency doesn't matter and we only want to
trickle them out rather than issue them as fast as we possibly can.
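
To illustrate the direction (this is a sketch of the idea only, not
working code - the request list, its locking, and where each
request's LSN is sampled from are all hand-waved, and data writeback
and error propagation are ignored):

struct xfs_fsync_batch {
	spinlock_t		lock;
	struct list_head	list;
	struct work_struct	work;
	struct xfs_mount	*mp;
};

struct xfs_fsync_req {
	struct list_head	list;
	struct kiocb		*iocb;
	xfs_lsn_t		lsn;	/* e.g. ili_last_lsn at queue time */
};

STATIC void
xfs_fsync_batch_work(
	struct work_struct	*work)
{
	struct xfs_fsync_batch	*batch = container_of(work,
					struct xfs_fsync_batch, work);
	struct xfs_fsync_req	*req, *next;
	LIST_HEAD(drain);
	xfs_lsn_t		max_lsn = 0;

	spin_lock(&batch->lock);
	list_splice_init(&batch->list, &drain);
	spin_unlock(&batch->lock);

	/* one log force to the highest LSN covers the whole batch */
	list_for_each_entry(req, &drain, list)
		if (XFS_LSN_CMP(req->lsn, max_lsn) > 0)
			max_lsn = req->lsn;
	if (max_lsn)
		_xfs_log_force_lsn(batch->mp, max_lsn, XFS_LOG_SYNC, NULL);

	list_for_each_entry_safe(req, next, &drain, list) {
		aio_complete(req->iocb, 0, 0);
		kmem_free(req);
	}
}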

So I really don't see this as the infrastructure solution that
everyone uses. It could be made a generic method with the filesystem
passing the workqueue to use to generic_aio_fsync(), but for XFS I
see it turning into something much more complex and optimised...

Like I said, this is just a brute force proof of concept.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-12 15:24 ` Brian Foster
@ 2014-06-12 23:57   ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-12 23:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Jun 12, 2014 at 11:24:58AM -0400, Brian Foster wrote:
> On Thu, Jun 12, 2014 at 06:34:07PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We've had plenty of requests for an asynchronous fsync over the past
> > few years, and we've got the infrastructure there to do it. But
> > nobody has wired it up to test it. The common request we get from
> > userspace storage applications is to do a post-write pass over a set
> > of files that were just written (i.e. bulk background fsync) for
> > point-in-time checkpointing or flushing purposes.
> > 
> > So, just to see if I could brute force an effective implementation,
> > wire up aio_fsync, add a workqueue and push all the fsync calls off
> > to the workqueue. The workqueue will allow parallel dispatch, switch
> > execution if a fsync blocks for any reason, etc. Brute force and
> > very effective....
....
> That looks great. This is something that could be quite beneficial to
> glusterfs, as a real world example. The replication mechanism does an
> xattr dance across servers and required addition of fsync's into the
> algorithm to ensure correctness in the case of failures. This had a
> notable impact on performance.

*nod*

> We thought a bit about hooking up aio_fsync(), but more along the lines
> of waiting for the log to force rather than forcing it explicitly, but
> didn't really go anywhere with it. I didn't consider we'd get such a
> benefit from simply dropping it into a workqueue. :)

Neither did I. I was expecting a small performance increase, not a
4-5x improvement.  I'm quite surprised that the log sustained that
level of force operations, too, but it's quite inefficient because
we don't need to do all those log forces and they are causing lock
contention at higher thread counts.

> I do like Christoph's idea... perhaps create a generic_file_aio_fsync()
> or some such?

Perhaps, but XFS won't use it because I really want to optimise away
the majority of the redundant log forces before pushing this
further. This patch is a proof of concept for XFS, not a generic
solution for all filesystems.  That's not to say it can't be made a
generic solution, but that's a different story altogether, and
something that can happen after I've got XFS sorted out...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-12 23:44   ` Dave Chinner
@ 2014-06-13 16:23     ` Christoph Hellwig
  2014-06-15 22:33         ` Dave Chinner
       [not found]       ` <20140613162352.GB23394-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 2 replies; 39+ messages in thread
From: Christoph Hellwig @ 2014-06-13 16:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-man, xfs

On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
> On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> > There doesn't really seem anything XFS specific here, so instead
> > of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
> > based on the workqueue and ->fsync.
> 
> I really don't know whether the other ->fsync methods in other
> filesystems can stand alone like that. I also don't have the
> time to test that it works properly on all filesystems right now.

Of course they can, as shown by the various calls to vfs_fsync_range,
which is nothing but a small wrapper around ->fsync.  I'm pretty sure
if you Cc linux-fsdevel you'll find plenty of testers.  -fsdevel and
-man should get a Cc anyway when implementing an ABI that had its
constants defined but was never implemented properly.

Talking about documentation:  the kernel aio manpages (io_*.2) seem
to not be very useful, mostly because they don't explain how to set
up the iocbs.  Michael, any idea how to get started on improving
this?

> Also, doing this implementation in fs/aio.c would mean we can't
> optimise it to reduce things like log forces by splitting up the
> work of concurrent fsyncs into a single log force of the highest
> LSN of the batch of fsyncs being run. We also want to be able to do
> "background fsync" where latency doesn't matter and we only want to
> trickle them out rather than issue them as fast as we possibly can.

It didn't really sound like you were aiming for that.  But in that
case the current implementation is still useful as a
generic_file_aio_fsync as suggested by Brian.

> So I really don't see this as the infrastructure solution that
> everyone uses. It could be made a generic method with the filesystem
> passing the workqueue to use to generic_aio_fsync(), but for XFS I
> see it turning into something much more complex and optimised...

Why not have a common workqueue?  In fact we already have a common
workqueue to call ->fsync from aio code to implement aio O_SYNC anyway.

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-13 16:23     ` Christoph Hellwig
@ 2014-06-15 22:33         ` Dave Chinner
       [not found]       ` <20140613162352.GB23394-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-15 22:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, linux-fsdevel, linux-man

On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
> On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
> > On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> > > There doesn't really seem anything XFS specific here, so instead
> > > of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
> > > based on the workqueue and ->fsync.
> > 
> > I really don't know whether the other ->fsync methods in other
> > filesystems can stand alone like that. I also don't have the
> > time to test that it works properly on all filesystems right now.
> 
> Of course they can, as shown by various calls to vfs_fsync_range that
> is nothing but a small wrapper around ->fsync.

Sure, but that's not getting 10,000 concurrent callers, is it? And
some fsync methods require journal credits, and others serialise
completely, and so on.

Besides, putting an *unbound, highly concurrent* aio queue into the
kernel for an operation that can serialise the entire filesystem
seems like a pretty nasty user-level DOS vector to me.

> I'm pretty sure if you
> Cc linux-fsdevel you'll find plenty of testers.  -fsdevel and -man
> should get a Cc anyway when implementing an ABI that had it's constants
> defines but never was implemented properly.

When the userspace ABIs are already fully documented and *just
work*, I'm not sure that there's any need for an ABI or man page
discussion...

> Talking about documentation:  The kernel aio manpages (io_*.2) seems
> to not really be very useful, mostly because they don't explain how
> to set up the iocbs.  Michael, any idea how to get started to improve
> this?

$ man 3 io_prep_fsync

or, perhaps you just need to use:

$ man 3 io_fsync

Which does all the prep and submission for you. Yup, I used those
man pages to write the fs_mark modifications....
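
For completeness, the pattern those man pages describe (and that the
fs_mark patch uses) boils down to something like this - a minimal,
illustrative sketch with error handling trimmed and the fd assumed to
be open already:

#include <string.h>
#include <libaio.h>

int async_fsync_one(int fd)
{
	io_context_t ctx;
	struct iocb iocb;
	struct io_event ev;
	int r;

	memset(&ctx, 0, sizeof(ctx));
	r = io_setup(1, &ctx);
	if (r)
		return r;

	/* io_fsync() preps the iocb as IO_CMD_FSYNC and submits it */
	r = io_fsync(ctx, &iocb, NULL, fd);
	if (r == 1)
		r = io_getevents(ctx, 1, 1, &ev, NULL);

	io_destroy(ctx);
	return r < 0 ? r : (int)ev.res;
}

fs_mark just does the io_setup() once per worker process and batches
the io_getevents() collection instead of reaping each fsync
immediately.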

> > Also, doing this implementation in fs/aio.c would mean we can't
> > optimise it to reduce things like log forces by splitting up the
> > work of concurrent fsyncs into a single log force of the highest
> > LSN of the batch of fsyncs being run. We also want to be able to do
> > "background fsync" where latency doesn't matter and we only want to
> > trickle them out rather than issue them as fast as we possibly can.
> 
> It didn't really sound like you were aiming for that.  But in that
> case the current implementation is still useful as a
> generic_file_aio_fsync as suggested by Brian.

It's an RFC - I'm not proposing it as is, but merely posting it to
see what people think about the approach and where to take it from
there. This is not production-ready code, nor is it something we
should implement generically as it stands because of things like the
DOS potential it has.

> > So I really don't see this as the infrastructure solution that
> > everyone uses. It could be made a generic method with the filesystem
> > passing the workqueue to use to generic_aio_fsync(), but for XFS I
> > see it turning into something much more complex and optimised...
> 
> Why not have a common workqueue?  In fact we already have a common
> workqueue to call ->fsync from aio code to implement aio O_SYNC anyway.

Maybe a generic solution will emerge eventually, but this isn't an
RFC for a generic solution to the aio_fsync() hook...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
@ 2014-06-16  2:00           ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-16  2:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, linux-man, axboe, xfs

On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
> > On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> > > > There doesn't really seem anything XFS specific here, so instead
> > > > of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
> > > > based on the workqueue and ->fsync.
> > > 
> > > I really don't know whether the other ->fsync methods in other
> > > filesystems can stand alone like that. I also don't have the
> > > time to test that it works properly on all filesystems right now.
> > 
> > Of course they can, as shown by various calls to vfs_fsync_range that
> > is nothing but a small wrapper around ->fsync.
> 
> Sure, but that's not getting 10,000 concurrent callers, is it? And
> some fsync methods require journal credits, and others serialise
> completely, and so on.
> 
> Besides, putting an *unbound, highly concurrent* aio queue into the
> kernel for an operation that can serialise the entire filesystem
> seems like a pretty nasty user-level DOS vector to me.

FWIW, the non-linear system CPU overhead of a fs_mark test I've been
running isn't anything related to XFS.  The async fsync workqueue
results in several thousand worker threads dispatching IO
concurrently across 16 CPUs:

$ ps -ef |grep kworker |wc -l
4693
$

Profiles from 3.15 + xfs for-next + xfs aio_fsync show:

-  51.33%  [kernel]            [k] percpu_ida_alloc
   - percpu_ida_alloc
      + 85.73% blk_mq_wait_for_tags
      + 14.23% blk_mq_get_tag
-  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 66.26% virtio_queue_rq
         - __blk_mq_run_hw_queue
            - 99.65% blk_mq_run_hw_queue
               + 99.47% blk_mq_insert_requests
               + 0.53% blk_mq_insert_request
.....
-   7.91%  [kernel]            [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 69.59% __schedule
         - 86.49% schedule
            + 47.72% percpu_ida_alloc
            + 21.75% worker_thread
            + 19.12% schedule_timeout
....
      + 18.06% blk_mq_make_request

Runtime:

real    4m1.243s
user    0m47.724s
sys     11m56.724s

Most of the excessive CPU usage is coming from the blk-mq layer, and
XFS is barely showing up in the profiles at all - the IDA tag
allocator is burning 8 CPUs at about 60,000 write IOPS....

I know that the tag allocator has been rewritten, so I tested
against a current Linus kernel with the XFS aio-fsync patch. The
results are all over the place - from several sequential runs of
the same test (removing the files in between so each test starts
from an empty fs):

Wall time	sys time	IOPS	 files/s
4m58.151s	11m12.648s	30,000	 13,500
4m35.075s	12m45.900s	45,000	 15,000
3m10.665s	11m15.804s	65,000	 21,000
3m27.384s	11m54.723s	85,000	 20,000
3m59.574s	11m12.012s	50,000	 16,500
4m12.704s	12m15.720s	50,000	 17,000

The 3.15 based kernel was pretty consistent around the 4m10 mark,
generally only +/-10s in runtime and not much change in system time.
The files/s rate reported by fs_mark doesn't vary that much, either.
So the new tag allocator seems to be no better in terms of IO
dispatch scalability, yet adds significant variability to IO
performance.

What I noticed is a massive jump in context switch overhead: from
around 250,000/s to over 800,000/s and the CPU profiles show that
this comes from the new tag allocator:

-  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 58.22% prepare_to_wait
           100.00% bt_get
              blk_mq_get_tag
              __blk_mq_alloc_request
              blk_mq_map_request
              blk_sq_make_request
              generic_make_request
      - 22.51% virtio_queue_rq
           __blk_mq_run_hw_queue
....
-  21.56%  [kernel]  [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 58.73% __schedule
         - 53.42% io_schedule
              99.88% bt_get
                 blk_mq_get_tag
                 __blk_mq_alloc_request
                 blk_mq_map_request
                 blk_sq_make_request
                 generic_make_request
         - 35.58% schedule
            + 49.31% worker_thread
            + 32.45% schedule_timeout
            + 10.35% _xfs_log_force_lsn
            + 3.10% xlog_cil_force_lsn
....

The new block-mq tag allocator is hammering the waitqueues and
that's generating a large amount of lock contention. It looks like
the new allocator replaces CPU burn doing work in the IDA allocator
with the same amount of CPU burn from extra context switch
overhead....

Oh, OH. Now I understand!

# echo 4 > /sys/block/vdc/queue/nr_requests

<run workload>

80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 98.49% prepare_to_wait
           bt_get
           blk_mq_get_tag
           __blk_mq_alloc_request
           blk_mq_map_request
           blk_sq_make_request
           generic_make_request
         + submit_bio
      + 1.07% finish_wait
+  13.63%  [kernel]  [k] _raw_spin_unlock_irq
...

It's context switch bound at 800,000 context switches/s, burning all
16 CPUs waking up and going to sleep and doing very little real
work. How little real work? About 3000 IOPS for 2MB/s of IO. That
amount of IO should only take a single digit CPU percentage of one
CPU.

This seems like bad behaviour to have on a congested block device,
even a high performance one....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-16  2:00           ` Dave Chinner
@ 2014-06-16  2:58             ` Jens Axboe
  -1 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2014-06-16  2:58 UTC (permalink / raw)
  To: Dave Chinner, Christoph Hellwig; +Cc: linux-fsdevel, linux-man, xfs

On 2014-06-15 20:00, Dave Chinner wrote:
> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
>>> On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
>>>> On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
>>>>> There doesn't really seem anything XFS specific here, so instead
>>>>> of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
>>>>> based on the workqueue and ->fsync.
>>>>
>>>> I really don't know whether the other ->fsync methods in other
>>>> filesystems can stand alone like that. I also don't have the
>>>> time to test that it works properly on all filesystems right now.
>>>
>>> Of course they can, as shown by various calls to vfs_fsync_range that
>>> is nothing but a small wrapper around ->fsync.
>>
>> Sure, but that's not getting 10,000 concurrent callers, is it? And
>> some fsync methods require journal credits, and others serialise
>> completely, and so on.
>>
>> Besides, putting an *unbound, highly concurrent* aio queue into the
>> kernel for an operation that can serialise the entire filesystem
>> seems like a pretty nasty user-level DOS vector to me.
>
> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> running isn't anything related to XFS.  The async fsync workqueue
> results in several thousand worker threads dispatching IO
> concurrently across 16 CPUs:
>
> $ ps -ef |grep kworker |wc -l
> 4693
> $
>
> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
>
> -  51.33%  [kernel]            [k] percpu_ida_alloc
>     - percpu_ida_alloc
>        + 85.73% blk_mq_wait_for_tags
>        + 14.23% blk_mq_get_tag
> -  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 66.26% virtio_queue_rq
>           - __blk_mq_run_hw_queue
>              - 99.65% blk_mq_run_hw_queue
>                 + 99.47% blk_mq_insert_requests
>                 + 0.53% blk_mq_insert_request
> .....
> -   7.91%  [kernel]            [k] _raw_spin_unlock_irq
>     - _raw_spin_unlock_irq
>        - 69.59% __schedule
>           - 86.49% schedule
>              + 47.72% percpu_ida_alloc
>              + 21.75% worker_thread
>              + 19.12% schedule_timeout
> ....
>        + 18.06% blk_mq_make_request
>
> Runtime:
>
> real    4m1.243s
> user    0m47.724s
> sys     11m56.724s
>
> Most of the excessive CPU usage is coming from the blk-mq layer, and
> XFS is barely showing up in the profiles at all - the IDA tag
> allocator is burning 8 CPUs at about 60,000 write IOPS....
>
> I know that the tag allocator has been rewritten, so I tested
> against a current a current Linus kernel with the XFS aio-fsync
> patch. The results are all over the place - from several sequential
> runs of the same test (removing the files in between so each tests
> starts from an empty fs):
>
> Wall time	sys time	IOPS	 files/s
> 4m58.151s	11m12.648s	30,000	 13,500
> 4m35.075s	12m45.900s	45,000	 15,000
> 3m10.665s	11m15.804s	65,000	 21,000
> 3m27.384s	11m54.723s	85,000	 20,000
> 3m59.574s	11m12.012s	50,000	 16,500
> 4m12.704s	12m15.720s	50,000	 17,000
>
> The 3.15 based kernel was pretty consistent around the 4m10 mark,
> generally only +/-10s in runtime and not much change in system time.
> The files/s rate reported by fs_mark doesn't vary that much, either.
> So the new tag allocator seems to be no better in terms of IO
> dispatch scalability, yet adds significant variability to IO
> performance.
>
> What I noticed is a massive jump in context switch overhead: from
> around 250,000/s to over 800,000/s and the CPU profiles show that
> this comes from the new tag allocator:
>
> -  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 58.22% prepare_to_wait
>             100.00% bt_get
>                blk_mq_get_tag
>                __blk_mq_alloc_request
>                blk_mq_map_request
>                blk_sq_make_request
>                generic_make_request
>        - 22.51% virtio_queue_rq
>             __blk_mq_run_hw_queue
> ....
> -  21.56%  [kernel]  [k] _raw_spin_unlock_irq
>     - _raw_spin_unlock_irq
>        - 58.73% __schedule
>           - 53.42% io_schedule
>                99.88% bt_get
>                   blk_mq_get_tag
>                   __blk_mq_alloc_request
>                   blk_mq_map_request
>                   blk_sq_make_request
>                   generic_make_request
>           - 35.58% schedule
>              + 49.31% worker_thread
>              + 32.45% schedule_timeout
>              + 10.35% _xfs_log_force_lsn
>              + 3.10% xlog_cil_force_lsn
> ....
>
> The new block-mq tag allocator is hammering the waitqueues and
> that's generating a large amount of lock contention. It looks like
> the new allocator replaces CPU burn doing work in the IDA allocator
> with the same amount of CPU burn from extra context switch
> overhead....
>
> Oh, OH. Now I understand!
>
> # echo 4 > /sys/block/vdc/queue/nr_requests
>
> <run workload>
>
> 80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 98.49% prepare_to_wait
>             bt_get
>             blk_mq_get_tag
>             __blk_mq_alloc_request
>             blk_mq_map_request
>             blk_sq_make_request
>             generic_make_request
>           + submit_bio
>        + 1.07% finish_wait
> +  13.63%  [kernel]  [k] _raw_spin_unlock_irq
> ...
>
> It's context switch bound at 800,000 context switches/s, burning all
> 16 CPUs waking up and going to sleep and doing very little real
> work. How little real work? About 3000 IOPS for 2MB/s of IO. That
> amount of IO should only take a single digit CPU percentage of one
> CPU.

With thousands of threads? I think not. Sanely submitted 3000 IOPS, 
correct, I would agree with you.

> This seems like bad behaviour to have on a congested block device,
> even a high performance one....

That is pretty much the suck. How do I reproduce this (e.g. what are
you running, and what are the xfs aio fsync patches)? Even if
dispatching thousands of threads to do IO is a bad idea (it very much
is), gracefully handling it is a must. I haven't seen any bad behavior from the 
new allocator, it seems to be well behaved (for most normal cases, 
anyway). I'd like to take a stab at ensuring this works, too.

If you tell me exactly what you are running, I'll reproduce and get this 
fixed up tomorrow.

-- 
Jens Axboe

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-16  2:58             ` Jens Axboe
@ 2014-06-16  7:19                 ` Dave Chinner
  -1 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-16  7:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ

On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
> On 2014-06-15 20:00, Dave Chinner wrote:
> >On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> >>On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
> >>>On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
> >>>>On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
> >>>>>There doesn't really seem anything XFS specific here, so instead
> >>>>>of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
> >>>>>based on the workqueue and ->fsync.
> >>>>
> >>>>I really don't know whether the other ->fsync methods in other
> >>>>filesystems can stand alone like that. I also don't have the
> >>>>time to test that it works properly on all filesystems right now.
> >>>
> >>>Of course they can, as shown by various calls to vfs_fsync_range that
> >>>is nothing but a small wrapper around ->fsync.
> >>
> >>Sure, but that's not getting 10,000 concurrent callers, is it? And
> >>some fsync methods require journal credits, and others serialise
> >>completely, and so on.
> >>
> >>Besides, putting an *unbound, highly concurrent* aio queue into the
> >>kernel for an operation that can serialise the entire filesystem
> >>seems like a pretty nasty user-level DOS vector to me.
> >
> >FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> >running isn't anything related to XFS.  The async fsync workqueue
> >results in several thousand worker threads dispatching IO
> >concurrently across 16 CPUs:
> >
> >$ ps -ef |grep kworker |wc -l
> >4693
> >$
> >
> >Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
> >
> >-  51.33%  [kernel]            [k] percpu_ida_alloc
> >    - percpu_ida_alloc
> >       + 85.73% blk_mq_wait_for_tags
> >       + 14.23% blk_mq_get_tag
> >-  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
> >    - _raw_spin_unlock_irqrestore
> >       - 66.26% virtio_queue_rq
> >          - __blk_mq_run_hw_queue
> >             - 99.65% blk_mq_run_hw_queue
> >                + 99.47% blk_mq_insert_requests
> >                + 0.53% blk_mq_insert_request
> >.....
> >-   7.91%  [kernel]            [k] _raw_spin_unlock_irq
> >    - _raw_spin_unlock_irq
> >       - 69.59% __schedule
> >          - 86.49% schedule
> >             + 47.72% percpu_ida_alloc
> >             + 21.75% worker_thread
> >             + 19.12% schedule_timeout
> >....
> >       + 18.06% blk_mq_make_request
> >
> >Runtime:
> >
> >real    4m1.243s
> >user    0m47.724s
> >sys     11m56.724s
> >
> >Most of the excessive CPU usage is coming from the blk-mq layer, and
> >XFS is barely showing up in the profiles at all - the IDA tag
> >allocator is burning 8 CPUs at about 60,000 write IOPS....
> >
> >I know that the tag allocator has been rewritten, so I tested
> >against a current Linus kernel with the XFS aio-fsync
> >patch. The results are all over the place - from several sequential
> >runs of the same test (removing the files in between so each test
> >starts from an empty fs):
> >
> >Wall time	sys time	IOPS	 files/s
> >4m58.151s	11m12.648s	30,000	 13,500
> >4m35.075s	12m45.900s	45,000	 15,000
> >3m10.665s	11m15.804s	65,000	 21,000
> >3m27.384s	11m54.723s	85,000	 20,000
> >3m59.574s	11m12.012s	50,000	 16,500
> >4m12.704s	12m15.720s	50,000	 17,000
> >
> >The 3.15 based kernel was pretty consistent around the 4m10 mark,
> >generally only +/-10s in runtime and not much change in system time.
> >The files/s rate reported by fs_mark doesn't vary that much, either.
> >So the new tag allocator seems to be no better in terms of IO
> >dispatch scalability, yet adds significant variability to IO
> >performance.
> >
> >What I noticed is a massive jump in context switch overhead: from
> >around 250,000/s to over 800,000/s and the CPU profiles show that
> >this comes from the new tag allocator:
> >
> >-  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >    - _raw_spin_unlock_irqrestore
> >       - 58.22% prepare_to_wait
> >            100.00% bt_get
> >               blk_mq_get_tag
> >               __blk_mq_alloc_request
> >               blk_mq_map_request
> >               blk_sq_make_request
> >               generic_make_request
> >       - 22.51% virtio_queue_rq
> >            __blk_mq_run_hw_queue
> >....
> >-  21.56%  [kernel]  [k] _raw_spin_unlock_irq
> >    - _raw_spin_unlock_irq
> >       - 58.73% __schedule
> >          - 53.42% io_schedule
> >               99.88% bt_get
> >                  blk_mq_get_tag
> >                  __blk_mq_alloc_request
> >                  blk_mq_map_request
> >                  blk_sq_make_request
> >                  generic_make_request
> >          - 35.58% schedule
> >             + 49.31% worker_thread
> >             + 32.45% schedule_timeout
> >             + 10.35% _xfs_log_force_lsn
> >             + 3.10% xlog_cil_force_lsn
> >....
> >
> >The new block-mq tag allocator is hammering the waitqueues and
> >that's generating a large amount of lock contention. It looks like
> >the new allocator replaces CPU burn doing work in the IDA allocator
> >with the same amount of CPU burn from extra context switch
> >overhead....
> >
> >Oh, OH. Now I understand!
> >
> ># echo 4 > /sys/block/vdc/queue/nr_requests
> >
> ><run workload>
> >
> >80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >    - _raw_spin_unlock_irqrestore
> >       - 98.49% prepare_to_wait
> >            bt_get
> >            blk_mq_get_tag
> >            __blk_mq_alloc_request
> >            blk_mq_map_request
> >            blk_sq_make_request
> >            generic_make_request
> >          + submit_bio
> >       + 1.07% finish_wait
> >+  13.63%  [kernel]  [k] _raw_spin_unlock_irq
> >...
> >
> >It's context switch bound at 800,000 context switches/s, burning all
> >16 CPUs waking up and going to sleep and doing very little real
> >work. How little real work? About 3000 IOPS for 2MB/s of IO. That
> >amount of IO should only take a single digit CPU percentage of one
> >CPU.
> 
> With thousands of threads? I think not. Sanely submitted 3000 IOPS,
> correct, I would agree with you.
> 
> >This seems like bad behaviour to have on a congested block device,
> >even a high performance one....
> 
> That is pretty much the suck. How do I reproduce this (eg what are
> you running, and what are the xfs aio fsync patches)? Even if

http://oss.sgi.com/pipermail/xfs/2014-June/036773.html

> dispatching thousands of threads to do IO is a bad idea (it very
> much is), handling it gracefully is a must. I haven't seen any bad 
> behavior from the new allocator, it seems to be well behaved (for
> most normal cases, anyway). I'd like to take a stab at ensuring this
> works, too.
> 
> If you tell me exactly what you are running, I'll reproduce and get
> this fixed up tomorrow.

Test case - take fs_mark:

	git://oss.sgi.com/dgc/fs_mark

Apply the patch for aio fsync support:

http://oss.sgi.com/pipermail/xfs/2014-June/036774.html

Run this test:

$ time ./fs_mark  -D  10000  -S5 -n  50000  -s  4096  -L  5 -A \
-d /mnt/scratch/0 -d  /mnt/scratch/1 -d  /mnt/scratch/2 \
-d /mnt/scratch/3 -d  /mnt/scratch/4 -d  /mnt/scratch/5 \
-d /mnt/scratch/6 -d  /mnt/scratch/7 -d  /mnt/scratch/8 \
-d /mnt/scratch/9 -d  /mnt/scratch/10 -d  /mnt/scratch/11 \
-d /mnt/scratch/12 -d  /mnt/scratch/13 -d  /mnt/scratch/14 \
-d /mnt/scratch/15

Drop the "-A" if you want to use normal fsync (but then you won't
see the problem).

Use an XFS filesystem that has at least 32 AGs (I'm using
a sparse 500TB fs image on a virtio device). I'm also using mkfs
options of "-m crc=1,finobt=1", but to use that last one you'll need
a mkfs built from the xfsprogs git tree. It shouldn't make any
difference to the result, though. I'm running on a VM with 16 CPUs
and 16GB RAM, using fakenuma=4.
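
For reference, a setup along those lines looks roughly like this (the
image path and the way the sparse file is created are illustrative,
not the exact commands I used; /dev/vdc and /mnt/scratch match the
sysfs path and fs_mark directories above):

# On the host: sparse 500TB file backing the guest's virtio disk
truncate -s 500T /vm-images/scratch-500t.img

# In the guest: CRC-enabled XFS with the free inode btree, plus the
# 16 scratch directories that fs_mark writes into
mkfs.xfs -f -m crc=1,finobt=1 /dev/vdc
mount /dev/vdc /mnt/scratch
mkdir -p /mnt/scratch/{0..15}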

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-16  7:19                 ` Dave Chinner
@ 2014-06-16 19:30                   ` Jens Axboe
  -1 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2014-06-16 19:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ

[-- Attachment #1: Type: text/plain, Size: 8031 bytes --]

On 06/16/2014 01:19 AM, Dave Chinner wrote:
> On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
>> On 2014-06-15 20:00, Dave Chinner wrote:
>>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>>>> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
>>>>> On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
>>>>>> On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
>>>>>>> There doesn't really seem anything XFS specific here, so instead
>>>>>>> of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
>>>>>>> based on the workqueue and ->fsync.
>>>>>>
>>>>>> I really don't know whether the other ->fsync methods in other
>>>>>> filesystems can stand alone like that. I also don't have the
>>>>>> time to test that it works properly on all filesystems right now.
>>>>>
>>>>> Of course they can, as shown by various calls to vfs_fsync_range that
>>>>> is nothing but a small wrapper around ->fsync.
>>>>
>>>> Sure, but that's not getting 10,000 concurrent callers, is it? And
>>>> some fsync methods require journal credits, and others serialise
>>>> completely, and so on.
>>>>
>>>> Besides, putting an *unbound, highly concurrent* aio queue into the
>>>> kernel for an operation that can serialise the entire filesystem
>>>> seems like a pretty nasty user-level DOS vector to me.
>>>
>>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
>>> running isn't anything related to XFS.  The async fsync workqueue
>>> results in several thousand worker threads dispatching IO
>>> concurrently across 16 CPUs:
>>>
>>> $ ps -ef |grep kworker |wc -l
>>> 4693
>>> $
>>>
>>> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
>>>
>>> -  51.33%  [kernel]            [k] percpu_ida_alloc
>>>    - percpu_ida_alloc
>>>       + 85.73% blk_mq_wait_for_tags
>>>       + 14.23% blk_mq_get_tag
>>> -  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
>>>    - _raw_spin_unlock_irqrestore
>>>       - 66.26% virtio_queue_rq
>>>          - __blk_mq_run_hw_queue
>>>             - 99.65% blk_mq_run_hw_queue
>>>                + 99.47% blk_mq_insert_requests
>>>                + 0.53% blk_mq_insert_request
>>> .....
>>> -   7.91%  [kernel]            [k] _raw_spin_unlock_irq
>>>    - _raw_spin_unlock_irq
>>>       - 69.59% __schedule
>>>          - 86.49% schedule
>>>             + 47.72% percpu_ida_alloc
>>>             + 21.75% worker_thread
>>>             + 19.12% schedule_timeout
>>> ....
>>>       + 18.06% blk_mq_make_request
>>>
>>> Runtime:
>>>
>>> real    4m1.243s
>>> user    0m47.724s
>>> sys     11m56.724s
>>>
>>> Most of the excessive CPU usage is coming from the blk-mq layer, and
>>> XFS is barely showing up in the profiles at all - the IDA tag
>>> allocator is burning 8 CPUs at about 60,000 write IOPS....
>>>
>>> I know that the tag allocator has been rewritten, so I tested
>>> against a current Linus kernel with the XFS aio-fsync
>>> patch. The results are all over the place - from several sequential
>>> runs of the same test (removing the files in between so each test
>>> starts from an empty fs):
>>>
>>> Wall time	sys time	IOPS	 files/s
>>> 4m58.151s	11m12.648s	30,000	 13,500
>>> 4m35.075s	12m45.900s	45,000	 15,000
>>> 3m10.665s	11m15.804s	65,000	 21,000
>>> 3m27.384s	11m54.723s	85,000	 20,000
>>> 3m59.574s	11m12.012s	50,000	 16,500
>>> 4m12.704s	12m15.720s	50,000	 17,000
>>>
>>> The 3.15 based kernel was pretty consistent around the 4m10 mark,
>>> generally only +/-10s in runtime and not much change in system time.
>>> The files/s rate reported by fs_mark doesn't vary that much, either.
>>> So the new tag allocator seems to be no better in terms of IO
>>> dispatch scalability, yet adds significant variability to IO
>>> performance.
>>>
>>> What I noticed is a massive jump in context switch overhead: from
>>> around 250,000/s to over 800,000/s and the CPU profiles show that
>>> this comes from the new tag allocator:
>>>
>>> -  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
>>>    - _raw_spin_unlock_irqrestore
>>>       - 58.22% prepare_to_wait
>>>            100.00% bt_get
>>>               blk_mq_get_tag
>>>               __blk_mq_alloc_request
>>>               blk_mq_map_request
>>>               blk_sq_make_request
>>>               generic_make_request
>>>       - 22.51% virtio_queue_rq
>>>            __blk_mq_run_hw_queue
>>> ....
>>> -  21.56%  [kernel]  [k] _raw_spin_unlock_irq
>>>    - _raw_spin_unlock_irq
>>>       - 58.73% __schedule
>>>          - 53.42% io_schedule
>>>               99.88% bt_get
>>>                  blk_mq_get_tag
>>>                  __blk_mq_alloc_request
>>>                  blk_mq_map_request
>>>                  blk_sq_make_request
>>>                  generic_make_request
>>>          - 35.58% schedule
>>>             + 49.31% worker_thread
>>>             + 32.45% schedule_timeout
>>>             + 10.35% _xfs_log_force_lsn
>>>             + 3.10% xlog_cil_force_lsn
>>> ....
>>>
>>> The new block-mq tag allocator is hammering the waitqueues and
>>> that's generating a large amount of lock contention. It looks like
>>> the new allocator replaces CPU burn doing work in the IDA allocator
>>> with the same amount of CPU burn from extra context switch
>>> overhead....
>>>
>>> Oh, OH. Now I understand!
>>>
>>> # echo 4 > /sys/block/vdc/queue/nr_requests
>>>
>>> <run workload>
>>>
>>> 80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
>>>    - _raw_spin_unlock_irqrestore
>>>       - 98.49% prepare_to_wait
>>>            bt_get
>>>            blk_mq_get_tag
>>>            __blk_mq_alloc_request
>>>            blk_mq_map_request
>>>            blk_sq_make_request
>>>            generic_make_request
>>>          + submit_bio
>>>       + 1.07% finish_wait
>>> +  13.63%  [kernel]  [k] _raw_spin_unlock_irq
>>> ...
>>>
>>> It's context switch bound at 800,000 context switches/s, burning all
>>> 16 CPUs waking up and going to sleep and doing very little real
>>> work. How little real work? About 3000 IOPS for 2MB/s of IO. That
>>> amount of IO should only take a single digit CPU percentage of one
>>> CPU.
>>
>> With thousands of threads? I think not. Sanely submitted 3000 IOPS,
>> correct, I would agree with you.
>>
>>> This seems like bad behaviour to have on a congested block device,
>>> even a high performance one....
>>
>> That is pretty much the suck. How do I reproduce this (eg what are
>> you running, and what are the xfs aio fsync patches)? Even if
> 
> http://oss.sgi.com/pipermail/xfs/2014-June/036773.html
> 
>> dispatching thousands of threads to do IO is a bad idea (it very
>> much is), handling it gracefully is a must. I haven't seen any bad
>> behavior from the new allocator, it seems to be well behaved (for
>> most normal cases, anyway). I'd like to take a stab at ensuring this
>> works, too.
>>
>> If you tell me exactly what you are running, I'll reproduce and get
>> this fixed up tomorrow.
> 
> Test case - take fs_mark:
> 
> 	git://oss.sgi.com/dgc/fs_mark
> 
> Apply the patch for aio fsync support:
> 
> http://oss.sgi.com/pipermail/xfs/2014-June/036774.html
> 
> Run this test:
> 
> $ time ./fs_mark  -D  10000  -S5 -n  50000  -s  4096  -L  5 -A \
> -d /mnt/scratch/0 -d  /mnt/scratch/1 -d  /mnt/scratch/2 \
> -d /mnt/scratch/3 -d  /mnt/scratch/4 -d  /mnt/scratch/5 \
> -d /mnt/scratch/6 -d  /mnt/scratch/7 -d  /mnt/scratch/8 \
> -d /mnt/scratch/9 -d  /mnt/scratch/10 -d  /mnt/scratch/11 \
> -d /mnt/scratch/12 -d  /mnt/scratch/13 -d  /mnt/scratch/14 \
> -d /mnt/scratch/15
> 
> Drop the "-A" if you want to use normal fsync (but then you won't
> see the problem).
> 
> Use an XFS filesystem that has at least 32 AGs (I'm using
> a sparse 500TB fs image on a virtio device). I'm also using mkfs
> options of "-m crc=1,finobt=1", but to use that last one you'll need
> a mkfs built from the xfsprogs git tree. It shouldn't make any
> difference to the result, though. I'm running on a VM with 16 CPUs
> and 16GB RAM, using fakenuma=4.

Can you try with this patch?

-- 
Jens Axboe


[-- Attachment #2: tag-batch-wakeup.patch --]
[-- Type: text/x-patch, Size: 751 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1aab39f..4f90f91 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -242,7 +242,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 		bool was_empty;
 
 		was_empty = list_empty(&wait.task_list);
-		prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait_exclusive(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);
 
 		tag = __bt_get(hctx, bt, last_tag);
 		if (tag != -1)
@@ -345,7 +345,7 @@ static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag)
 	if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
 		atomic_set(&bs->wait_cnt, bt->wake_cnt);
 		bt_index_inc(&bt->wake_index);
-		wake_up(&bs->wait);
+		wake_up_nr(&bs->wait, bt->wake_cnt);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-13 16:23     ` Christoph Hellwig
@ 2014-06-16 21:06           ` Michael Kerrisk (man-pages)
       [not found]       ` <20140613162352.GB23394-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-06-16 21:06 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, xfs-VZNHf3L845pBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-man-u79uwXL29TY76Z2rM5mHXA

Hi Christoph,

On 06/13/2014 06:23 PM, Christoph Hellwig wrote:

> Talking about documentation:  The kernel aio manpages (io_*.2) seem
> to not really be very useful, mostly because they don't explain how
> to set up the iocbs.  Michael, any idea how to get started to improve
> this?

Well, it just needs someone knowledgeable to fill the gaps, and
perhaps add an example program. I don't have that knowledge 
offhand myself...
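
For what it's worth, the level of detail that seems to be missing is
roughly the following -- a minimal, untested sketch of filling in and
submitting one iocb through the libaio wrappers (error handling
omitted, the file name is made up; build with "cc aio-write.c -laio"):

	#include <libaio.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		io_context_t ctx = 0;		/* must be zeroed before io_setup() */
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		static char buf[4096];
		int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

		io_setup(8, &ctx);		/* create an AIO context, 8 in flight max */

		memset(buf, 0xa5, sizeof(buf));
		io_prep_pwrite(&cb, fd, buf, sizeof(buf), 0);	/* describe the write */
		io_submit(ctx, 1, cbs);				/* queue it */
		io_getevents(ctx, 1, 1, &ev, NULL);		/* wait for completion */
		printf("write returned %ld\n", (long)ev.res);

		io_destroy(ctx);
		return 0;
	}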

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-16 19:30                   ` Jens Axboe
  (?)
@ 2014-06-16 22:27                   ` Dave Chinner
  2014-06-17 13:23                       ` Jens Axboe
  -1 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2014-06-16 22:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, xfs, linux-man

On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
> On 06/16/2014 01:19 AM, Dave Chinner wrote:
> > On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
> >> On 2014-06-15 20:00, Dave Chinner wrote:
> >>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> >>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> >>> running isn't anything related to XFS.  The async fsync workqueue
> >>> results in several thousand worker threads dispatching IO
> >>> concurrently across 16 CPUs:
> >>>
> >>> $ ps -ef |grep kworker |wc -l
> >>> 4693
> >>> $
> >>>
> >>> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
> >>>
> >>> -  51.33%  [kernel]            [k] percpu_ida_alloc
> >>>    - percpu_ida_alloc
> >>>       + 85.73% blk_mq_wait_for_tags
> >>>       + 14.23% blk_mq_get_tag
> >>> -  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
> >>>    - _raw_spin_unlock_irqrestore
> >>>       - 66.26% virtio_queue_rq
> >>>          - __blk_mq_run_hw_queue
> >>>             - 99.65% blk_mq_run_hw_queue
> >>>                + 99.47% blk_mq_insert_requests
> >>>                + 0.53% blk_mq_insert_request
> >>> .....
> >>> -   7.91%  [kernel]            [k] _raw_spin_unlock_irq
> >>>    - _raw_spin_unlock_irq
> >>>       - 69.59% __schedule
> >>>          - 86.49% schedule
> >>>             + 47.72% percpu_ida_alloc
> >>>             + 21.75% worker_thread
> >>>             + 19.12% schedule_timeout
> >>> ....
> >>>       + 18.06% blk_mq_make_request
> >>>
> >>> Runtime:
> >>>
> >>> real    4m1.243s
> >>> user    0m47.724s
> >>> sys     11m56.724s
> >>>
> >>> Most of the excessive CPU usage is coming from the blk-mq layer, and
> >>> XFS is barely showing up in the profiles at all - the IDA tag
> >>> allocator is burning 8 CPUs at about 60,000 write IOPS....
> >>>
> >>> I know that the tag allocator has been rewritten, so I tested
> >>> against a current Linus kernel with the XFS aio-fsync
> >>> patch. The results are all over the place - from several sequential
> >>> runs of the same test (removing the files in between so each test
> >>> starts from an empty fs):
> >>>
> >>> Wall time	sys time	IOPS	 files/s
> >>> 4m58.151s	11m12.648s	30,000	 13,500
> >>> 4m35.075s	12m45.900s	45,000	 15,000
> >>> 3m10.665s	11m15.804s	65,000	 21,000
> >>> 3m27.384s	11m54.723s	85,000	 20,000
> >>> 3m59.574s	11m12.012s	50,000	 16,500
> >>> 4m12.704s	12m15.720s	50,000	 17,000
> >>>
> >>> The 3.15 based kernel was pretty consistent around the 4m10 mark,
> >>> generally only +/-10s in runtime and not much change in system time.
> >>> The files/s rate reported by fs_mark doesn't vary that much, either.
> >>> So the new tag allocator seems to be no better in terms of IO
> >>> dispatch scalability, yet adds significant variability to IO
> >>> performance.
> >>>
> >>> What I noticed is a massive jump in context switch overhead: from
> >>> around 250,000/s to over 800,000/s and the CPU profiles show that
> >>> this comes from the new tag allocator:
> >>>
> >>> -  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >>>    - _raw_spin_unlock_irqrestore
> >>>       - 58.22% prepare_to_wait
> >>>            100.00% bt_get
> >>>               blk_mq_get_tag
> >>>               __blk_mq_alloc_request
> >>>               blk_mq_map_request
> >>>               blk_sq_make_request
> >>>               generic_make_request
> >>>       - 22.51% virtio_queue_rq
> >>>            __blk_mq_run_hw_queue
> >>> ....
> >>> -  21.56%  [kernel]  [k] _raw_spin_unlock_irq
> >>>    - _raw_spin_unlock_irq
> >>>       - 58.73% __schedule
> >>>          - 53.42% io_schedule
> >>>               99.88% bt_get
> >>>                  blk_mq_get_tag
> >>>                  __blk_mq_alloc_request
> >>>                  blk_mq_map_request
> >>>                  blk_sq_make_request
> >>>                  generic_make_request
> >>>          - 35.58% schedule
> >>>             + 49.31% worker_thread
> >>>             + 32.45% schedule_timeout
> >>>             + 10.35% _xfs_log_force_lsn
> >>>             + 3.10% xlog_cil_force_lsn
> >>> ....
.....
> Can you try with this patch?

Ok, context switches are back down in the realm of 400,000/s. It's
better, but it's still a bit higher than the 3.15 code. XFS is
actually showing up in the context switch path profiles now...

However, performance is still excitingly variable and not much
different to not having this patch applied. System time is unchanged
(still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
show significant variance (at least +/-25%) from run to run. The
worst case is not as slow as the unpatched kernel, but it's no
better than the 3.15 worst case.

Profiles on a slow run look like:

-  43.43%  [kernel]  [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 64.23% blk_sq_make_request
           generic_make_request
          submit_bio
      + 26.79% __schedule
...
-  15.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 39.81% virtio_queue_rq
           __blk_mq_run_hw_queue
      + 24.13% complete
      + 17.74% prepare_to_wait_exclusive
      + 9.66% remove_wait_queue

Looks like the main contention problem is in blk_sq_make_request().
Also, there looks to be quite a bit of lock contention on the tag
wait queues given that this patch made prepare_to_wait_exclusive()
suddenly show up in the profiles.

FWIW, on a fast run there is very little time in
blk_sq_make_request() lock contention, and overall spin lock/unlock
overhead of these two functions is around 10% each....

So, yes, the patch reduces context switches but doesn't really
reduce system time, improve performance noticeably or address the
run-to-run variability issue...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-16 22:27                   ` Dave Chinner
@ 2014-06-17 13:23                       ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2014-06-17 13:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-man, xfs

On 2014-06-16 16:27, Dave Chinner wrote:
> On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
>> On 06/16/2014 01:19 AM, Dave Chinner wrote:
>>> On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
>>>> On 2014-06-15 20:00, Dave Chinner wrote:
>>>>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>>>>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
>>>>> running isn't anything related to XFS.  The async fsync workqueue
>>>>> results in several thousand worker threads dispatching IO
>>>>> concurrently across 16 CPUs:
>>>>>
>>>>> $ ps -ef |grep kworker |wc -l
>>>>> 4693
>>>>> $
>>>>>
>>>>> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
>>>>>
>>>>> -  51.33%  [kernel]            [k] percpu_ida_alloc
>>>>>     - percpu_ida_alloc
>>>>>        + 85.73% blk_mq_wait_for_tags
>>>>>        + 14.23% blk_mq_get_tag
>>>>> -  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
>>>>>     - _raw_spin_unlock_irqrestore
>>>>>        - 66.26% virtio_queue_rq
>>>>>           - __blk_mq_run_hw_queue
>>>>>              - 99.65% blk_mq_run_hw_queue
>>>>>                 + 99.47% blk_mq_insert_requests
>>>>>                 + 0.53% blk_mq_insert_request
>>>>> .....
>>>>> -   7.91%  [kernel]            [k] _raw_spin_unlock_irq
>>>>>     - _raw_spin_unlock_irq
>>>>>        - 69.59% __schedule
>>>>>           - 86.49% schedule
>>>>>              + 47.72% percpu_ida_alloc
>>>>>              + 21.75% worker_thread
>>>>>              + 19.12% schedule_timeout
>>>>> ....
>>>>>        + 18.06% blk_mq_make_request
>>>>>
>>>>> Runtime:
>>>>>
>>>>> real    4m1.243s
>>>>> user    0m47.724s
>>>>> sys     11m56.724s
>>>>>
>>>>> Most of the excessive CPU usage is coming from the blk-mq layer, and
>>>>> XFS is barely showing up in the profiles at all - the IDA tag
>>>>> allocator is burning 8 CPUs at about 60,000 write IOPS....
>>>>>
>>>>> I know that the tag allocator has been rewritten, so I tested
>>>>> against a current Linus kernel with the XFS aio-fsync
>>>>> patch. The results are all over the place - from several sequential
>>>>> runs of the same test (removing the files in between so each test
>>>>> starts from an empty fs):
>>>>>
>>>>> Wall time	sys time	IOPS	 files/s
>>>>> 4m58.151s	11m12.648s	30,000	 13,500
>>>>> 4m35.075s	12m45.900s	45,000	 15,000
>>>>> 3m10.665s	11m15.804s	65,000	 21,000
>>>>> 3m27.384s	11m54.723s	85,000	 20,000
>>>>> 3m59.574s	11m12.012s	50,000	 16,500
>>>>> 4m12.704s	12m15.720s	50,000	 17,000
>>>>>
>>>>> The 3.15 based kernel was pretty consistent around the 4m10 mark,
>>>>> generally only +/-10s in runtime and not much change in system time.
>>>>> The files/s rate reported by fs_mark doesn't vary that much, either.
>>>>> So the new tag allocator seems to be no better in terms of IO
>>>>> dispatch scalability, yet adds significant variability to IO
>>>>> performance.
>>>>>
>>>>> What I noticed is a massive jump in context switch overhead: from
>>>>> around 250,000/s to over 800,000/s and the CPU profiles show that
>>>>> this comes from the new tag allocator:
>>>>>
>>>>> -  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
>>>>>     - _raw_spin_unlock_irqrestore
>>>>>        - 58.22% prepare_to_wait
>>>>>             100.00% bt_get
>>>>>                blk_mq_get_tag
>>>>>                __blk_mq_alloc_request
>>>>>                blk_mq_map_request
>>>>>                blk_sq_make_request
>>>>>                generic_make_request
>>>>>        - 22.51% virtio_queue_rq
>>>>>             __blk_mq_run_hw_queue
>>>>> ....
>>>>> -  21.56%  [kernel]  [k] _raw_spin_unlock_irq
>>>>>     - _raw_spin_unlock_irq
>>>>>        - 58.73% __schedule
>>>>>           - 53.42% io_schedule
>>>>>                99.88% bt_get
>>>>>                   blk_mq_get_tag
>>>>>                   __blk_mq_alloc_request
>>>>>                   blk_mq_map_request
>>>>>                   blk_sq_make_request
>>>>>                   generic_make_request
>>>>>           - 35.58% schedule
>>>>>              + 49.31% worker_thread
>>>>>              + 32.45% schedule_timeout
>>>>>              + 10.35% _xfs_log_force_lsn
>>>>>              + 3.10% xlog_cil_force_lsn
>>>>> ....
> .....
>> Can you try with this patch?
>
> Ok, context switches are back down in the realm of 400,000/s. It's
> better, but it's still a bit higher than the 3.15 code. XFS is
> actually showing up in the context switch path profiles now...
>
> However, performance is still excitingly variable and not much
> different to not having this patch applied. System time is unchanged
> (still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
> show significant variance (at least +/-25%) from run to run. The
> worst case is not as slow as the unpatched kernel, but it's no
> better than the 3.15 worst case.
>
> Profiles on a slow run look like:
>
> -  43.43%  [kernel]  [k] _raw_spin_unlock_irq
>     - _raw_spin_unlock_irq
>        - 64.23% blk_sq_make_request
>             generic_make_request
>            submit_bio
>        + 26.79% __schedule
> ...
> -  15.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 39.81% virtio_queue_rq
>             __blk_mq_run_hw_queue
>        + 24.13% complete
>        + 17.74% prepare_to_wait_exclusive
>        + 9.66% remove_wait_queue
>
> Looks like the main contention problem is in blk_sq_make_request().
> Also, there looks to be quite a bit of lock contention on the tag
> wait queues given that this patch made prepare_to_wait_exclusive()
> suddenly show up in the profiles.
>
> FWIW, on a fast run there is very little time in
> blk_sq_make_request() lock contention, and overall spin lock/unlock
> overhead of these two functions is around 10% each....
>
> So, yes, the patch reduces context switches but doesn't really
> reduce system time, improve performance noticeably or address the
> run-to-run variability issue...

OK, so one more thing to try. With the same patch still applied, could 
you edit block/blk-mq-tag.h and change

         BT_WAIT_QUEUES  = 8,

to

         BT_WAIT_QUEUES  = 1,

and see if that smoothes things out?
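
IOW, the whole change is just this one-liner (context/whitespace
approximate, against whatever tree you're testing):

-         BT_WAIT_QUEUES  = 8,
+         BT_WAIT_QUEUES  = 1,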

On the road the next few days, so might take me a few days to get back 
to this. I was able to reproduce the horrible contention on the wait 
queue, but everything seemed to behave nicely with just the 
exclusive_wait/batch_wakeup for me. Looks like I might have to fire up 
kvm and set it up like you have.

--
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-16 21:06           ` Michael Kerrisk (man-pages)
  (?)
@ 2014-06-17 14:01           ` Christoph Hellwig
  -1 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2014-06-17 14:01 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christoph Hellwig, linux-fsdevel, linux-man, xfs

On Mon, Jun 16, 2014 at 11:06:55PM +0200, Michael Kerrisk (man-pages) wrote:
> > Talking about documentation:  The kernel aio manpages (io_*.2) don't
> > really seem very useful, mostly because they don't explain how
> > to set up the iocbs.  Michael, any idea how to get started to improve
> > this?
> 
> Well, it just needs someone knowledgeable to fill the gaps, and
> perhaps add an example program. I don't have that knowledge 
> offhand myself...

Dave pointed out that there actually are man pages for the ops, they
just come with libaio instead of the man-pages project.  I'll send you
a patch to refer to them.

The libaio man pages could also use some work, I'll see if I can find
time for that as well.
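
In the meantime, for anyone following along, the iocb setup for an
async fsync is pretty small.  A minimal, untested sketch using the
libaio helpers (io_prep_fsync()/io_submit()/io_getevents()) looks
something like the below; note that io_submit() will fail with
EINVAL on filesystems that don't implement the aio_fsync method,
which is what this thread is about fixing for XFS:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libaio.h>

int main(int argc, char **argv)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1];
	struct io_event ev;
	int fd, ret;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* room for a single in-flight request */
	ret = io_setup(1, &ctx);
	if (ret < 0) {
		fprintf(stderr, "io_setup: %s\n", strerror(-ret));
		return 1;
	}

	/* fill in an IO_CMD_FSYNC iocb for this fd */
	io_prep_fsync(&cb, fd);
	cbs[0] = &cb;

	/* the fsync is now in flight; we could go and queue more */
	ret = io_submit(ctx, 1, cbs);
	if (ret != 1) {
		fprintf(stderr, "io_submit: %s\n", strerror(-ret));
		return 1;
	}

	/* reap the completion; ev.res carries the fsync result */
	ret = io_getevents(ctx, 1, 1, &ev, NULL);
	if (ret != 1) {
		fprintf(stderr, "io_getevents: %s\n", strerror(-ret));
		return 1;
	}
	printf("fsync returned %ld\n", (long)ev.res);

	io_destroy(ctx);
	close(fd);
	return 0;
}

(build with -laio)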

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-17 13:23                       ` Jens Axboe
  (?)
@ 2014-06-18  0:28                         ` Dave Chinner
  -1 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-18  0:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, linux-man, xfs, linux-mm

[cc linux-mm]

On Tue, Jun 17, 2014 at 07:23:58AM -0600, Jens Axboe wrote:
> On 2014-06-16 16:27, Dave Chinner wrote:
> >On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
> >>On 06/16/2014 01:19 AM, Dave Chinner wrote:
> >>>On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
> >>>>On 2014-06-15 20:00, Dave Chinner wrote:
> >>>>>On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> >>>>>FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> >>>>>running isn't anything related to XFS.  The async fsync workqueue
> >>>>>results in several thousand worker threads dispatching IO
> >>>>>concurrently across 16 CPUs:
> >>>>>
> >>>>>$ ps -ef |grep kworker |wc -l
> >>>>>4693
> >>>>>$
> >>>>>
> >>>>>Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
> >>>>>
> >>>>>-  51.33%  [kernel]            [k] percpu_ida_alloc
> >>>>>    - percpu_ida_alloc
> >>>>>       + 85.73% blk_mq_wait_for_tags
> >>>>>       + 14.23% blk_mq_get_tag
> >>>>>-  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
> >>>>>    - _raw_spin_unlock_irqrestore
> >>>>>       - 66.26% virtio_queue_rq
> >>>>>          - __blk_mq_run_hw_queue
> >>>>>             - 99.65% blk_mq_run_hw_queue
> >>>>>                + 99.47% blk_mq_insert_requests
> >>>>>                + 0.53% blk_mq_insert_request
> >>>>>.....
> >>>>>-   7.91%  [kernel]            [k] _raw_spin_unlock_irq
> >>>>>    - _raw_spin_unlock_irq
> >>>>>       - 69.59% __schedule
> >>>>>          - 86.49% schedule
> >>>>>             + 47.72% percpu_ida_alloc
> >>>>>             + 21.75% worker_thread
> >>>>>             + 19.12% schedule_timeout
> >>>>>....
> >>>>>       + 18.06% blk_mq_make_request
> >>>>>
> >>>>>Runtime:
> >>>>>
> >>>>>real    4m1.243s
> >>>>>user    0m47.724s
> >>>>>sys     11m56.724s
> >>>>>
> >>>>>Most of the excessive CPU usage is coming from the blk-mq layer, and
> >>>>>XFS is barely showing up in the profiles at all - the IDA tag
> >>>>>allocator is burning 8 CPUs at about 60,000 write IOPS....
> >>>>>
> >>>>>I know that the tag allocator has been rewritten, so I tested
> >>>>>against a current Linus kernel with the XFS aio-fsync
> >>>>>patch. The results are all over the place - from several sequential
> >>>>>runs of the same test (removing the files in between so each test
> >>>>>starts from an empty fs):
> >>>>>
> >>>>>Wall time	sys time	IOPS	 files/s
> >>>>>4m58.151s	11m12.648s	30,000	 13,500
> >>>>>4m35.075s	12m45.900s	45,000	 15,000
> >>>>>3m10.665s	11m15.804s	65,000	 21,000
> >>>>>3m27.384s	11m54.723s	85,000	 20,000
> >>>>>3m59.574s	11m12.012s	50,000	 16,500
> >>>>>4m12.704s	12m15.720s	50,000	 17,000
> >>>>>
> >>>>>The 3.15 based kernel was pretty consistent around the 4m10 mark,
> >>>>>generally only +/-10s in runtime and not much change in system time.
> >>>>>The files/s rate reported by fs_mark doesn't vary that much, either.
> >>>>>So the new tag allocator seems to be no better in terms of IO
> >>>>>dispatch scalability, yet adds significant variability to IO
> >>>>>performance.
> >>>>>
> >>>>>What I noticed is a massive jump in context switch overhead: from
> >>>>>around 250,000/s to over 800,000/s and the CPU profiles show that
> >>>>>this comes from the new tag allocator:
> >>>>>
> >>>>>-  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >>>>>    - _raw_spin_unlock_irqrestore
> >>>>>       - 58.22% prepare_to_wait
> >>>>>            100.00% bt_get
> >>>>>               blk_mq_get_tag
> >>>>>               __blk_mq_alloc_request
> >>>>>               blk_mq_map_request
> >>>>>               blk_sq_make_request
> >>>>>               generic_make_request
> >>>>>       - 22.51% virtio_queue_rq
> >>>>>            __blk_mq_run_hw_queue
> >>>>>....
> >>>>>-  21.56%  [kernel]  [k] _raw_spin_unlock_irq
> >>>>>    - _raw_spin_unlock_irq
> >>>>>       - 58.73% __schedule
> >>>>>          - 53.42% io_schedule
> >>>>>               99.88% bt_get
> >>>>>                  blk_mq_get_tag
> >>>>>                  __blk_mq_alloc_request
> >>>>>                  blk_mq_map_request
> >>>>>                  blk_sq_make_request
> >>>>>                  generic_make_request
> >>>>>          - 35.58% schedule
> >>>>>             + 49.31% worker_thread
> >>>>>             + 32.45% schedule_timeout
> >>>>>             + 10.35% _xfs_log_force_lsn
> >>>>>             + 3.10% xlog_cil_force_lsn
> >>>>>....
> >.....
> >>Can you try with this patch?
> >
> >Ok, context switches are back down in the realm of 400,000/s. It's
> >better, but it's still a bit higher than the 3.15 code. XFS is
> >actually showing up in the context switch path profiles now...
> >
> >However, performance is still excitingly variable and not much
> >different to not having this patch applied. System time is unchanged
> >(still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
> >show significant variance (at least +/-25%) from run to run. The
> >worst case is not as slow as the unpatched kernel, but it's no
> >better than the 3.15 worst case.
> >
> >Profiles on a slow run look like:
> >
> >-  43.43%  [kernel]  [k] _raw_spin_unlock_irq
> >    - _raw_spin_unlock_irq
> >       - 64.23% blk_sq_make_request
> >            generic_make_request
> >           submit_bio
> >       + 26.79% __schedule
> >...
> >-  15.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >    - _raw_spin_unlock_irqrestore
> >       - 39.81% virtio_queue_rq
> >            __blk_mq_run_hw_queue
> >       + 24.13% complete
> >       + 17.74% prepare_to_wait_exclusive
> >       + 9.66% remove_wait_queue
> >
> >Looks like the main contention problem is in blk_sq_make_request().
> >Also, there looks to be quite a bit of lock contention on the tag
> >wait queues given that this patch made prepare_to_wait_exclusive()
> >suddenly show up in the profiles.
> >
> >FWIW, on a fast run there is very little time in
> >blk_sq_make_request() lock contention, and overall spin lock/unlock
> >overhead of these two functions is around 10% each....
> >
> >So, yes, the patch reduces context switches but doesn't really
> >reduce system time, improve performance noticeably or address the
> >run-to-run variability issue...
> 
> OK, so one more thing to try. With the same patch still applied,
> could you edit block/blk-mq-tag.h and change
> 
>         BT_WAIT_QUEUES  = 8,
> 
> to
> 
>         BT_WAIT_QUEUES  = 1,
> 
> and see if that smoothes things out?

Ok, that smoothes things out to the point where I can see the
trigger for the really nasty variable performance: the machine
running out of free memory. Once direct reclaim starts evicting
the clean page cache pages holding the just-written file data,
performance drops by 25-50% and becomes significantly more
variable.

So the variability doesn't seem to be solely related to the tag
allocator; it is contributing some via wait queue contention,
but it's definitely not the main contributor, nor the trigger...

MM-folk - the VM is running fake-numa=4 and has 16GB of RAM, and
each step in the workload is generating 3.2GB of dirty pages (i.e.
just on the dirty throttling threshold). It then does a concurrent
asynchronous fsync of the 800,000 dirty files it just created,
leaving 3.2GB of clean pages in the cache. The workload iterates
this several times. Once the machine runs out of free memory (2.5
iterations in) performance drops by about 30% on average, but the
drop varies between 20-60% randomly. I'm not concerned by a 30% drop
when memory fills up - I'm concerned by the volatility of the drop
that occurs, e.g.:

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000         4096      29938.0         13459475
     0      1600000         4096      28023.7         15662387
     0      2400000         4096      23704.6         16451761
     0      3200000         4096      16976.8         15029056
     0      4000000         4096      21858.3         15591604

Iteration 3 is where memory fills, and you can see that performance
dropped by 25%. Iteration 4 drops another 25%, then iteration 5
regains it. If I keep running the workload for more iterations, this
is pretty typical of the iteration-to-iteration variability, even
though every iteration is identical in behaviour as are the initial
conditions (i.e. memory full of clean, used-once pages).
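
(A simple way to confirm that reclaim is the trigger, rather than
anything the fs or block layer is doing, is to sample the direct
reclaim counters between iterations - something like

$ grep -E 'allocstall|pgscan_direct' /proc/vmstat

and watch whether they start climbing right where the files/s
numbers fall off.  Counter names as per current kernels, so treat
that as approximate.)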

This didn't happen in 3.15.0, but the behaviour may have been masked
by the block layer tag allocator CPU overhead dominating the system
behaviour.

> On the road the next few days, so might take me a few days to get
> back to this. I was able to reproduce the horrible contention on the
> wait queue, but everything seemed to behave nicely with just the
> exclusive_wait/batch_wakeup for me. Looks like I might have to fire
> up kvm and set it up like you have.

I'm guessing that the difference is that you weren't driving the
machine into memory reclaim at the same time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
@ 2014-06-18  2:24                           ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2014-06-18  2:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, xfs, linux-mm, linux-man

On 2014-06-17 17:28, Dave Chinner wrote:
> [cc linux-mm]
>
> On Tue, Jun 17, 2014 at 07:23:58AM -0600, Jens Axboe wrote:
>> On 2014-06-16 16:27, Dave Chinner wrote:
>>> On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
>>>> On 06/16/2014 01:19 AM, Dave Chinner wrote:
>>>>> On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
>>>>>> On 2014-06-15 20:00, Dave Chinner wrote:
>>>>>>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>>>>>>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
>>>>>>> running isn't anything related to XFS.  The async fsync workqueue
>>>>>>> results in several thousand worker threads dispatching IO
>>>>>>> concurrently across 16 CPUs:
>>>>>>>
>>>>>>> $ ps -ef |grep kworker |wc -l
>>>>>>> 4693
>>>>>>> $
>>>>>>>
>>>>>>> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
>>>>>>>
>>>>>>> -  51.33%  [kernel]            [k] percpu_ida_alloc
>>>>>>>     - percpu_ida_alloc
>>>>>>>        + 85.73% blk_mq_wait_for_tags
>>>>>>>        + 14.23% blk_mq_get_tag
>>>>>>> -  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
>>>>>>>     - _raw_spin_unlock_irqrestore
>>>>>>>        - 66.26% virtio_queue_rq
>>>>>>>           - __blk_mq_run_hw_queue
>>>>>>>              - 99.65% blk_mq_run_hw_queue
>>>>>>>                 + 99.47% blk_mq_insert_requests
>>>>>>>                 + 0.53% blk_mq_insert_request
>>>>>>> .....
>>>>>>> -   7.91%  [kernel]            [k] _raw_spin_unlock_irq
>>>>>>>     - _raw_spin_unlock_irq
>>>>>>>        - 69.59% __schedule
>>>>>>>           - 86.49% schedule
>>>>>>>              + 47.72% percpu_ida_alloc
>>>>>>>              + 21.75% worker_thread
>>>>>>>              + 19.12% schedule_timeout
>>>>>>> ....
>>>>>>>        + 18.06% blk_mq_make_request
>>>>>>>
>>>>>>> Runtime:
>>>>>>>
>>>>>>> real    4m1.243s
>>>>>>> user    0m47.724s
>>>>>>> sys     11m56.724s
>>>>>>>
>>>>>>> Most of the excessive CPU usage is coming from the blk-mq layer, and
>>>>>>> XFS is barely showing up in the profiles at all - the IDA tag
>>>>>>> allocator is burning 8 CPUs at about 60,000 write IOPS....
>>>>>>>
>>>>>>> I know that the tag allocator has been rewritten, so I tested
>>>>>>> against a current a current Linus kernel with the XFS aio-fsync
>>>>>>> patch. The results are all over the place - from several sequential
>>>>>>> runs of the same test (removing the files in between so each tests
>>>>>>> starts from an empty fs):
>>>>>>>
>>>>>>> Wall time	sys time	IOPS	 files/s
>>>>>>> 4m58.151s	11m12.648s	30,000	 13,500
>>>>>>> 4m35.075s	12m45.900s	45,000	 15,000
>>>>>>> 3m10.665s	11m15.804s	65,000	 21,000
>>>>>>> 3m27.384s	11m54.723s	85,000	 20,000
>>>>>>> 3m59.574s	11m12.012s	50,000	 16,500
>>>>>>> 4m12.704s	12m15.720s	50,000	 17,000
>>>>>>>
>>>>>>> The 3.15 based kernel was pretty consistent around the 4m10 mark,
>>>>>>> generally only +/-10s in runtime and not much change in system time.
>>>>>>> The files/s rate reported by fs_mark doesn't vary that much, either.
>>>>>>> So the new tag allocator seems to be no better in terms of IO
>>>>>>> dispatch scalability, yet adds significant variability to IO
>>>>>>> performance.
>>>>>>>
>>>>>>> What I noticed is a massive jump in context switch overhead: from
>>>>>>> around 250,000/s to over 800,000/s and the CPU profiles show that
>>>>>>> this comes from the new tag allocator:
>>>>>>>
>>>>>>> -  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
>>>>>>>     - _raw_spin_unlock_irqrestore
>>>>>>>        - 58.22% prepare_to_wait
>>>>>>>             100.00% bt_get
>>>>>>>                blk_mq_get_tag
>>>>>>>                __blk_mq_alloc_request
>>>>>>>                blk_mq_map_request
>>>>>>>                blk_sq_make_request
>>>>>>>                generic_make_request
>>>>>>>        - 22.51% virtio_queue_rq
>>>>>>>             __blk_mq_run_hw_queue
>>>>>>> ....
>>>>>>> -  21.56%  [kernel]  [k] _raw_spin_unlock_irq
>>>>>>>     - _raw_spin_unlock_irq
>>>>>>>        - 58.73% __schedule
>>>>>>>           - 53.42% io_schedule
>>>>>>>                99.88% bt_get
>>>>>>>                   blk_mq_get_tag
>>>>>>>                   __blk_mq_alloc_request
>>>>>>>                   blk_mq_map_request
>>>>>>>                   blk_sq_make_request
>>>>>>>                   generic_make_request
>>>>>>>           - 35.58% schedule
>>>>>>>              + 49.31% worker_thread
>>>>>>>              + 32.45% schedule_timeout
>>>>>>>              + 10.35% _xfs_log_force_lsn
>>>>>>>              + 3.10% xlog_cil_force_lsn
>>>>>>> ....
>>> .....
>>>> Can you try with this patch?
>>>
>>> Ok, context switches are back down in the realm of 400,000/s. It's
>>> better, but it's still a bit higher than that the 3.15 code. XFS is
>>> actually showing up in the context switch path profiles now...
>>>
>>> However, performance is still excitingly variable and not much
>>> different to not having this patch applied. System time is unchanged
>>> (still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
>>> show significant variance (at least +/-25%) from run to run. The
>>> worst case is not as slow as the unpatched kernel, but it's no
>>> better than the 3.15 worst case.
>>>
>>> Profiles on a slow run look like:
>>>
>>> -  43.43%  [kernel]  [k] _raw_spin_unlock_irq
>>>     - _raw_spin_unlock_irq
>>>        - 64.23% blk_sq_make_request
>>>             generic_make_request
>>>            submit_bio                                                                                                                                                  ¿
>>>        + 26.79% __schedule
>>> ...
>>> -  15.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
>>>     - _raw_spin_unlock_irqrestore
>>>        - 39.81% virtio_queue_rq
>>>             __blk_mq_run_hw_queue
>>>        + 24.13% complete
>>>        + 17.74% prepare_to_wait_exclusive
>>>        + 9.66% remove_wait_queue
>>>
>>> Looks like the main contention problem is in blk_sq_make_request().
>>> Also, there looks to be quite a bit of lock contention on the tag
>>> wait queues given that this patch made prepare_to_wait_exclusive()
>>> suddenly show up in the profiles.
>>>
>>> FWIW, on a fast run there is very little time in
>>> blk_sq_make_request() lock contention, and overall spin lock/unlock
>>> overhead of these two functions is around 10% each....
>>>
>>> So, yes, the patch reduces context switches but doesn't really
>>> reduce system time, improve performance noticably or address the
>>> run-to-run variability issue...
>>
>> OK, so one more thing to try. With the same patch still applied,
>> could you edit block/blk-mq-tag.h and change
>>
>>          BT_WAIT_QUEUES  = 8,
>>
>> to
>>
>>          BT_WAIT_QUEUES  = 1,
>>
>> and see if that smoothes things out?
>
> Ok, that smoothes things out to the point where I can see the
> trigger for the really nasty variable performance. The trigger is
> the machine running out of free memory. i.e. direct reclaim of clean
> pages for the data in the new files in the page cache drives the
> performance down by 25-50% and introduces significant variability.
>
> So the variability doesn't seem to be solely related to the tag
> allocator; it is contributing some via wait queue contention,
> but it's definitely not the main contributor, nor the trigger...
>
> MM-folk - the VM is running fake-numa=4 and has 16GB of RAM, and
> each step in the workload is generating 3.2GB of dirty pages (i.e.
> just on the dirty throttling threshold). It then does a concurrent
> asynchronous fsync of the 800,000 dirty files it just created,
> leaving 3.2GB of clean pages in the cache. The workload iterates
> this several times. Once the machine runs out of free memory (2.5
> iterations in) performance drops by about 30% on average, but the
> drop varies between 20-60% randomly. I'm not concerned by a 30% drop
> when memory fills up - I'm concerned by the volatility of the drop
> that occurs. e.g:
>
> FSUse%        Count         Size    Files/sec     App Overhead
>       0       800000         4096      29938.0         13459475
>       0      1600000         4096      28023.7         15662387
>       0      2400000         4096      23704.6         16451761
>       0      3200000         4096      16976.8         15029056
>       0      4000000         4096      21858.3         15591604
>
> Iteration 3 is where memory fills, and you can see that performance
> dropped by 25%. Iteration 4 drops another 25%, then iteration 5
> regains it. If I keep running the workload for more iterations, this
> is pretty typical of the iteration-to-iteration variability, even
> though every iteration is identical in behaviour as are the initial
> conditions (i.e. memory full of clean, used-once pages).
>
> This didn't happen in 3.15.0, but the behaviour may have been masked
> by the block layer tag allocator CPU overhead dominating the system
> behaviour.

OK, that's reassuring. I'll do some testing with the cyclic wait queues, 
but probably not until Thursday. Alexander's patches might potentially 
fix the variability as well, but if we can make do without the multiple 
wait queues, I'd much rather just kill it.

Did you see any spinlock contention with BT_WAIT_QUEUES = 1?

>> On the road the next few days, so might take me a few days to get
>> back to this. I was able to reproduce the horrible contention on the
>> wait queue, but everything seemed to behave nicely with just the
>> exclusive_wait/batch_wakeup for me. Looks like I might have to fire
>> up kvm and set it up like you.
>
> I'm guessing that the difference is that you weren't driving the
> machine into memory reclaim at the same time.

My testing was purely O_DIRECT, so no, no reclaim at all. I was only 
focused on the lock contention in there, and the patch completely got 
rid of that.

-- 
Jens Axboe

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread
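
As a point of reference for the workload quoted above, the bulk asynchronous
fsync pass is the sort of thing an application would drive through libaio:
prepare one fsync iocb per file, submit the whole batch, then reap the
completions. A minimal userspace sketch of that dispatch pattern (illustrative
only; the file names, batch size and missing error handling are assumptions,
not details of the fs_mark change; build with -laio):

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>

#define NR_FILES 4	/* tiny stand-in for the 800,000 files in the test */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cbs[NR_FILES], *cbp[NR_FILES];
	struct io_event events[NR_FILES];
	char name[32];
	int i;

	if (io_setup(NR_FILES, &ctx) < 0)
		return 1;

	for (i = 0; i < NR_FILES; i++) {
		snprintf(name, sizeof(name), "testfile.%d", i);
		int fd = open(name, O_WRONLY | O_CREAT, 0644);

		/* one IOCB_CMD_FSYNC per file; with ->aio_fsync wired up
		 * these can all be in flight at once instead of one
		 * blocking fsync() per file */
		io_prep_fsync(&cbs[i], fd);
		cbp[i] = &cbs[i];
	}

	/* dispatch the whole batch, then wait for every completion */
	io_submit(ctx, NR_FILES, cbp);
	io_getevents(ctx, NR_FILES, NR_FILES, events, NULL);
	io_destroy(ctx);
	return 0;
}

The 3.2GB of dirty data per iteration mentioned above also lines up with the
default vm.dirty_ratio of 20% on a 16GB machine (0.20 x 16GB = 3.2GB),
assuming the throttling defaults were left in place.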


* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-18  2:24                           ` Jens Axboe
@ 2014-06-18  3:13                             ` Dave Chinner
  -1 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-18  3:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, linux-man, xfs, linux-mm

On Tue, Jun 17, 2014 at 07:24:10PM -0700, Jens Axboe wrote:
> On 2014-06-17 17:28, Dave Chinner wrote:
> >[cc linux-mm]
> >
> >On Tue, Jun 17, 2014 at 07:23:58AM -0600, Jens Axboe wrote:
> >>On 2014-06-16 16:27, Dave Chinner wrote:
> >>>On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
> >>>>On 06/16/2014 01:19 AM, Dave Chinner wrote:
> >>>>>On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
> >>>>>>On 2014-06-15 20:00, Dave Chinner wrote:
> >>>>>>>On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> >>>>>>>FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> >>>>>>>running isn't anything related to XFS.  The async fsync workqueue
> >>>>>>>results in several thousand worker threads dispatching IO
> >>>>>>>concurrently across 16 CPUs:
....
> >>>>>>>I know that the tag allocator has been rewritten, so I tested
> >>>>>>>against a current Linus kernel with the XFS aio-fsync
> >>>>>>>patch. The results are all over the place - from several sequential
> >>>>>>>runs of the same test (removing the files in between so each test
> >>>>>>>starts from an empty fs):
> >>>>>>>
> >>>>>>>Wall time	sys time	IOPS	 files/s
> >>>>>>>4m58.151s	11m12.648s	30,000	 13,500
> >>>>>>>4m35.075s	12m45.900s	45,000	 15,000
> >>>>>>>3m10.665s	11m15.804s	65,000	 21,000
> >>>>>>>3m27.384s	11m54.723s	85,000	 20,000
> >>>>>>>3m59.574s	11m12.012s	50,000	 16,500
> >>>>>>>4m12.704s	12m15.720s	50,000	 17,000
> >>>>>>>
> >>>>>>>The 3.15 based kernel was pretty consistent around the 4m10 mark,
> >>>>>>>generally only +/-10s in runtime and not much change in system time.
> >>>>>>>The files/s rate reported by fs_mark doesn't vary that much, either.
> >>>>>>>So the new tag allocator seems to be no better in terms of IO
> >>>>>>>dispatch scalability, yet adds significant variability to IO
> >>>>>>>performance.
> >>>>>>>
> >>>>>>>What I noticed is a massive jump in context switch overhead: from
> >>>>>>>around 250,000/s to over 800,000/s and the CPU profiles show that
> >>>>>>>this comes from the new tag allocator:
....
> >>>>Can you try with this patch?
> >>>
> >>>Ok, context switches are back down in the realm of 400,000/s. It's
> >>>better, but it's still a bit higher than the 3.15 code. XFS is
> >>>actually showing up in the context switch path profiles now...
> >>>
> >>>However, performance is still excitingly variable and not much
> >>>different to not having this patch applied. System time is unchanged
> >>>(still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
> >>>show significant variance (at least +/-25%) from run to run. The
> >>>worst case is not as slow as the unpatched kernel, but it's no
> >>>better than the 3.15 worst case.
....
> >>>Looks like the main contention problem is in blk_sq_make_request().
> >>>Also, there looks to be quite a bit of lock contention on the tag
> >>>wait queues given that this patch made prepare_to_wait_exclusive()
> >>>suddenly show up in the profiles.
> >>>
> >>>FWIW, on a fast run there is very little time in
> >>>blk_sq_make_request() lock contention, and overall spin lock/unlock
> >>>overhead of these two functions is around 10% each....
> >>>
> >>>So, yes, the patch reduces context switches but doesn't really
> >>>reduce system time, improve performance noticeably or address the
> >>>run-to-run variability issue...
> >>
> >>OK, so one more thing to try. With the same patch still applied,
> >>could you edit block/blk-mq-tag.h and change
> >>
> >>         BT_WAIT_QUEUES  = 8,
> >>
> >>to
> >>
> >>         BT_WAIT_QUEUES  = 1,
> >>
> >>and see if that smoothes things out?
> >
> >Ok, that smoothes things out to the point where I can see the
> >trigger for the really nasty variable performance. The trigger is
> >the machine running out of free memory. i.e. direct reclaim of clean
> >pages for the data in the new files in the page cache drives the
> >performance down by 25-50% and introduces significant variability.
> >
> >So the variability doesn't seem to be solely related to the tag
> >allocator; it is contributing some via wait queue contention,
> >but it's definitely not the main contributor, nor the trigger...
> >
> >MM-folk - the VM is running fake-numa=4 and has 16GB of RAM, and
> >each step in the workload is generating 3.2GB of dirty pages (i.e.
> >just on the dirty throttling threshold). It then does a concurrent
> >asynchronous fsync of the 800,000 dirty files it just created,
> >leaving 3.2GB of clean pages in the cache. The workload iterates
> >this several times. Once the machine runs out of free memory (2.5
> >iterations in) performance drops by about 30% on average, but the
> >drop varies between 20-60% randomly. I'm not concerned by a 30% drop
> >when memory fills up - I'm concerned by the volatility of the drop
> >that occurs. e.g:
> >
> >FSUse%        Count         Size    Files/sec     App Overhead
> >      0       800000         4096      29938.0         13459475
> >      0      1600000         4096      28023.7         15662387
> >      0      2400000         4096      23704.6         16451761
> >      0      3200000         4096      16976.8         15029056
> >      0      4000000         4096      21858.3         15591604
> >
> >Iteration 3 is where memory fills, and you can see that performance
> >dropped by 25%. Iteration 4 drops another 25%, then iteration 5
> >regains it. If I keep running the workload for more iterations, this
> >is pretty typical of the iteration-to-iteration variability, even
> >though every iteration is identical in behaviour as are the initial
> >conditions (i.e. memory full of clean, used-once pages).
> >
> >This didn't happen in 3.15.0, but the behaviour may have been masked
> >by the block layer tag allocator CPU overhead dominating the system
> >behaviour.
> 
> OK, that's reassuring. I'll do some testing with the cyclic wait
> queues, but probably not until Thursday. Alexander's patches might
> potentially fix the variability as well, but if we can make do
> without the multiple wait queues, I'd much rather just kill it.
> 
> Did you see any spinlock contention with BT_WAIT_QUEUES = 1?

Yes. During the 15-20s of high IOPS dispatch rates the profile looks
like this:

-  36.00%  [kernel]  [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      - 69.72% blk_sq_make_request
           generic_make_request
         + submit_bio
      + 24.81% __schedule
....
-  15.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 32.87% prepare_to_wait_exclusive
           bt_get
           blk_mq_get_tag
           __blk_mq_alloc_request
           blk_mq_map_request
           blk_sq_make_request
           generic_make_request
         + submit_bio
      - 29.21% virtio_queue_rq
           __blk_mq_run_hw_queue
      + 11.69% complete
      + 8.21% finish_wait
        8.10% remove_wait_queue

But the IOPS rate has definitely increased with this config
- I just saw 90k, 100k and 110k IOPS in the last 3 iterations of the
workload (the above profile is from the 100k IOPS period). However,
the wall time was still only 3m58s, which again tends to implicate
the write() portion of the benchmark in causing the slowdowns
rather than the fsync() portion that is dispatching all the IO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 39+ messages in thread
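
The "several thousand worker threads dispatching IO concurrently" quoted in
this message come from pushing each aio_fsync off to a workqueue, so a
blocked fsync just parks one kworker while the others keep dispatching. In
rough outline the dispatch side looks something like the sketch below; this
is illustrative only, and the identifiers and the completion hook are
assumptions rather than the RFC patch itself:

/* Illustrative outline only; names do not match the RFC patch. */
#include <linux/workqueue.h>
#include <linux/fs.h>
#include <linux/aio.h>
#include <linux/slab.h>
#include <linux/errno.h>

struct afsync_work {
	struct work_struct	work;
	struct kiocb		*iocb;
	int			datasync;
};

static void afsync_work_fn(struct work_struct *w)
{
	struct afsync_work *fw = container_of(w, struct afsync_work, work);

	/* Runs in kworker context: if the log force blocks, the workqueue
	 * simply runs another queued fsync on a different worker, which is
	 * where the thousands of kworkers come from. */
	aio_complete(fw->iocb, vfs_fsync(fw->iocb->ki_filp, fw->datasync), 0);
	kfree(fw);
}

/* ->aio_fsync(): package the request and punt it to a workqueue */
static int afsync_queue(struct kiocb *iocb, int datasync)
{
	struct afsync_work *fw = kzalloc(sizeof(*fw), GFP_KERNEL);

	if (!fw)
		return -ENOMEM;
	fw->iocb = iocb;
	fw->datasync = datasync;
	INIT_WORK(&fw->work, afsync_work_fn);
	schedule_work(&fw->work);
	return -EIOCBQUEUED;	/* aio_complete() reports the result later */
}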


* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
  2014-06-18  3:13                             ` Dave Chinner
@ 2014-06-18  3:20                               ` Jens Axboe
  -1 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2014-06-18  3:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-fsdevel, linux-man, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 7927 bytes --]

On 2014-06-17 20:13, Dave Chinner wrote:
> On Tue, Jun 17, 2014 at 07:24:10PM -0700, Jens Axboe wrote:
>> On 2014-06-17 17:28, Dave Chinner wrote:
>>> [cc linux-mm]
>>>
>>> On Tue, Jun 17, 2014 at 07:23:58AM -0600, Jens Axboe wrote:
>>>> On 2014-06-16 16:27, Dave Chinner wrote:
>>>>> On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
>>>>>> On 06/16/2014 01:19 AM, Dave Chinner wrote:
>>>>>>> On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
>>>>>>>> On 2014-06-15 20:00, Dave Chinner wrote:
>>>>>>>>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>>>>>>>>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
>>>>>>>>> running isn't anything related to XFS.  The async fsync workqueue
>>>>>>>>> results in several thousand worker threads dispatching IO
>>>>>>>>> concurrently across 16 CPUs:
> ....
>>>>>>>>> I know that the tag allocator has been rewritten, so I tested
>>>>>>>>> against a current Linus kernel with the XFS aio-fsync
>>>>>>>>> patch. The results are all over the place - from several sequential
>>>>>>>>> runs of the same test (removing the files in between so each test
>>>>>>>>> starts from an empty fs):
>>>>>>>>>
>>>>>>>>> Wall time	sys time	IOPS	 files/s
>>>>>>>>> 4m58.151s	11m12.648s	30,000	 13,500
>>>>>>>>> 4m35.075s	12m45.900s	45,000	 15,000
>>>>>>>>> 3m10.665s	11m15.804s	65,000	 21,000
>>>>>>>>> 3m27.384s	11m54.723s	85,000	 20,000
>>>>>>>>> 3m59.574s	11m12.012s	50,000	 16,500
>>>>>>>>> 4m12.704s	12m15.720s	50,000	 17,000
>>>>>>>>>
>>>>>>>>> The 3.15 based kernel was pretty consistent around the 4m10 mark,
>>>>>>>>> generally only +/-10s in runtime and not much change in system time.
>>>>>>>>> The files/s rate reported by fs_mark doesn't vary that much, either.
>>>>>>>>> So the new tag allocator seems to be no better in terms of IO
>>>>>>>>> dispatch scalability, yet adds significant variability to IO
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> What I noticed is a massive jump in context switch overhead: from
>>>>>>>>> around 250,000/s to over 800,000/s and the CPU profiles show that
>>>>>>>>> this comes from the new tag allocator:
> ....
>>>>>> Can you try with this patch?
>>>>>
>>>>> Ok, context switches are back down in the realm of 400,000/s. It's
>>>>> better, but it's still a bit higher than the 3.15 code. XFS is
>>>>> actually showing up in the context switch path profiles now...
>>>>>
>>>>> However, performance is still excitingly variable and not much
>>>>> different to not having this patch applied. System time is unchanged
>>>>> (still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
>>>>> show significant variance (at least +/-25%) from run to run. The
>>>>> worst case is not as slow as the unpatched kernel, but it's no
>>>>> better than the 3.15 worst case.
> ....
>>>>> Looks like the main contention problem is in blk_sq_make_request().
>>>>> Also, there looks to be quite a bit of lock contention on the tag
>>>>> wait queues given that this patch made prepare_to_wait_exclusive()
>>>>> suddenly show up in the profiles.
>>>>>
>>>>> FWIW, on a fast run there is very little time in
>>>>> blk_sq_make_request() lock contention, and overall spin lock/unlock
>>>>> overhead of these two functions is around 10% each....
>>>>>
>>>>> So, yes, the patch reduces context switches but doesn't really
>>>>> reduce system time, improve performance noticeably or address the
>>>>> run-to-run variability issue...
>>>>
>>>> OK, so one more thing to try. With the same patch still applied,
>>>> could you edit block/blk-mq-tag.h and change
>>>>
>>>>          BT_WAIT_QUEUES  = 8,
>>>>
>>>> to
>>>>
>>>>          BT_WAIT_QUEUES  = 1,
>>>>
>>>> and see if that smoothes things out?
>>>
>>> Ok, that smoothes things out to the point where I can see the
>>> trigger for the really nasty variable performance. The trigger is
>>> the machine running out of free memory. i.e. direct reclaim of clean
>>> pages for the data in the new files in the page cache drives the
>>> performance down by 25-50% and introduces significant variability.
>>>
>>> So the variability doesn't seem to be solely related to the tag
>>> allocator; it is contributing some via wait queue contention,
>>> but it's definitely not the main contributor, nor the trigger...
>>>
>>> MM-folk - the VM is running fake-numa=4 and has 16GB of RAM, and
>>> each step in the workload is generating 3.2GB of dirty pages (i.e.
>>> just on the dirty throttling threshold). It then does a concurrent
>>> asynchronous fsync of the 800,000 dirty files it just created,
>>> leaving 3.2GB of clean pages in the cache. The workload iterates
>>> this several times. Once the machine runs out of free memory (2.5
>>> iterations in) performance drops by about 30% on average, but the
>>> drop varies between 20-60% randomly. I'm not concerned by a 30% drop
>>> when memory fills up - I'm concerned by the volatility of the drop
>>> that occurs. e.g:
>>>
>>> FSUse%        Count         Size    Files/sec     App Overhead
>>>       0       800000         4096      29938.0         13459475
>>>       0      1600000         4096      28023.7         15662387
>>>       0      2400000         4096      23704.6         16451761
>>>       0      3200000         4096      16976.8         15029056
>>>       0      4000000         4096      21858.3         15591604
>>>
>>> Iteration 3 is where memory fills, and you can see that performance
>>> dropped by 25%. Iteration 4 drops another 25%, then iteration 5
>>> regains it. If I keep running the workload for more iterations, this
>>> is pretty typical of the iteration-to-iteration variability, even
>>> though every iteration is identical in behaviour as are the initial
>>> conditions (i.e. memory full of clean, used-once pages).
>>>
>>> This didn't happen in 3.15.0, but the behaviour may have been masked
>>> by the block layer tag allocator CPU overhead dominating the system
>>> behaviour.
>>
>> OK, that's reassuring. I'll do some testing with the cyclic wait
>> queues, but probably not until Thursday. Alexander's patches might
>> potentially fix the variability as well, but if we can make do
>> without the multiple wait queues, I'd much rather just kill it.
>>
>> Did you see any spinlock contention with BT_WAIT_QUEUES = 1?
>
> Yes. During the 15-20s of high IOPS dispatch rates the profile looks
> like this:
>
> -  36.00%  [kernel]  [k] _raw_spin_unlock_irq
>     - _raw_spin_unlock_irq
>        - 69.72% blk_sq_make_request
>             generic_make_request
>           + submit_bio
>        + 24.81% __schedule
> ....
> -  15.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 32.87% prepare_to_wait_exclusive
>             bt_get
>             blk_mq_get_tag
>             __blk_mq_alloc_request
>             blk_mq_map_request
>             blk_sq_make_request
>             generic_make_request
>           + submit_bio
>        - 29.21% virtio_queue_rq
>             __blk_mq_run_hw_queue
>        + 11.69% complete
>        + 8.21% finish_wait
>          8.10% remove_wait_queue
>
> But the IOPS rate has definitely increased with this config
> - I just saw 90k, 100k and 110k IOPS in the last 3 iterations of the
> workload (the above profile is from the 100k IOPS period). However,
> the wall time was still only 3m58s, which again tends to implicate
> the write() portion of the benchmark in causing the slowdowns
> rather than the fsync() portion that is dispatching all the IO...

Some contention for this case is hard to avoid, and the above looks 
better than 3.15 does. So the big question is whether it's worth fixing 
the gaps with multiple waitqueues (and if that actually still buys us 
anything), or whether we should just disable them.

If I can get you to try one more thing, can you apply this patch and 
give that a whirl? Get rid of the other patches I sent first; this one has 
everything.

-- 
Jens Axboe


[-- Attachment #2: wake-all.patch --]
[-- Type: text/x-patch, Size: 4641 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1aab39f71d95..d376669769e7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -43,9 +43,16 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 	return bt_has_free_tags(&tags->bitmap_tags);
 }
 
-static inline void bt_index_inc(unsigned int *index)
+static inline int bt_index_inc(int index)
 {
-	*index = (*index + 1) & (BT_WAIT_QUEUES - 1);
+	return (index + 1) & (BT_WAIT_QUEUES - 1);
+}
+
+static inline void bt_index_atomic_inc(atomic_t *index)
+{
+	int old = atomic_read(index);
+	int new = bt_index_inc(old);
+	atomic_cmpxchg(index, old, new);
 }
 
 /*
@@ -69,14 +76,14 @@ static void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags)
 	int i, wake_index;
 
 	bt = &tags->bitmap_tags;
-	wake_index = bt->wake_index;
+	wake_index = atomic_read(&bt->wake_index);
 	for (i = 0; i < BT_WAIT_QUEUES; i++) {
 		struct bt_wait_state *bs = &bt->bs[wake_index];
 
 		if (waitqueue_active(&bs->wait))
-			wake_up(&bs->wait);
+			wake_up_all(&bs->wait);
 
-		bt_index_inc(&wake_index);
+		wake_index = bt_index_inc(wake_index);
 	}
 }
 
@@ -212,12 +219,14 @@ static struct bt_wait_state *bt_wait_ptr(struct blk_mq_bitmap_tags *bt,
 					 struct blk_mq_hw_ctx *hctx)
 {
 	struct bt_wait_state *bs;
+	int wait_index;
 
 	if (!hctx)
 		return &bt->bs[0];
 
-	bs = &bt->bs[hctx->wait_index];
-	bt_index_inc(&hctx->wait_index);
+	wait_index = atomic_read(&hctx->wait_index);
+	bs = &bt->bs[wait_index];
+	bt_index_atomic_inc(&hctx->wait_index);
 	return bs;
 }
 
@@ -239,18 +248,13 @@ static int bt_get(struct blk_mq_alloc_data *data,
 
 	bs = bt_wait_ptr(bt, hctx);
 	do {
-		bool was_empty;
-
-		was_empty = list_empty(&wait.task_list);
-		prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait_exclusive(&bs->wait, &wait,
+						TASK_UNINTERRUPTIBLE);
 
 		tag = __bt_get(hctx, bt, last_tag);
 		if (tag != -1)
 			break;
 
-		if (was_empty)
-			atomic_set(&bs->wait_cnt, bt->wake_cnt);
-
 		blk_mq_put_ctx(data->ctx);
 
 		io_schedule();
@@ -313,18 +317,19 @@ static struct bt_wait_state *bt_wake_ptr(struct blk_mq_bitmap_tags *bt)
 {
 	int i, wake_index;
 
-	wake_index = bt->wake_index;
+	wake_index = atomic_read(&bt->wake_index);
 	for (i = 0; i < BT_WAIT_QUEUES; i++) {
 		struct bt_wait_state *bs = &bt->bs[wake_index];
 
 		if (waitqueue_active(&bs->wait)) {
-			if (wake_index != bt->wake_index)
-				bt->wake_index = wake_index;
+			int o = atomic_read(&bt->wake_index);
+			if (wake_index != o)
+				atomic_cmpxchg(&bt->wake_index, o, wake_index);
 
 			return bs;
 		}
 
-		bt_index_inc(&wake_index);
+		wake_index = bt_index_inc(wake_index);
 	}
 
 	return NULL;
@@ -334,6 +339,7 @@ static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag)
 {
 	const int index = TAG_TO_INDEX(bt, tag);
 	struct bt_wait_state *bs;
+	int wait_cnt;
 
 	/*
 	 * The unlock memory barrier need to order access to req in free
@@ -342,10 +348,19 @@ static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag)
 	clear_bit_unlock(TAG_TO_BIT(bt, tag), &bt->map[index].word);
 
 	bs = bt_wake_ptr(bt);
-	if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
-		atomic_set(&bs->wait_cnt, bt->wake_cnt);
-		bt_index_inc(&bt->wake_index);
-		wake_up(&bs->wait);
+	if (!bs)
+		return;
+
+	wait_cnt = atomic_dec_return(&bs->wait_cnt);
+	if (wait_cnt == 0) {
+wake:
+		atomic_add(bt->wake_cnt, &bs->wait_cnt);
+		bt_index_atomic_inc(&bt->wake_index);
+		wake_up_nr(&bs->wait, bt->wake_cnt);
+	} else if (wait_cnt < 0) {
+		wait_cnt = atomic_inc_return(&bs->wait_cnt);
+		if (!wait_cnt)
+			goto wake;
 	}
 }
 
@@ -499,10 +514,13 @@ static int bt_alloc(struct blk_mq_bitmap_tags *bt, unsigned int depth,
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < BT_WAIT_QUEUES; i++)
+	bt_update_count(bt, depth);
+
+	for (i = 0; i < BT_WAIT_QUEUES; i++) {
 		init_waitqueue_head(&bt->bs[i].wait);
+		atomic_set(&bt->bs[i].wait_cnt, bt->wake_cnt);
+	}
 
-	bt_update_count(bt, depth);
 	return 0;
 }
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 98696a65d4d4..6206ed17ef76 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -24,7 +24,7 @@ struct blk_mq_bitmap_tags {
 	unsigned int map_nr;
 	struct blk_align_bitmap *map;
 
-	unsigned int wake_index;
+	atomic_t wake_index;
 	struct bt_wait_state *bs;
 };
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a002cf191427..eb726b9c5762 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -42,7 +42,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		nr_ctx;
 	struct blk_mq_ctx	**ctxs;
 
-	unsigned int		wait_index;
+	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread
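
The behavioural core of the wake-all patch above: tag waiters now queue
exclusively, and they are released in batches. Every freed tag decrements a
per-waitqueue counter, and only once wake_cnt frees have accumulated does a
single wake_up_nr() wake a batch of waiters, re-arming the counter for the
next batch. A compressed userspace sketch of that accounting (the names and
C11 atomics are illustrative, not the blk-mq code):

#include <stdatomic.h>

struct bt_ws {
	atomic_int wait_cnt;	/* counts down once per freed tag */
	int wake_cnt;		/* how many waiters to wake per batch */
};

/* called once for every tag that is freed */
void tag_freed(struct bt_ws *ws)
{
	int left = atomic_fetch_sub(&ws->wait_cnt, 1) - 1;

	if (left == 0) {
		/* a whole batch of tags is free: re-arm the counter and
		 * wake wake_cnt exclusive waiters in one go */
		atomic_fetch_add(&ws->wait_cnt, ws->wake_cnt);
		/* wake_up_nr(&bs->wait, bt->wake_cnt) in the real patch */
	} else if (left < 0) {
		/* several CPUs raced past zero: give the decrement back,
		 * and whichever CPU lands exactly on zero does the wake,
		 * mirroring the wait_cnt < 0 branch in bt_clear_tag() */
		if (atomic_fetch_add(&ws->wait_cnt, 1) + 1 == 0) {
			atomic_fetch_add(&ws->wait_cnt, ws->wake_cnt);
			/* wake_up_nr(...) as above */
		}
	}
}

The exclusive waits are presumably what pulled the context switch rate back
down earlier in the thread: each wake_up_nr() now releases only wake_cnt
waiters instead of a plain wake_up() waking every non-exclusive waiter queued
on that waitqueue.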


* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
@ 2014-06-18  3:20                               ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2014-06-18  3:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-fsdevel, linux-man, xfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 7927 bytes --]

On 2014-06-17 20:13, Dave Chinner wrote:
> On Tue, Jun 17, 2014 at 07:24:10PM -0700, Jens Axboe wrote:
>> On 2014-06-17 17:28, Dave Chinner wrote:
>>> [cc linux-mm]
>>>
>>> On Tue, Jun 17, 2014 at 07:23:58AM -0600, Jens Axboe wrote:
>>>> On 2014-06-16 16:27, Dave Chinner wrote:
>>>>> On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
>>>>>> On 06/16/2014 01:19 AM, Dave Chinner wrote:
>>>>>>> On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
>>>>>>>> On 2014-06-15 20:00, Dave Chinner wrote:
>>>>>>>>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>>>>>>>>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
>>>>>>>>> running isn't anything related to XFS.  The async fsync workqueue
>>>>>>>>> results in several thousand worker threads dispatching IO
>>>>>>>>> concurrently across 16 CPUs:
> ....
>>>>>>>>> I know that the tag allocator has been rewritten, so I tested
>>>>>>>>> against a current a current Linus kernel with the XFS aio-fsync
>>>>>>>>> patch. The results are all over the place - from several sequential
>>>>>>>>> runs of the same test (removing the files in between so each tests
>>>>>>>>> starts from an empty fs):
>>>>>>>>>
>>>>>>>>> Wall time	sys time	IOPS	 files/s
>>>>>>>>> 4m58.151s	11m12.648s	30,000	 13,500
>>>>>>>>> 4m35.075s	12m45.900s	45,000	 15,000
>>>>>>>>> 3m10.665s	11m15.804s	65,000	 21,000
>>>>>>>>> 3m27.384s	11m54.723s	85,000	 20,000
>>>>>>>>> 3m59.574s	11m12.012s	50,000	 16,500
>>>>>>>>> 4m12.704s	12m15.720s	50,000	 17,000
>>>>>>>>>
>>>>>>>>> The 3.15 based kernel was pretty consistent around the 4m10 mark,
>>>>>>>>> generally only +/-10s in runtime and not much change in system time.
>>>>>>>>> The files/s rate reported by fs_mark doesn't vary that much, either.
>>>>>>>>> So the new tag allocator seems to be no better in terms of IO
>>>>>>>>> dispatch scalability, yet adds significant variability to IO
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> What I noticed is a massive jump in context switch overhead: from
>>>>>>>>> around 250,000/s to over 800,000/s and the CPU profiles show that
>>>>>>>>> this comes from the new tag allocator:
> ....
>>>>>> Can you try with this patch?
>>>>>
>>>>> Ok, context switches are back down in the realm of 400,000/s. It's
>>>>> better, but it's still a bit higher than that the 3.15 code. XFS is
>>>>> actually showing up in the context switch path profiles now...
>>>>>
>>>>> However, performance is still excitingly variable and not much
>>>>> different to not having this patch applied. System time is unchanged
>>>>> (still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
>>>>> show significant variance (at least +/-25%) from run to run. The
>>>>> worst case is not as slow as the unpatched kernel, but it's no
>>>>> better than the 3.15 worst case.
> ....
>>>>> Looks like the main contention problem is in blk_sq_make_request().
>>>>> Also, there looks to be quite a bit of lock contention on the tag
>>>>> wait queues given that this patch made prepare_to_wait_exclusive()
>>>>> suddenly show up in the profiles.
>>>>>
>>>>> FWIW, on a fast run there is very little time in
>>>>> blk_sq_make_request() lock contention, and overall spin lock/unlock
>>>>> overhead of these two functions is around 10% each....
>>>>>
>>>>> So, yes, the patch reduces context switches but doesn't really
>>>>> reduce system time, improve performance noticably or address the
>>>>> run-to-run variability issue...
>>>>
>>>> OK, so one more thing to try. With the same patch still applied,
>>>> could you edit block/blk-mq-tag.h and change
>>>>
>>>>          BT_WAIT_QUEUES  = 8,
>>>>
>>>> to
>>>>
>>>>          BT_WAIT_QUEUES  = 1,
>>>>
>>>> and see if that smoothes things out?
>>>
>>> Ok, that smoothes things out to the point where I can see the
>>> trigger for the really nasty variable performance. The trigger is
>>> the machine running out of free memory. i.e. direct reclaim of clean
>>> pages for the data in the new files in the page cache drives the
>>> performance down by 25-50% and introduces significant variability.
>>>
>>> So the variability doesn't seem to be solely related to the tag
>>> allocator; it is contributing some via wait queue contention,
>>> but it's definitely not the main contributor, nor the trigger...
>>>
>>> MM-folk - the VM is running fake-numa=4 and has 16GB of RAM, and
>>> each step in the workload is generating 3.2GB of dirty pages (i.e.
>>> just on the dirty throttling threshold). It then does a concurrent
>>> asynchronous fsync of the 800,000 dirty files it just created,
>>> leaving 3.2GB of clean pages in the cache. The workload iterates
>>> this several times. Once the machine runs out of free memory (2.5
>>> iterations in) performance drops by about 30% on average, but the
>>> drop varies between 20-60% randomly. I'm not concerned by a 30% drop
>>> when memory fills up - I'm concerned by the volatility of the drop
>>> that occurs. e.g:
>>>
>>> FSUse%        Count         Size    Files/sec     App Overhead
>>>       0       800000         4096      29938.0         13459475
>>>       0      1600000         4096      28023.7         15662387
>>>       0      2400000         4096      23704.6         16451761
>>>       0      3200000         4096      16976.8         15029056
>>>       0      4000000         4096      21858.3         15591604
>>>
>>> Iteration 3 is where memory fills, and you can see that performance
>>> dropped by 25%. Iteration 4 drops another 25%, then iteration 5
>>> regains it. If I keep running the workload for more iterations, this
>>> is pretty typical of the iteration-to-iteration variability, even
>>> though every iteration is identical in behaviour as are the initial
>>> conditions (i.e. memory full of clean, used-once pages).
>>>
>>> This didn't happen in 3.15.0, but the behaviour may have been masked
>>> by the block layer tag allocator CPU overhead dominating the system
>>> behaviour.
>>
>> OK, that's reassuring. I'll do some testing with the cyclic wait
>> queues, but probably not until Thursday. Alexanders patches might
>> potentially fix the variability as well, but if we can make-do
>> without the multiple wait queues, I'd much rather just kill it.
>>
>> Did you see any spinlock contention with BT_WAIT_QUEUES = 1?
>
> Yes. During the 15-20s of high IOPS dispatch rates the profile looks
> like this:
>
> -  36.00%  [kernel]  [k] _raw_spin_unlock_irq
>     - _raw_spin_unlock_irq
>        - 69.72% blk_sq_make_request
>             generic_make_request
>           + submit_bio
>        + 24.81% __schedule
> ....
> -  15.00%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 32.87% prepare_to_wait_exclusive
>             bt_get
>             blk_mq_get_tag
>             __blk_mq_alloc_request
>             blk_mq_map_request
>             blk_sq_make_request
>             generic_make_request
>           + submit_bio
>        - 29.21% virtio_queue_rq
>             __blk_mq_run_hw_queue
>        + 11.69% complete
>        + 8.21% finish_wait
>          8.10% remove_wait_queue
>
> But the IOPS rate has definitely increased with this config
> - I just saw 90k, 100k and 110k IOPS in the last 3 iterations of the
> workload (the above profile is from the 100k IOPS period). However,
> the wall time was still only 3m58s, which again tends to implicate
> the write() portion of the benchmark for causing the slowdowns
> rather than the fsync() portion that is dispatching all the IO...

Some contention for this case is hard to avoid, and the above looks 
better than 3.15 does. So the big question is whether it's worth fixing 
the gaps with multiple waitqueues (and if that actually still buys us 
anything), or whether we should just disable them.

If I can get you to try one more thing, can you apply this patch and 
give that a whirl? Get rid of the other patches I sent first; this 
one has everything.
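
The main behavioural change here is the exclusive waits plus batched
wakeups on tag free; a rough sketch of the free-side logic (simplified
illustration only, not the literal blk-mq code):

	#include <linux/atomic.h>
	#include <linux/wait.h>

	struct bt_wait_state {
		atomic_t		wait_cnt;
		wait_queue_head_t	wait;
	};

	/*
	 * Waiters queue with prepare_to_wait_exclusive(); every freed
	 * tag decrements wait_cnt, and only once wake_cnt frees have
	 * accumulated do we wake wake_cnt waiters in one batch, which
	 * keeps the wakeup/context-switch rate down.
	 */
	static void tag_freed(struct bt_wait_state *bs, unsigned int wake_cnt)
	{
		if (atomic_dec_return(&bs->wait_cnt) == 0) {
			atomic_add(wake_cnt, &bs->wait_cnt);	/* refill batch */
			wake_up_nr(&bs->wait, wake_cnt);	/* wake a batch */
		}
	}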

-- 
Jens Axboe


[-- Attachment #2: wake-all.patch --]
[-- Type: text/x-patch, Size: 4641 bytes --]

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1aab39f71d95..d376669769e7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -43,9 +43,16 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 	return bt_has_free_tags(&tags->bitmap_tags);
 }
 
-static inline void bt_index_inc(unsigned int *index)
+static inline int bt_index_inc(int index)
 {
-	*index = (*index + 1) & (BT_WAIT_QUEUES - 1);
+	return (index + 1) & (BT_WAIT_QUEUES - 1);
+}
+
+static inline void bt_index_atomic_inc(atomic_t *index)
+{
+	int old = atomic_read(index);
+	int new = bt_index_inc(old);
+	atomic_cmpxchg(index, old, new);
 }
 
 /*
@@ -69,14 +76,14 @@ static void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags)
 	int i, wake_index;
 
 	bt = &tags->bitmap_tags;
-	wake_index = bt->wake_index;
+	wake_index = atomic_read(&bt->wake_index);
 	for (i = 0; i < BT_WAIT_QUEUES; i++) {
 		struct bt_wait_state *bs = &bt->bs[wake_index];
 
 		if (waitqueue_active(&bs->wait))
-			wake_up(&bs->wait);
+			wake_up_all(&bs->wait);
 
-		bt_index_inc(&wake_index);
+		wake_index = bt_index_inc(wake_index);
 	}
 }
 
@@ -212,12 +219,14 @@ static struct bt_wait_state *bt_wait_ptr(struct blk_mq_bitmap_tags *bt,
 					 struct blk_mq_hw_ctx *hctx)
 {
 	struct bt_wait_state *bs;
+	int wait_index;
 
 	if (!hctx)
 		return &bt->bs[0];
 
-	bs = &bt->bs[hctx->wait_index];
-	bt_index_inc(&hctx->wait_index);
+	wait_index = atomic_read(&hctx->wait_index);
+	bs = &bt->bs[wait_index];
+	bt_index_atomic_inc(&hctx->wait_index);
 	return bs;
 }
 
@@ -239,18 +248,13 @@ static int bt_get(struct blk_mq_alloc_data *data,
 
 	bs = bt_wait_ptr(bt, hctx);
 	do {
-		bool was_empty;
-
-		was_empty = list_empty(&wait.task_list);
-		prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait_exclusive(&bs->wait, &wait,
+						TASK_UNINTERRUPTIBLE);
 
 		tag = __bt_get(hctx, bt, last_tag);
 		if (tag != -1)
 			break;
 
-		if (was_empty)
-			atomic_set(&bs->wait_cnt, bt->wake_cnt);
-
 		blk_mq_put_ctx(data->ctx);
 
 		io_schedule();
@@ -313,18 +317,19 @@ static struct bt_wait_state *bt_wake_ptr(struct blk_mq_bitmap_tags *bt)
 {
 	int i, wake_index;
 
-	wake_index = bt->wake_index;
+	wake_index = atomic_read(&bt->wake_index);
 	for (i = 0; i < BT_WAIT_QUEUES; i++) {
 		struct bt_wait_state *bs = &bt->bs[wake_index];
 
 		if (waitqueue_active(&bs->wait)) {
-			if (wake_index != bt->wake_index)
-				bt->wake_index = wake_index;
+			int o = atomic_read(&bt->wake_index);
+			if (wake_index != o)
+				atomic_cmpxchg(&bt->wake_index, o, wake_index);
 
 			return bs;
 		}
 
-		bt_index_inc(&wake_index);
+		wake_index = bt_index_inc(wake_index);
 	}
 
 	return NULL;
@@ -334,6 +339,7 @@ static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag)
 {
 	const int index = TAG_TO_INDEX(bt, tag);
 	struct bt_wait_state *bs;
+	int wait_cnt;
 
 	/*
 	 * The unlock memory barrier need to order access to req in free
@@ -342,10 +348,19 @@ static void bt_clear_tag(struct blk_mq_bitmap_tags *bt, unsigned int tag)
 	clear_bit_unlock(TAG_TO_BIT(bt, tag), &bt->map[index].word);
 
 	bs = bt_wake_ptr(bt);
-	if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
-		atomic_set(&bs->wait_cnt, bt->wake_cnt);
-		bt_index_inc(&bt->wake_index);
-		wake_up(&bs->wait);
+	if (!bs)
+		return;
+
+	wait_cnt = atomic_dec_return(&bs->wait_cnt);
+	if (wait_cnt == 0) {
+wake:
+		atomic_add(bt->wake_cnt, &bs->wait_cnt);
+		bt_index_atomic_inc(&bt->wake_index);
+		wake_up_nr(&bs->wait, bt->wake_cnt);
+	} else if (wait_cnt < 0) {
+		wait_cnt = atomic_inc_return(&bs->wait_cnt);
+		if (!wait_cnt)
+			goto wake;
 	}
 }
 
@@ -499,10 +514,13 @@ static int bt_alloc(struct blk_mq_bitmap_tags *bt, unsigned int depth,
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < BT_WAIT_QUEUES; i++)
+	bt_update_count(bt, depth);
+
+	for (i = 0; i < BT_WAIT_QUEUES; i++) {
 		init_waitqueue_head(&bt->bs[i].wait);
+		atomic_set(&bt->bs[i].wait_cnt, bt->wake_cnt);
+	}
 
-	bt_update_count(bt, depth);
 	return 0;
 }
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 98696a65d4d4..6206ed17ef76 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -24,7 +24,7 @@ struct blk_mq_bitmap_tags {
 	unsigned int map_nr;
 	struct blk_align_bitmap *map;
 
-	unsigned int wake_index;
+	atomic_t wake_index;
 	struct bt_wait_state *bs;
 };
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a002cf191427..eb726b9c5762 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -42,7 +42,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		nr_ctx;
 	struct blk_mq_ctx	**ctxs;
 
-	unsigned int		wait_index;
+	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
@ 2014-06-18  5:02                                   ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-18  5:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, xfs, linux-mm, linux-man

On Tue, Jun 17, 2014 at 08:20:55PM -0700, Jens Axboe wrote:
> On 2014-06-17 20:13, Dave Chinner wrote:
> >On Tue, Jun 17, 2014 at 07:24:10PM -0700, Jens Axboe wrote:
> >>On 2014-06-17 17:28, Dave Chinner wrote:
> >>>[cc linux-mm]
> >>>
> >>>On Tue, Jun 17, 2014 at 07:23:58AM -0600, Jens Axboe wrote:
> >>>>On 2014-06-16 16:27, Dave Chinner wrote:
> >>>>>On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
> >>>>>>On 06/16/2014 01:19 AM, Dave Chinner wrote:
> >>>>>>>On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
> >>>>>>>>On 2014-06-15 20:00, Dave Chinner wrote:
> >>>>>>>>>On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> >>>>>>>>>FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> >>>>>>>>>running isn't anything related to XFS.  The async fsync workqueue
> >>>>>>>>>results in several thousand worker threads dispatching IO
> >>>>>>>>>concurrently across 16 CPUs:
> >....
> >>>>>>>>>I know that the tag allocator has been rewritten, so I tested
> >>>>>>>>>against a current a current Linus kernel with the XFS aio-fsync
> >>>>>>>>>patch. The results are all over the place - from several sequential
> >>>>>>>>>runs of the same test (removing the files in between so each tests
> >>>>>>>>>starts from an empty fs):
> >>>>>>>>>
> >>>>>>>>>Wall time	sys time	IOPS	 files/s
> >>>>>>>>>4m58.151s	11m12.648s	30,000	 13,500
> >>>>>>>>>4m35.075s	12m45.900s	45,000	 15,000
> >>>>>>>>>3m10.665s	11m15.804s	65,000	 21,000
> >>>>>>>>>3m27.384s	11m54.723s	85,000	 20,000
> >>>>>>>>>3m59.574s	11m12.012s	50,000	 16,500
> >>>>>>>>>4m12.704s	12m15.720s	50,000	 17,000

....
> >But the IOPS rate has definitely increased with this config
> >- I just saw 90k, 100k and 110k IOPS in the last 3 iterations of the
> >workload (the above profile is from the 100k IOPS period). However,
> >the wall time was still only 3m58s, which again tends to implicate
> >the write() portion of the benchmark for causing the slowdowns
> >rather than the fsync() portion that is dispatching all the IO...
> 
> Some contention for this case is hard to avoid, and the above looks
> better than 3.15 does. So the big question is whether it's worth
> fixing the gaps with multiple waitqueues (and if that actually still
> buys us anything), or whether we should just disable them.
> 
> If I can get you to try one more thing, can you apply this patch and
> give that a whirl? Get rid of the other patches I sent first, this
> has everything.

Not much difference in the CPU usage profiles or baseline
performance. It runs at 3m10s from empty memory, and ~3m45s when
memory starts full of clean pages. System time varies from 10m40s to
12m55s with no real correlation to overall runtime.

From observation of all the performance metrics I graph in real
time, however, the pattern of the peaks and troughs from run to run
and even iteration to iteration is much more regular than with the
previous patches. So from that perspective it is an improvement.
Again, all the variability in the graphs shows up when free memory
runs out...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] [RFC] xfs: wire up aio_fsync method
@ 2014-06-18  6:13                                     ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2014-06-18  6:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-fsdevel, linux-man, linux-mm, xfs

On Wed, Jun 18, 2014 at 03:02:30PM +1000, Dave Chinner wrote:
> On Tue, Jun 17, 2014 at 08:20:55PM -0700, Jens Axboe wrote:
> > On 2014-06-17 20:13, Dave Chinner wrote:
> > >On Tue, Jun 17, 2014 at 07:24:10PM -0700, Jens Axboe wrote:
> > >>On 2014-06-17 17:28, Dave Chinner wrote:
> > >>>[cc linux-mm]
> > >>>
> > >>>On Tue, Jun 17, 2014 at 07:23:58AM -0600, Jens Axboe wrote:
> > >>>>On 2014-06-16 16:27, Dave Chinner wrote:
> > >>>>>On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
> > >>>>>>On 06/16/2014 01:19 AM, Dave Chinner wrote:
> > >>>>>>>On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
> > >>>>>>>>On 2014-06-15 20:00, Dave Chinner wrote:
> > >>>>>>>>>On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
> > >>>>>>>>>FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> > >>>>>>>>>running isn't anything related to XFS.  The async fsync workqueue
> > >>>>>>>>>results in several thousand worker threads dispatching IO
> > >>>>>>>>>concurrently across 16 CPUs:
> > >....
> > >>>>>>>>>I know that the tag allocator has been rewritten, so I tested
> > >>>>>>>>>against a current a current Linus kernel with the XFS aio-fsync
> > >>>>>>>>>patch. The results are all over the place - from several sequential
> > >>>>>>>>>runs of the same test (removing the files in between so each tests
> > >>>>>>>>>starts from an empty fs):
> > >>>>>>>>>
> > >>>>>>>>>Wall time	sys time	IOPS	 files/s
> > >>>>>>>>>4m58.151s	11m12.648s	30,000	 13,500
> > >>>>>>>>>4m35.075s	12m45.900s	45,000	 15,000
> > >>>>>>>>>3m10.665s	11m15.804s	65,000	 21,000
> > >>>>>>>>>3m27.384s	11m54.723s	85,000	 20,000
> > >>>>>>>>>3m59.574s	11m12.012s	50,000	 16,500
> > >>>>>>>>>4m12.704s	12m15.720s	50,000	 17,000
> 
> ....
> > >But the IOPS rate has definitely increased with this config
> > >- I just saw 90k, 100k and 110k IOPS in the last 3 iterations of the
> > >workload (the above profile is from the 100k IOPS period). However,
> > >the wall time was still only 3m58s, which again tends to implicate
> > >the write() portion of the benchmark for causing the slowdowns
> > >rather than the fsync() portion that is dispatching all the IO...
> > 
> > Some contention for this case is hard to avoid, and the above looks
> > better than 3.15 does. So the big question is whether it's worth
> > fixing the gaps with multiple waitqueues (and if that actually still
> > buys us anything), or whether we should just disable them.
> > 
> > If I can get you to try one more thing, can you apply this patch and
> > give that a whirl? Get rid of the other patches I sent first, this
> > has everything.
> 
> Not much difference in the CPU usage profiles or base line
> performance. It runs at 3m10s from empty memory, and ~3m45s when
> memory starts full of clean pages. system time varies from 10m40s to
> 12m55s with no real correlation to overall runtime.
> 
> From observation of all the performance metrics I graph in real
> time, however, the pattern of the peaks and troughs from run to run
> and even iteration to iteration is much more regular than the
> previous patches. So from that perspective it is an improvement.
> Again, all the variability in the graphs show up when free memory
> runs out...

And I've identified the commit that caused the memory reclaim
behaviour to go south:

commit 1f6d64829db78a7e1d63e15c9f48f0a5d2b5a679
Author: Dave Chinner <dchinner@redhat.com>
Date:   Fri Jun 6 15:59:59 2014 +1000

    xfs: block allocation work needs to be kswapd aware
    
    Upon memory pressure, kswapd calls xfs_vm_writepage() from
    shrink_page_list(). This can result in delayed allocation occurring
    and that gets deferred to the the allocation workqueue.
    
    The allocation then runs outside kswapd context, which means if it
    needs memory (and it does to demand page metadata from disk) it can
    block in shrink_inactive_list() waiting for IO congestion. These
    blocking waits are normally avoiding in kswapd context, so under
    memory pressure writeback from kswapd can be arbitrarily delayed by
    memory reclaim.
    
    To avoid this, pass the kswapd context to the allocation being done
    by the workqueue, so that memory reclaim understands correctly that
    the work is being done for kswapd and therefore it is not blocked
    and does not delay memory reclaim.
    
    To avoid issues with int->char conversion of flag fields (as noticed
    in v1 of this patch) convert the flag fields in the struct
    xfs_bmalloca to bool types. pahole indicates these variables are
    still single byte variables, so no extra space is consumed by this
    change.
    
    cc: <stable@vger.kernel.org>
    Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
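
The technique the commit describes boils down to recording whether the
task queueing the allocation work is kswapd, and having the worker
temporarily run with PF_KSWAPD set so memory reclaim treats it as
kswapd context. A rough sketch of that pattern (hypothetical names,
not the actual XFS code):

	#include <linux/sched.h>
	#include <linux/swap.h>		/* current_is_kswapd() */
	#include <linux/workqueue.h>

	struct alloc_work {
		struct work_struct	work;
		bool			kswapd;	/* submitter was kswapd */
	};

	static void alloc_worker(struct work_struct *work)
	{
		struct alloc_work *aw = container_of(work, struct alloc_work, work);

		if (aw->kswapd)
			current->flags |= PF_KSWAPD;	/* run "as" kswapd */

		/* ... do the deferred block allocation ... */

		if (aw->kswapd)
			current->flags &= ~PF_KSWAPD;
	}

	static void queue_alloc(struct workqueue_struct *wq, struct alloc_work *aw)
	{
		aw->kswapd = current_is_kswapd();
		INIT_WORK(&aw->work, alloc_worker);
		queue_work(wq, &aw->work);
	}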

Reverting this patch results in runtimes of between 3m and 3m10s
regardless of the amount of free memory when the test starts.

I'm probably going to have to revert this and make sure it stays out
of the stable kernels now - I think that unbalancing memory reclaim
and introducing performance degradations of 25-30% to work around a
problem that is only hit by an extreme memory pressure stress test
is a bad trade-off.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread

Thread overview: 39+ messages
2014-06-12  8:34 [PATCH] [RFC] xfs: wire up aio_fsync method Dave Chinner
2014-06-12  8:56 ` [PATCH] [RFC] fs_mark: add asynchronous fsync Dave Chinner
2014-06-12 14:13 ` [PATCH] [RFC] xfs: wire up aio_fsync method Christoph Hellwig
2014-06-12 23:44   ` Dave Chinner
2014-06-13 16:23     ` Christoph Hellwig
2014-06-15 22:33       ` Dave Chinner
2014-06-15 22:33         ` Dave Chinner
2014-06-16  2:00         ` Dave Chinner
2014-06-16  2:00           ` Dave Chinner
2014-06-16  2:58           ` Jens Axboe
2014-06-16  2:58             ` Jens Axboe
     [not found]             ` <539E5D66.8040605-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-16  7:19               ` Dave Chinner
2014-06-16  7:19                 ` Dave Chinner
2014-06-16 19:30                 ` Jens Axboe
2014-06-16 19:30                   ` Jens Axboe
2014-06-16 22:27                   ` Dave Chinner
2014-06-17 13:23                     ` Jens Axboe
2014-06-17 13:23                       ` Jens Axboe
2014-06-18  0:28                       ` Dave Chinner
2014-06-18  0:28                         ` Dave Chinner
2014-06-18  0:28                         ` Dave Chinner
2014-06-18  2:24                         ` Jens Axboe
2014-06-18  2:24                           ` Jens Axboe
2014-06-18  2:24                           ` Jens Axboe
2014-06-18  3:13                           ` Dave Chinner
2014-06-18  3:13                             ` Dave Chinner
2014-06-18  3:20                             ` Jens Axboe
2014-06-18  3:20                               ` Jens Axboe
2014-06-18  3:20                               ` Jens Axboe
     [not found]                               ` <53A10597.6020707-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-18  5:02                                 ` Dave Chinner
2014-06-18  5:02                                   ` Dave Chinner
2014-06-18  5:02                                   ` Dave Chinner
2014-06-18  6:13                                   ` Dave Chinner
2014-06-18  6:13                                     ` Dave Chinner
     [not found]       ` <20140613162352.GB23394-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-06-16 21:06         ` Michael Kerrisk (man-pages)
2014-06-16 21:06           ` Michael Kerrisk (man-pages)
2014-06-17 14:01           ` Christoph Hellwig
2014-06-12 15:24 ` Brian Foster
2014-06-12 23:57   ` Dave Chinner
