linux-kernel.vger.kernel.org archive mirror
* [PATCH] fadvise: perform WILLNEED readahead in a workqueue
@ 2012-12-15  0:54 Eric Wong
  2012-12-15 22:34 ` Alan Cox
  2012-12-16  2:45 ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Eric Wong @ 2012-12-15  0:54 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Applications streaming large files may want to reduce disk spinups and
I/O latency by performing large amounts of readahead up front.
Applications also tend to read files soon after opening them, so waiting
on a slow fadvise may cause unpleasant latency when the application
starts reading the file.

As a userspace hacker, I'm sometimes tempted to create a background
thread in my app to run readahead().  However, I believe doing this
in the kernel will make life easier for other userspace hackers.

Since fadvise makes no guarantees about when (or even if) readahead
is performed, this change should not hurt existing applications.

"strace -T" timing on an uncached, one gigabyte file:

 Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
  After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>

Signed-off-by: Eric Wong <normalperson@yhbt.net>
---
 N.B.: I'm not sure if I'm misusing any kernel APIs here.  I managed to
 compile, boot, and run fadvise in a loop without anything blowing up.
 I've verified readahead gets performed via mincore().
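 For reference, the mincore() residency check mentioned above can be done
 from userspace along these lines.  This is only a sketch, not the exact
 program used for verification; the helper name is invented:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count how many pages of `path` are resident in the page cache:
 * mmap() the file, then ask mincore() for a per-page residency vector. */
static size_t resident_pages(const char *path)
{
	struct stat st;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages, i, n = 0;
	unsigned char *vec;
	void *map;
	int fd = open(path, O_RDONLY);

	assert(fd >= 0);
	assert(fstat(fd, &st) == 0);
	npages = ((size_t)st.st_size + page - 1) / page;

	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	assert(map != MAP_FAILED);

	vec = malloc(npages);
	assert(vec != NULL);
	assert(mincore(map, st.st_size, vec) == 0);

	for (i = 0; i < npages; i++)
		n += vec[i] & 1;	/* low bit set => page resident */

	free(vec);
	munmap(map, st.st_size);
	close(fd);
	return n;
}
```

 Running this before and after the fadvise() call (with a sleep in
 between, since the readahead is now asynchronous) shows the resident
 count climbing as the workqueue item runs.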

 If the workqueue approach is acceptable, I'll proceed with
 changing MADV_WILLNEED, too.

 include/linux/mm.h |    3 +++
 mm/fadvise.c       |   10 ++++-----
 mm/readahead.c     |   62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..17ab7d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1473,6 +1473,9 @@ void task_dirty_inc(struct task_struct *tsk);
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 
+void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
+			pgoff_t offset, unsigned long nr_to_read);
+
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
 
diff --git a/mm/fadvise.c b/mm/fadvise.c
index a47f0f5..cf3bd4c 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -102,12 +102,10 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		if (!nrpages)
 			nrpages = ~0UL;
 
-		/*
-		 * Ignore return value because fadvise() shall return
-		 * success even if filesystem can't retrieve a hint,
-		 */
-		force_page_cache_readahead(mapping, f.file, start_index,
-					   nrpages);
+		get_file(f.file); /* fput() is called by workqueue */
+
+		/* queue up the request, don't care if it fails */
+		wq_page_cache_readahead(mapping, f.file, start_index, nrpages);
 		break;
 	case POSIX_FADV_NOREUSE:
 		break;
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..56a80a9 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,27 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/workqueue.h>
+
+static struct workqueue_struct *readahead_wq __read_mostly;
+
+struct wq_ra_req {
+	struct work_struct work;
+	struct address_space *mapping;
+	struct file *file;
+	pgoff_t offset;
+	unsigned long nr_to_read;
+};
+
+static int __init init_readahead_wq(void)
+{
+	readahead_wq = alloc_workqueue("readahead", WQ_UNBOUND,
+					WQ_UNBOUND_MAX_ACTIVE);
+	BUG_ON(!readahead_wq);
+	return 0;
+}
+
+early_initcall(init_readahead_wq);
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -204,6 +225,47 @@ out:
 	return ret;
 }
 
+static void wq_ra_req_fn(struct work_struct *work)
+{
+	struct wq_ra_req *req = container_of(work, struct wq_ra_req, work);
+
+	/* ignore errors, caller wanted fire-and-forget operation */
+	force_page_cache_readahead(req->mapping, req->file,
+				req->offset, req->nr_to_read);
+
+	fput(req->file);
+	kfree(req);
+}
+
+/*
+ * Fire-and-forget readahead using a workqueue: this allocates pages
+ * inside a workqueue and returns as soon as possible.
+ */
+void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
+		pgoff_t offset, unsigned long nr_to_read)
+{
+	struct wq_ra_req *req;
+
+	req = kzalloc(sizeof(*req), GFP_ATOMIC);
+
+	/*
+	 * we are fire-and-forget, not having enough memory means readahead
+	 * is not worth doing anyways
+	 */
+	if (!req) {
+		fput(filp);
+		return;
+	}
+
+	INIT_WORK(&req->work, wq_ra_req_fn);
+	req->mapping = mapping;
+	req->file = filp;
+	req->offset = offset;
+	req->nr_to_read = nr_to_read;
+
+	queue_work(readahead_wq, &req->work);
+}
+
 /*
  * Chunk the readahead into 2 megabyte units, so that we don't pin too much
  * memory at once.
-- 
Eric Wong

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-15  0:54 [PATCH] fadvise: perform WILLNEED readahead in a workqueue Eric Wong
@ 2012-12-15 22:34 ` Alan Cox
  2012-12-16  0:25   ` Eric Wong
  2012-12-16  2:45 ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Alan Cox @ 2012-12-15 22:34 UTC (permalink / raw)
  To: Eric Wong; +Cc: linux-mm, linux-kernel

On Sat, 15 Dec 2012 00:54:48 +0000
Eric Wong <normalperson@yhbt.net> wrote:

> Applications streaming large files may want to reduce disk spinups and
> I/O latency by performing large amounts of readahead up front


How does it compare benchmark wise with a user thread or using the
readahead() call ?


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-15 22:34 ` Alan Cox
@ 2012-12-16  0:25   ` Eric Wong
  2012-12-16  3:03     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2012-12-16  0:25 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-mm, linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> On Sat, 15 Dec 2012 00:54:48 +0000
> Eric Wong <normalperson@yhbt.net> wrote:
> 
> > Applications streaming large files may want to reduce disk spinups and
> > I/O latency by performing large amounts of readahead up front
> 
> How does it compare benchmark wise with a user thread or using the
> readahead() call ?

Very well.

My main concern is for the speed of the initial pread()/read() call
after open().

Setting EARLY_EXIT means my test program _exit()s immediately after the
first pread().  In my test program (below), I wait for the background
thread to become ready before open() so I would not take overhead from
pthread_create() into account.

RA=1 uses a pthread + readahead()
Not setting RA uses fadvise (with my patch)

# readahead + pthread.
$ EARLY_EXIT=1 RA=1 time  ./first_read 1G
0.00user 0.05system 0:01.37elapsed 3%CPU (0avgtext+0avgdata 600maxresident)k
0inputs+0outputs (1major+187minor)pagefaults 0swaps

# patched fadvise
$ EARLY_EXIT=1 time ./first_read 1G
0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 564maxresident)k
0inputs+0outputs (1major+178minor)pagefaults 0swaps

Perhaps I screwed up my readahead() + threads path badly, but there
seems to be a huge benefit in using fadvise with my patch.  I'm not sure
why readahead() + thread does so badly, even...

Even if I badly screwed up my use of readahead(), the benefit of my
patch spares others from screwing up when using threads+readahead() :)

FULL_READ
---------
While full, fast reads are not my target use case, there's no noticeable
regression here, either.  Results for doing a full, fast read on the file
are closer and fluctuate more between runs.

# readahead + pthread.
$ FULL_READ=1 EARLY_EXIT=1 RA=1 time ./first_read 1G
0.01user 1.10system 0:09.24elapsed 12%CPU (0avgtext+0avgdata 596maxresident)k
0inputs+0outputs (1major+186minor)pagefaults 0swaps

# patched fadvise
FULL_READ=1 EARLY_EXIT=1 time ./first_read 1G
0.01user 1.04system 0:09.22elapsed 11%CPU (0avgtext+0avgdata 564maxresident)k
0inputs+0outputs (1major+178minor)pagefaults 0swaps

--------------------------------- 8< ------------------------------
/* gcc -O2 -Wall -lpthread -o first_read first_read.c */
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <assert.h>
#include <sched.h>
#include <sys/eventfd.h>

static int efd1;
static int efd2;

static void * start_ra(void *unused)
{
	struct stat st;
	eventfd_t val;
	int fd;

	/* tell parent to open() */
	assert(eventfd_write(efd1, 1) == 0);

	/* wait for parent to tell us fd is ready */
	assert(eventfd_read(efd2, &val) == 0);
	fd = (int)val;

	assert(fstat(fd, &st) == 0);
	assert(readahead(fd, 0, st.st_size) == 0);

	return NULL;
}

int main(int argc, char *argv[])
{
	char buf[16384];
	pthread_t thr;
	int fd;
	char *do_ra = getenv("RA");

	if (argc != 2) {
		fprintf(stderr, "Usage: strace -T %s LARGE_FILE\n", argv[0]);
		return 1;
	}

	if (do_ra) {
		eventfd_t val;
		efd1 = eventfd(0, 0);
		efd2 = eventfd(0, 0);
		assert(efd1 >= 0 && efd2 >= 0 && "eventfd failed");
		assert(pthread_create(&thr, NULL, start_ra, NULL) == 0);

		/* wait for child thread to spawn */
		assert(eventfd_read(efd1, &val) == 0);
	}

	fd = open(argv[1], O_RDONLY);
	assert(fd >= 0 && "open failed");

	if (do_ra) {
		/* wake up the child thread, give it a chance to run */
		assert(eventfd_write(efd2, fd) == 0);
		sched_yield();
	} else
		assert(posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) == 0);

	assert(pread(fd, buf, sizeof(buf), 0) == sizeof(buf));

	if (getenv("FULL_READ")) {
		ssize_t r;
		do {
			r = read(fd, buf, sizeof(buf));
		} while (r > 0);
		assert(r == 0 && "EOF not reached");
	}

	if (getenv("EARLY_EXIT"))
		_exit(0);

	if (do_ra) {
		assert(pthread_join(thr, NULL) == 0);
		assert(close(efd1) == 0);
		assert(close(efd2) == 0);
	}

	assert(close(fd) == 0);

	return 0;
}
--------------------------------- 8< ------------------------------

Thanks for your interest in this!

-- 
Eric Wong


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-15  0:54 [PATCH] fadvise: perform WILLNEED readahead in a workqueue Eric Wong
  2012-12-15 22:34 ` Alan Cox
@ 2012-12-16  2:45 ` Dave Chinner
  2012-12-16  3:04   ` Eric Wong
  2013-02-22 16:45   ` Phillip Susi
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2012-12-16  2:45 UTC (permalink / raw)
  To: Eric Wong; +Cc: linux-mm, linux-kernel

On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
> Applications streaming large files may want to reduce disk spinups and
> I/O latency by performing large amounts of readahead up front.
> Applications also tend to read files soon after opening them, so waiting
> on a slow fadvise may cause unpleasant latency when the application
> starts reading the file.
> 
> As a userspace hacker, I'm sometimes tempted to create a background
> thread in my app to run readahead().  However, I believe doing this
> in the kernel will make life easier for other userspace hackers.
> 
> Since fadvise makes no guarantees about when (or even if) readahead
> is performed, this change should not hurt existing applications.
> 
> "strace -T" timing on an uncached, one gigabyte file:
> 
>  Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
>   After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>

You've basically asked fadvise() to readahead the entire file if it
can. That means it is likely to issue enough readahead to fill the
IO queue, and that's where all the latency is coming from. If all
you are trying to do is reduce the latency of the first read, then
only readahead the initial range that you are going to need to read...

Also, Pushing readahead off to a workqueue potentially allows
someone to DOS the system because readahead won't ever get throttled
in the syscall context...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  0:25   ` Eric Wong
@ 2012-12-16  3:03     ` Dave Chinner
  2012-12-16  3:35       ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2012-12-16  3:03 UTC (permalink / raw)
  To: Eric Wong; +Cc: Alan Cox, linux-mm, linux-kernel

On Sun, Dec 16, 2012 at 12:25:49AM +0000, Eric Wong wrote:
> Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > On Sat, 15 Dec 2012 00:54:48 +0000
> > Eric Wong <normalperson@yhbt.net> wrote:
> > 
> > > Applications streaming large files may want to reduce disk spinups and
> > > I/O latency by performing large amounts of readahead up front
> > 
> > How does it compare benchmark wise with a user thread or using the
> > readahead() call ?
> 
> Very well.
> 
> My main concern is for the speed of the initial pread()/read() call
> after open().
> 
> Setting EARLY_EXIT means my test program _exit()s immediately after the
> first pread().  In my test program (below), I wait for the background
> thread to become ready before open() so I would not take overhead from
> pthread_create() into account.
> 
> RA=1 uses a pthread + readahead()
> Not setting RA uses fadvise (with my patch)

And if you don't use fadvise/readahead at all?

> # readahead + pthread.
> $ EARLY_EXIT=1 RA=1 time  ./first_read 1G
> 0.00user 0.05system 0:01.37elapsed 3%CPU (0avgtext+0avgdata 600maxresident)k
> 0inputs+0outputs (1major+187minor)pagefaults 0swaps
> 
> # patched fadvise
> $ EARLY_EXIT=1 time ./first_read 1G
> 0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 564maxresident)k
> 0inputs+0outputs (1major+178minor)pagefaults 0swaps

You're not timing how long the first pread() takes at all. You're
timing the entire set of operations, including cloning a thread,
the readahead(2) call, and the messages passed back and forth
through the eventfd interface to read the entire file.

Why even bother with another thread for readahead()? It implements
*exactly* the same operation as fadvise(WILL_NEED) (ie.
force_page_cache_readahead), so should perform identically when
called in exactly the same manner...

But again, you are interested in the latency of the first read of
16k from the file, but you are asking to readahead 1GB of data.
Perhaps you should be asking for readahead of something more
appropriate to what you care about - the first read....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  2:45 ` Dave Chinner
@ 2012-12-16  3:04   ` Eric Wong
  2012-12-16  3:09     ` Eric Wong
  2012-12-16  3:36     ` Dave Chinner
  2013-02-22 16:45   ` Phillip Susi
  1 sibling, 2 replies; 18+ messages in thread
From: Eric Wong @ 2012-12-16  3:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-kernel

Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
> > Applications streaming large files may want to reduce disk spinups and
> > I/O latency by performing large amounts of readahead up front.
> > Applications also tend to read files soon after opening them, so waiting
> > on a slow fadvise may cause unpleasant latency when the application
> > starts reading the file.
> > 
> > As a userspace hacker, I'm sometimes tempted to create a background
> > thread in my app to run readahead().  However, I believe doing this
> > in the kernel will make life easier for other userspace hackers.
> > 
> > Since fadvise makes no guarantees about when (or even if) readahead
> > is performed, this change should not hurt existing applications.
> > 
> > "strace -T" timing on an uncached, one gigabyte file:
> > 
> >  Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
> >   After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> 
> You've basically asked fadvise() to readahead the entire file if it
> can. That means it is likely to issue enough readahead to fill the
> IO queue, and that's where all the latency is coming from. If all
> you are trying to do is reduce the latency of the first read, then
> only readahead the initial range that you are going to need to read...

Yes, I do want to read the whole file, eventually.  So I want to put
the file into the page cache ASAP and allow the disk to spin down.
But I also want the first read() to be fast.

> Also, Pushing readahead off to a workqueue potentially allows
> someone to DOS the system because readahead won't ever get throttled
> in the syscall context...

Yes, I'm a little worried about this, too.
Perhaps squashing something like the following will work?

diff --git a/mm/readahead.c b/mm/readahead.c
index 56a80a9..51dc58e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -246,16 +246,18 @@ void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
 {
 	struct wq_ra_req *req;
 
+	nr_to_read = max_sane_readahead(nr_to_read);
+	if (!nr_to_read)
+		goto skip_ra;
+
 	req = kzalloc(sizeof(*req), GFP_ATOMIC);
 
 	/*
 	 * we are fire-and-forget, not having enough memory means readahead
 	 * is not worth doing anyways
 	 */
-	if (!req) {
-		fput(filp);
-		return;
-	}
+	if (!req)
+		goto skip_ra;
 
 	INIT_WORK(&req->work, wq_ra_req_fn);
 	req->mapping = mapping;
@@ -264,6 +266,9 @@ void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	req->nr_to_read = nr_to_read;
 
 	queue_work(readahead_wq, &req->work);
+
+skip_ra:
+	fput(filp);
 }
 
 /*


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  3:04   ` Eric Wong
@ 2012-12-16  3:09     ` Eric Wong
  2012-12-16  3:36     ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Eric Wong @ 2012-12-16  3:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-kernel

Eric Wong <normalperson@yhbt.net> wrote:
> Perhaps squashing something like the following will work?

Last hunk should've had a return before skip_ra:

--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -264,6 +266,10 @@ void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	req->nr_to_read = nr_to_read;
 
 	queue_work(readahead_wq, &req->work);
+
+	return;
+skip_ra:
+	fput(filp);
 }
 
 /*
-- 
Eric Wong


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  3:03     ` Dave Chinner
@ 2012-12-16  3:35       ` Eric Wong
  2012-12-16  4:15         ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2012-12-16  3:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alan Cox, linux-mm, linux-kernel

Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Dec 16, 2012 at 12:25:49AM +0000, Eric Wong wrote:
> > Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > On Sat, 15 Dec 2012 00:54:48 +0000
> > > Eric Wong <normalperson@yhbt.net> wrote:
> > > 
> > > > Applications streaming large files may want to reduce disk spinups and
> > > > I/O latency by performing large amounts of readahead up front
> > > 
> > > How does it compare benchmark wise with a user thread or using the
> > > readahead() call ?
> > 
> > Very well.
> > 
> > My main concern is for the speed of the initial pread()/read() call
> > after open().
> > 
> > Setting EARLY_EXIT means my test program _exit()s immediately after the
> > first pread().  In my test program (below), I wait for the background
> > thread to become ready before open() so I would not take overhead from
> > pthread_create() into account.
> > 
> > RA=1 uses a pthread + readahead()
> > Not setting RA uses fadvise (with my patch)
> 
> And if you don't use fadvise/readahead at all?

Sorry for the confusion.  I believe my other reply to you summarized
what I wanted to say in my commit message and also reply to Alan.

I want all the following things:

- I want the first read to be fast.
- I want to read the whole file eventually (probably slowly,
  as processing takes a while).
- I want to let my disk spin down for as long as possible.

This could also be a use case for an audio/video player.

> You're not timing how long the first pread() takes at all. You're
> timing the entire set of operations, including cloning a thread and
> for the readahead(2) call and messages to be passed back and forth
> through the eventfd interface to read the entire file.

You're right, I screwed up the measurement.  Using clock_gettime(),
there's hardly a difference between the approaches and I can't
get consistent timings between them.

So no, there's no difference that matters between the approaches.
But I think doing this in the kernel is easier for userspace users.

---------------------------------- 8< ----------------------------
/* gcc -O2 -Wall -lpthread -lrt -o first_read first_read.c */
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <assert.h>
#include <sched.h>
#include <sys/eventfd.h>
#include <time.h>

static int efd1;
static int efd2;

static void clock_diff(struct timespec *a, const struct timespec *b)
{
        a->tv_sec -= b->tv_sec;
        a->tv_nsec -= b->tv_nsec;
        if (a->tv_nsec < 0) {
                --a->tv_sec;
                a->tv_nsec += 1000000000;
        }
}

static void * start_ra(void *unused)
{
	struct stat st;
	eventfd_t val;
	int fd;

	/* tell parent to open() */
	assert(eventfd_write(efd1, 1) == 0);

	/* wait for parent to tell us fd is ready */
	assert(eventfd_read(efd2, &val) == 0);
	fd = (int)val;

	assert(fstat(fd, &st) == 0);
	assert(readahead(fd, 0, st.st_size) == 0);

	return NULL;
}

int main(int argc, char *argv[])
{
	char buf[16384];
	pthread_t thr;
	int fd;
	struct timespec start;
	struct timespec finish;
	char *do_ra = getenv("RA");

	if (argc != 2) {
		fprintf(stderr, "Usage: strace -T %s LARGE_FILE\n", argv[0]);
		return 1;
	}

	if (do_ra) {
		eventfd_t val;
		efd1 = eventfd(0, 0);
		efd2 = eventfd(0, 0);
		assert(efd1 >= 0 && efd2 >= 0 && "eventfd failed");
		assert(pthread_create(&thr, NULL, start_ra, NULL) == 0);

		/* wait for child thread to spawn */
		assert(eventfd_read(efd1, &val) == 0);
	}

	fd = open(argv[1], O_RDONLY);
	assert(fd >= 0 && "open failed");

	assert(clock_gettime(CLOCK_MONOTONIC, &start) == 0);

	if (do_ra) {
		/* wake up the child thread, give it a chance to run */
		assert(eventfd_write(efd2, fd) == 0);
		sched_yield();
	} else
		assert(posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) == 0);

	assert(pread(fd, buf, sizeof(buf), 0) == sizeof(buf));
	assert(clock_gettime(CLOCK_MONOTONIC, &finish) == 0);
	clock_diff(&finish, &start);
	fprintf(stderr, "elapsed: %lu.%09lu\n", finish.tv_sec, finish.tv_nsec);

	if (getenv("FULL_READ")) {
		ssize_t r;
		do {
			r = read(fd, buf, sizeof(buf));
		} while (r > 0);
		assert(r == 0 && "EOF not reached");
	}

	if (getenv("EARLY_EXIT"))
		_exit(0);

	if (do_ra) {
		assert(pthread_join(thr, NULL) == 0);
		assert(close(efd1) == 0);
		assert(close(efd2) == 0);
	}

	assert(close(fd) == 0);

	return 0;
}


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  3:04   ` Eric Wong
  2012-12-16  3:09     ` Eric Wong
@ 2012-12-16  3:36     ` Dave Chinner
  2012-12-16  3:59       ` Eric Wong
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2012-12-16  3:36 UTC (permalink / raw)
  To: Eric Wong; +Cc: linux-mm, linux-kernel

On Sun, Dec 16, 2012 at 03:04:42AM +0000, Eric Wong wrote:
> Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
> > > Applications streaming large files may want to reduce disk spinups and
> > > I/O latency by performing large amounts of readahead up front.
> > > Applications also tend to read files soon after opening them, so waiting
> > > on a slow fadvise may cause unpleasant latency when the application
> > > starts reading the file.
> > > 
> > > As a userspace hacker, I'm sometimes tempted to create a background
> > > thread in my app to run readahead().  However, I believe doing this
> > > in the kernel will make life easier for other userspace hackers.
> > > 
> > > Since fadvise makes no guarantees about when (or even if) readahead
> > > is performed, this change should not hurt existing applications.
> > > 
> > > "strace -T" timing on an uncached, one gigabyte file:
> > > 
> > >  Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
> > >   After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> > 
> > You've basically asked fadvise() to readahead the entire file if it
> > can. That means it is likely to issue enough readahead to fill the
> > IO queue, and that's where all the latency is coming from. If all
> > you are trying to do is reduce the latency of the first read, then
> > only readahead the initial range that you are going to need to read...
> 
> Yes, I do want to read the whole file, eventually.  So I want to put
> the file into the page cache ASAP and allow the disk to spin down.

Issuing readahead is not going to speed up the first read. Either
you will spend more time issuing all the readahead, or you block
waiting for the first read to complete. And the way you are issuing
readahead does not guarantee the entire file is brought into the
page cache....

> But I also want the first read() to be fast.

You can't have a pony, sorry.

> > Also, Pushing readahead off to a workqueue potentially allows
> > someone to DOS the system because readahead won't ever get throttled
> > in the syscall context...
> 
> Yes, I'm a little worried about this, too.
> Perhaps squashing something like the following will work?
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 56a80a9..51dc58e 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -246,16 +246,18 @@ void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  {
>  	struct wq_ra_req *req;
>  
> +	nr_to_read = max_sane_readahead(nr_to_read);
> +	if (!nr_to_read)
> +		goto skip_ra;

You do realise that anything you read ahead will be accounted as
inactive pages, so nr_to_read doesn't decrease at all as you fill
memory with readahead pages...

> +
>  	req = kzalloc(sizeof(*req), GFP_ATOMIC);

GFP_ATOMIC? Really?

In reality, I think you are looking in the wrong place to fix your
"first read" latency problem. No matter what you do, there is going
to be IO latency on the first read. And readahead doesn't guarantee
that the pages are brought into the page cache (ever heard of
readahead thrashing?) so the way you are doing your readahead is not
going to result in you being able to spin the disk down after
issuing a readahead command...

You've really got two problems - minimal initial latency, and
reading the file quickly and pinning it in memory until you get
around to needing it. The first can't be made faster by using
readahead, and the second can not be guaranteed by using readahead.

IOWs, readahead is the wrong tool for solving your problems. Minimal
IO latency from the first read will come from just issuing pread()
after open(), and ensuring that the file is read quickly and pinned
in memory can really only be done by allocating RAM in the
application to hold it until it is needed....
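
The application-side alternative Dave describes — holding the data in
RAM the application owns — reduces to something like the following
sketch.  Error handling is minimal and the helper name is invented;
callers wanting a hard guarantee against reclaim could additionally
mlock() the buffer:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Read the whole file into an application-owned buffer.  Unlike
 * page-cache readahead, this memory is not silently reclaimed behind
 * the application's back.  Returns NULL (and allocates nothing) on
 * any failure. */
static char *slurp_file(const char *path, size_t *len)
{
	struct stat st;
	char *buf;
	off_t off = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) != 0) {
		close(fd);
		return NULL;
	}
	buf = malloc(st.st_size ? (size_t)st.st_size : 1);
	while (buf && off < st.st_size) {
		ssize_t n = read(fd, buf + off, st.st_size - off);
		if (n <= 0) {	/* read error or unexpected EOF */
			free(buf);
			buf = NULL;
			break;
		}
		off += n;
	}
	close(fd);
	if (buf)
		*len = (size_t)off;
	return buf;
}
```

Once the read completes, the disk is free to spin down and the data
stays available regardless of memory pressure on the page cache.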

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  3:36     ` Dave Chinner
@ 2012-12-16  3:59       ` Eric Wong
  2012-12-16  4:26         ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2012-12-16  3:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-kernel

Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Dec 16, 2012 at 03:04:42AM +0000, Eric Wong wrote:
> > Dave Chinner <david@fromorbit.com> wrote:
> > > On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
> > > > 
> > > >  Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
> > > >   After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> > > 
> > > You've basically asked fadvise() to readahead the entire file if it
> > > can. That means it is likely to issue enough readahead to fill the
> > > IO queue, and that's where all the latency is coming from. If all
> > > you are trying to do is reduce the latency of the first read, then
> > > only readahead the initial range that you are going to need to read...
> > 
> > Yes, I do want to read the whole file, eventually.  So I want to put
> > the file into the page cache ASAP and allow the disk to spin down.
> 
> Issuing readahead is not going to speed up the first read. Either
> you will spend more time issuing all the readahead, or you block
> waiting for the first read to complete. And the way you are issuing
> readahead does not guarantee the entire file is brought into the
> page cache....

I'm not relying on readahead to speed up the first read.

By using fadvise/readahead, I want a _best-effort_ attempt to
keep the file in cache.

> > But I also want the first read() to be fast.
> 
> You can't have a pony, sorry.

I want the first read() to happen sooner than it would under current
fadvise.  If it's slightly slower that w/o fadvise, that's fine.
The 1-2s slower with current fadvise is what bothers me.

> > > Also, Pushing readahead off to a workqueue potentially allows
> > > someone to DOS the system because readahead won't ever get throttled
> > > in the syscall context...
> > 
> > Yes, I'm a little worried about this, too.
> > Perhaps squashing something like the following will work?
> > 
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 56a80a9..51dc58e 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -246,16 +246,18 @@ void wq_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >  {
> >  	struct wq_ra_req *req;
> >  
> > +	nr_to_read = max_sane_readahead(nr_to_read);
> > +	if (!nr_to_read)
> > +		goto skip_ra;
> 
> You do realise that anything you read ahead will be accounted as
> inactive pages, so nr_to_read doesn't decrease at all as you fill
> memory with readahead pages...

Ah, ok, I'll see if I can rework it.

> >  	req = kzalloc(sizeof(*req), GFP_ATOMIC);
> 
> GFP_ATOMIC? Really?

Sorry, I'm really new at this.

> In reality, I think you are looking in the wrong place to fix your
> "first read" latency problem. No matter what you do, there is going
> to be IO latency on the first read. And readahead doesn't guarantee
> that the pages are brought into the page cache (ever heard of
> readahead thrashing?) so the way you are doing your readahead is not
> going to result in you being able to spin the disk down after
> issuing a readahead command...

Right, I want a _best-effort_ readahead (which seems to be what an
advisory interface should offer).

> You've really got two problems - minimal initial latency, and
> reading the file quickly and pinning it in memory until you get
> around to needing it. The first can't be made faster by using
> readahead, and the second can not be guaranteed by using readahead.

Agreed.  I think I overstated the requirements.

I want "less-bad" initial latency than I was getting.

So I don't mind if open()+fadvise()+read() is a couple of milliseconds
slower than just open()+read(), but I do mind if fadvise() takes 1-2
seconds.

> IOWs, readahead is the wrong tool for solving your problems. Minimal
> IO latency from the first read will come from just issuing pread()
> after open(), and ensuring that the file is read quickly and pinned
> in memory can really only be done by allocating RAM in the
> application to hold it until it is needed....

I definitely only want a best-effort method to put a file into memory.
I want the kernel to decide whether or not to cache it.

Thanks for looking at this!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  3:35       ` Eric Wong
@ 2012-12-16  4:15         ` Dave Chinner
  2012-12-16  5:23           ` Eric Wong
  2012-12-16  8:48           ` Zheng Liu
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2012-12-16  4:15 UTC (permalink / raw)
  To: Eric Wong; +Cc: Alan Cox, linux-mm, linux-kernel

On Sun, Dec 16, 2012 at 03:35:49AM +0000, Eric Wong wrote:
> Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Dec 16, 2012 at 12:25:49AM +0000, Eric Wong wrote:
> > > Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > > On Sat, 15 Dec 2012 00:54:48 +0000
> > > > Eric Wong <normalperson@yhbt.net> wrote:
> > > > 
> > > > > Applications streaming large files may want to reduce disk spinups and
> > > > > I/O latency by performing large amounts of readahead up front

> This could also be a use case for an audio/video player.

Sure, but this can all be handled by a userspace application. If you
want to avoid/batch IO to enable longer spindown times, then you
have to load the file into RAM somewhere, and you don't need special
kernel support for that.

> So no, there's no difference that matters between the approaches.
> But I think doing this in the kernel is easier for userspace users.

The kernel provides mechanisms for applications to use. You have not
mentioned anything new that requires a new kernel mechanism to
achieve - you just need to have the knowledge to put the pieces
together properly.  People have been solving this same problem for
the last 20 years without needing to tweak fadvise(). Or even having
an fadvise() syscall...

Nothing about low latency IO or streaming IO is simple or easy, and
changing how readahead works doesn't change that fact. All it does
is change the behaviour of every other application that uses
fadvise() to minimise IO latency....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  3:59       ` Eric Wong
@ 2012-12-16  4:26         ` Dave Chinner
  2012-12-16  5:17           ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2012-12-16  4:26 UTC (permalink / raw)
  To: Eric Wong; +Cc: linux-mm, linux-kernel

On Sun, Dec 16, 2012 at 03:59:53AM +0000, Eric Wong wrote:
> Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Dec 16, 2012 at 03:04:42AM +0000, Eric Wong wrote:
> > > Dave Chinner <david@fromorbit.com> wrote:
> > > > On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
> > > > > 
> > > > >  Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832>
> > > > >   After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> > > > 
> > > > You've basically asked fadvise() to readahead the entire file if it
> > > > can. That means it is likely to issue enough readahead to fill the
> > > > IO queue, and that's where all the latency is coming from. If all
> > > > you are trying to do is reduce the latency of the first read, then
> > > > only readahead the initial range that you are going to need to read...
> > > 
> > > Yes, I do want to read the whole file, eventually.  So I want to put
> > > the file into the page cache ASAP and allow the disk to spin down.
> > 
> > Issuing readahead is not going to speed up the first read. Either
> > you will spend more time issuing all the readahead, or you block
> > waiting for the first read to complete. And the way you are issuing
> > readahead does not guarantee the entire file is brought into the
> > page cache....
> 
> I'm not relying on readahead to speed up the first read.
> 
> By using fadvise/readahead, I want a _best-effort_ attempt to
> keep the file in cache.
> 
> > > But I also want the first read() to be fast.
> > 
> > You can't have a pony, sorry.
> 
> I want the first read() to happen sooner than it would under current
> fadvise. 

You're not listening.  You do not need the kernel to be modified to
avoid the latency of issuing 1GB of readahead on a file.

You don't need to do readahead before the first read. Nor do you
need to wait for 1GB of readahead to be issued before you do the
first read.

You could do readahead *concurrently* with the first read, so the
first read only blocks until the readahead of the first part of the
file completes.  i.e. just do readahead() in a background thread and
don't wait for it to complete before doing the first read.

You could even do readahead *after* the first read, when the time it
takes *doesn't matter* to the processing of the incoming data...

> I want "less-bad" initial latency than I was getting.

And you can do that by changing how you issue readahead from
userspace.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  4:26         ` Dave Chinner
@ 2012-12-16  5:17           ` Eric Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2012-12-16  5:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-mm, linux-kernel

Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Dec 16, 2012 at 03:59:53AM +0000, Eric Wong wrote:
> > I want the first read() to happen sooner than it would under current
> > fadvise. 
> 
> You're not listening.  You do not need the kernel to be modified to
> avoid the latency of issuing 1GB of readahead on a file.
> 
> You don't need to do readahead before the first read. Nor do you
> need to wait for 1GB of readahead to be issued before you do the
> first read.
> 
> You could do readahead *concurrently* with the first read, so the
> first read only blocks until the readahead of the first part of the
> file completes.  i.e. just do readahead() in a background thread and
> don't wait for it to complete before doing the first read.

What you describe with concurrent readahead() is _exactly_ what my test
program (in other email) does with the RA environment variable set.

I know I do not _need_ fadvise + background WILLNEED support in the
kernel.

But I think the kernel can make life easier and allow us to avoid doing
background threads or writing our own (inferior) caching in userspace.


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  4:15         ` Dave Chinner
@ 2012-12-16  5:23           ` Eric Wong
  2012-12-16 21:31             ` Dave Chinner
  2012-12-16  8:48           ` Zheng Liu
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Wong @ 2012-12-16  5:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alan Cox, linux-mm, linux-kernel

Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Dec 16, 2012 at 03:35:49AM +0000, Eric Wong wrote:
> > Dave Chinner <david@fromorbit.com> wrote:
> > > On Sun, Dec 16, 2012 at 12:25:49AM +0000, Eric Wong wrote:
> > > > Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > > > On Sat, 15 Dec 2012 00:54:48 +0000
> > > > > Eric Wong <normalperson@yhbt.net> wrote:
> > > > > 
> > > > > > Applications streaming large files may want to reduce disk spinups and
> > > > > > I/O latency by performing large amounts of readahead up front
> 
> > This could also be a use case for an audio/video player.
> 
> Sure, but this can all be handled by a userspace application. If you
> want to avoid/batch IO to enable longer spindown times, then you
> have to load the file into RAM somewhere, and you don't need special
> kernel support for that.

From userspace, I don't know when/if I'm caching too much and possibly
getting the userspace cache itself swapped out.

> > So no, there's no difference that matters between the approaches.
> > But I think doing this in the kernel is easier for userspace users.
> 
> The kernel provides mechanisms for applications to use. You have not
> mentioned anything new that requires a new kernel mechanism to
> achieve - you just need to have the knowledge to put the pieces
> together properly.  People have been solving this same problem for
> the last 20 years without needing to tweak fadvise(). Or even having
> an fadvise() syscall...

fadvise() is fairly new, and AFAIK few apps use it.  Perhaps if it
were improved, more people would use it and not have to reinvent
the wheel.

> Nothing about low latency IO or streaming IO is simple or easy, and
> changing how readahead works doesn't change that fact. All it does
> is change the behaviour of every other application that uses
> fadvise() to minimise IO latency....

I don't want to introduce regressions, either.

Perhaps if part of the FADV_WILLNEED readahead were handled
synchronously (maybe 2MB?) and humongous readaheads (like mine)
went to the background, that would be a good trade-off?


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  4:15         ` Dave Chinner
  2012-12-16  5:23           ` Eric Wong
@ 2012-12-16  8:48           ` Zheng Liu
  1 sibling, 0 replies; 18+ messages in thread
From: Zheng Liu @ 2012-12-16  8:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Wong, Alan Cox, linux-mm, linux-kernel

On Sun, Dec 16, 2012 at 03:15:49PM +1100, Dave Chinner wrote:
> On Sun, Dec 16, 2012 at 03:35:49AM +0000, Eric Wong wrote:
> > Dave Chinner <david@fromorbit.com> wrote:
> > > On Sun, Dec 16, 2012 at 12:25:49AM +0000, Eric Wong wrote:
> > > > Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > > > On Sat, 15 Dec 2012 00:54:48 +0000
> > > > > Eric Wong <normalperson@yhbt.net> wrote:
> > > > > 
> > > > > > Applications streaming large files may want to reduce disk spinups and
> > > > > > I/O latency by performing large amounts of readahead up front
> 
> > This could also be a use case for an audio/video player.
> 
> Sure, but this can all be handled by a userspace application. If you
> want to avoid/batch IO to enable longer spindown times, then you
> have to load the file into RAM somewhere, and you don't need special
> kernel support for that.
> 
> > So no, there's no difference that matters between the approaches.
> > But I think doing this in the kernel is easier for userspace users.
> 
> The kernel provides mechanisms for applications to use. You have not
> mentioned anything new that requires a new kernel mechanism to
> achieve - you just need to have the knowledge to put the pieces
> together properly.  People have been solving this same problem for
> the last 20 years without needing to tweak fadvise(). Or even having
> an fadvise() syscall...
> 
> Nothing about low latency IO or streaming IO is simple or easy, and
> changing how readahead works doesn't change that fact. All it does
> is change the behaviour of every other application that uses
> fadvise() to minimise IO latency....

Hi Dave,

I am wondering whether this patch might be a good idea to reduce the
latency of the fadvise() syscall itself.  I ran a really simple test on
my desktop to measure the latency of the fadvise syscall.  Before
applying this patch, fadvise takes 32 microseconds.  After applying the
patch, it only takes 4 microseconds.  (I was surprised that it took so
long!)

We have actually observed latency after using fadvise, but I haven't
found a proper time to look at this problem.  So I guess this patch
might be useful for reducing latency.

Regards,
                                                - Zheng


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  5:23           ` Eric Wong
@ 2012-12-16 21:31             ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2012-12-16 21:31 UTC (permalink / raw)
  To: Eric Wong; +Cc: Alan Cox, linux-mm, linux-kernel

On Sun, Dec 16, 2012 at 05:23:02AM +0000, Eric Wong wrote:
> Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Dec 16, 2012 at 03:35:49AM +0000, Eric Wong wrote:
> > > Dave Chinner <david@fromorbit.com> wrote:
> > > > On Sun, Dec 16, 2012 at 12:25:49AM +0000, Eric Wong wrote:
> > > > > Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > > > > On Sat, 15 Dec 2012 00:54:48 +0000
> > > > > > Eric Wong <normalperson@yhbt.net> wrote:
> > > > > > 
> > > > > > > Applications streaming large files may want to reduce disk spinups and
> > > > > > > I/O latency by performing large amounts of readahead up front
> > 
> > > This could also be a use case for an audio/video player.
> > 
> > Sure, but this can all be handled by a userspace application. If you
> > want to avoid/batch IO to enable longer spindown times, then you
> > have to load the file into RAM somewhere, and you don't need special
> > kernel support for that.
> 
> From userspace, I don't know when/if I'm caching too much and possibly
> getting the userspace cache itself swapped out.

Which causes the disk to spin up. Now you start to see the complexity
of what you are trying to achieve...

> > > So no, there's no difference that matters between the approaches.
> > > But I think doing this in the kernel is easier for userspace users.
> > 
> > The kernel provides mechanisms for applications to use. You have not
> > mentioned anything new that requires a new kernel mechanism to
> > achieve - you just need to have the knowledge to put the pieces
> > together properly.  People have been solving this same problem for
> > the last 20 years without needing to tweak fadvise(). Or even having
> > an fadvise() syscall...
> 
> fadvise() is fairly new, and AFAIK few apps use it.  Perhaps if it
> were improved, more people would use it and not have to reinvent
> the wheel.

fadvise() is not "fairly new". It's been around for many, many
years - it was there when the Linux kernel moved to git in 2005, and
I haven't bothered to look any further back in history...

> > Nothing about low latency IO or streaming IO is simple or easy, and
> > changing how readahead works doesn't change that fact. All it does
> > is change the behaviour of every other application that uses
> > fadvise() to minimise IO latency....
> 
> I don't want to introduce regressions, either.
> 
> Perhaps if part of the FADV_WILLNEED readahead were handled
> synchronously (maybe 2MB?) and humongous readaheads (like mine)
> went to the background, that would be a good trade-off?

Which you can already do in userspace yourself without changing the
kernel. i.e:

	main thread			background thread:

	readahead(0, 2MB)
	spawn background thread
	read(0, len)
					readahead(2MB,1GB);

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2012-12-16  2:45 ` Dave Chinner
  2012-12-16  3:04   ` Eric Wong
@ 2013-02-22 16:45   ` Phillip Susi
  2013-02-22 21:13     ` Eric Wong
  1 sibling, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2013-02-22 16:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Wong, linux-mm, linux-kernel

On 12/15/2012 9:45 PM, Dave Chinner wrote:
> On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
>> Applications streaming large files may want to reduce disk
>> spinups and I/O latency by performing large amounts of readahead
>> up front. Applications also tend to read files soon after opening
>> them, so waiting on a slow fadvise may cause unpleasant latency
>> when the application starts reading the file.
>> 
>> As a userspace hacker, I'm sometimes tempted to create a
>> background thread in my app to run readahead().  However, I
>> believe doing this in the kernel will make life easier for other
>> userspace hackers.
>> 
>> Since fadvise makes no guarantees about when (or even if)
>> readahead is performed, this change should not hurt existing
>> applications.
>> 
>> "strace -T" timing on an uncached, one gigabyte file:
>> 
>> Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832> 
>> After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> 
> You've basically asked fadvise() to readahead the entire file if
> it can. That means it is likely to issue enough readahead to fill
> the IO queue, and that's where all the latency is coming from. If
> all you are trying to do is reduce the latency of the first read,
> then only readahead the initial range that you are going to need to
> read...

It shouldn't take 2 seconds to queue up some async reads.  Are you
using ext3?  The blocks have to be mapped in order to queue the reads,
and without ext4 extents, this means the indirect blocks have to be
read and can cause fadvise to block.



* Re: [PATCH] fadvise: perform WILLNEED readahead in a workqueue
  2013-02-22 16:45   ` Phillip Susi
@ 2013-02-22 21:13     ` Eric Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2013-02-22 21:13 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Dave Chinner, linux-mm, linux-kernel

Phillip Susi <psusi@ubuntu.com> wrote:
> > On Sat, Dec 15, 2012 at 12:54:48AM +0000, Eric Wong wrote:
> >> "strace -T" timing on an uncached, one gigabyte file:
> >> 
> >> Before: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <2.484832> 
> >> After: fadvise64(3, 0, 0, POSIX_FADV_WILLNEED) = 0 <0.000061>
> 
> It shouldn't take 2 seconds to queue up some async reads.  Are you
> using ext3?  The blocks have to be mapped in order to queue the reads,
> and without ext4 extents, this means the indirect blocks have to be
> read and can cause fadvise to block.

You're right, I originally tested on ext3.

I just tested an unpatched 3.7.9 kernel with ext4 and it is much faster
(~250ms).  I consider ~250ms acceptable for my needs.  I will migrate
the rest of my setup to ext4 soon; thanks for the tip!


Thread overview: 18+ messages
2012-12-15  0:54 [PATCH] fadvise: perform WILLNEED readahead in a workqueue Eric Wong
2012-12-15 22:34 ` Alan Cox
2012-12-16  0:25   ` Eric Wong
2012-12-16  3:03     ` Dave Chinner
2012-12-16  3:35       ` Eric Wong
2012-12-16  4:15         ` Dave Chinner
2012-12-16  5:23           ` Eric Wong
2012-12-16 21:31             ` Dave Chinner
2012-12-16  8:48           ` Zheng Liu
2012-12-16  2:45 ` Dave Chinner
2012-12-16  3:04   ` Eric Wong
2012-12-16  3:09     ` Eric Wong
2012-12-16  3:36     ` Dave Chinner
2012-12-16  3:59       ` Eric Wong
2012-12-16  4:26         ` Dave Chinner
2012-12-16  5:17           ` Eric Wong
2013-02-22 16:45   ` Phillip Susi
2013-02-22 21:13     ` Eric Wong
