* Lockless page cache test results
From: Jens Axboe @ 2006-04-26 13:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: Nick Piggin, Andrew Morton, linux-mm


Hi,

Running a splice benchmark on a 4-way IPF box, I decided to give the
lockless page cache patches from Nick a spin. I've attached the results
as a png, it pretty much speaks for itself.

The test in question splices a 1GiB file to a pipe and then splices that
to some output. Normally that output would be something interesting, in
this case it's simply /dev/null. So it tests the input side of things
only, which is what I wanted to do here. To get adequate runtime, the
operation is repeated a number of times (120 in this example). The
benchmark does that number of loops with 1, 2, 3, and 4 clients each
pinned to a private CPU. The pinning is mainly done for more stable
results.
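
In outline, each client does roughly the following (an illustrative
sketch, not the actual benchmark tool: the file name, chunk size, loop
count and the omitted per-CPU pinning are assumptions):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	const size_t chunk = 65536;		/* 64k per splice call */
	const int loops = 120;			/* repeat count used in the test */
	int fd, null, pfd[2], i;

	fd = open("testfile", O_RDONLY);	/* the 1GiB input file */
	null = open("/dev/null", O_WRONLY);
	if (fd < 0 || null < 0 || pipe(pfd) < 0) {
		perror("setup");
		return 1;
	}

	for (i = 0; i < loops; i++) {
		loff_t off = 0;
		ssize_t in, out;

		/* move file pages from the page cache into the pipe ... */
		while ((in = splice(fd, &off, pfd[1], NULL, chunk,
				    SPLICE_F_MOVE)) > 0) {
			/* ... and drain the pipe straight to /dev/null */
			while (in > 0) {
				out = splice(pfd[0], NULL, null, NULL,
					     in, SPLICE_F_MOVE);
				if (out <= 0) {
					perror("splice");
					return 1;
				}
				in -= out;
			}
		}
	}
	return 0;
}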

-- 
Jens Axboe


[-- Attachment #2: lockless.png --]
[-- Type: image/png, Size: 3928 bytes --]


* Re: Lockless page cache test results
From: Nick Piggin @ 2006-04-26 14:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Nick Piggin, Andrew Morton, linux-mm

Jens Axboe wrote:
> Hi,
> 
> Running a splice benchmark on a 4-way IPF box, I decided to give the
> lockless page cache patches from Nick a spin. I've attached the results
> as a png, it pretty much speaks for itself.
> 
> The test in question splices a 1GiB file to a pipe and then splices that
> to some output. Normally that output would be something interesting, in
> this case it's simply /dev/null. So it tests the input side of things
> only, which is what I wanted to do here. To get adequate runtime, the
> operation is repeated a number of times (120 in this example). The
> benchmark does that number of loops with 1, 2, 3, and 4 clients each
> pinned to a private CPU. The pinning is mainly done for more stable
> results.

Thanks Jens!

It's interesting, single threaded performance is down a little. Is
this significant? In some other results you showed me with 3 splices
each running on their own file (ie. no tree_lock contention), lockless
looked slightly faster on the same machine.

In my microbenchmarks, single threaded lockless is quite a bit faster
than vanilla on both P4 and G5.

It could well be that the speculative get_page operation is naturally
a bit slower on Itanium CPUs -- there is a different mix of barriers,
reads, writes, etc. If only someone gave me an IPF system... ;)

As you said, it would be nice to see how this goes when the other end
are 4 gigabit pipes or so... And then things like specweb and file
serving workloads.

Nick

-- 
SUSE Labs, Novell Inc.

* Re: Lockless page cache test results
From: Andrew Morton @ 2006-04-26 16:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, npiggin, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
> Running a splice benchmark on a 4-way IPF box, I decided to give the
>  lockless page cache patches from Nick a spin. I've attached the results
>  as a png, it pretty much speaks for itself.

It does.

What does the test do?

In particular, does it cause the kernel to take tree_lock once per page, or
once per batch-of-pages?

* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 17:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, linux-mm

On Wed, Apr 26 2006, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > Running a splice benchmark on a 4-way IPF box, I decided to give the
> >  lockless page cache patches from Nick a spin. I've attached the results
> >  as a png, it pretty much speaks for itself.
> 
> It does.
> 
> What does the test do?
>
> In particular, does it cause the kernel to take tree_lock once per
> page, or once per batch-of-pages?

Once per page, it's basically exercising the generic_file_splice_read()
path. Basically X number of "clients" open the same file, and fill those
pages into a pipe using splice. The output end of the pipe is then
spliced to /dev/null to toss it away again. The top of the 4-client
vanilla run profile looks like this:

samples  %        symbol name
65328    47.8972  find_get_page

Basically the machine is fully pegged, about 7% idle time.

We can speed up the lookups with find_get_pages(). The test does 64k max,
so with luck we should be able to pull 16 pages in at a time. I'll try
and run such a test. But boy I wish find_get_pages_contig() was there
for that. I think I'd prefer adding that instead of coding that logic in
splice, it can get a little tricky.

-- 
Jens Axboe


* Re: Lockless page cache test results
From: Andrew Morton @ 2006-04-26 18:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, npiggin, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
> On Wed, Apr 26 2006, Andrew Morton wrote:
> > Jens Axboe <axboe@suse.de> wrote:
> > >
> > > Running a splice benchmark on a 4-way IPF box, I decided to give the
> > >  lockless page cache patches from Nick a spin. I've attached the results
> > >  as a png, it pretty much speaks for itself.
> > 
> > It does.
> > 
> > What does the test do?
> >
> > In particular, does it cause the kernel to take tree_lock once per
> > page, or once per batch-of-pages?
> 
> Once per page, it's basically exercising the generic_file_splice_read()
> path. Basically X number of "clients" open the same file, and fill those
> pages into a pipe using splice. The output end of the pipe is then
> spliced to /dev/null to toss it away again.

OK.  That doesn't sound like something which a real application is likely
to do ;)

> The top of the 4-client
> vanilla run profile looks like this:
> 
> samples  %        symbol name
> 65328    47.8972  find_get_page
> 
> Basically the machine is fully pegged, about 7% idle time.

Most of the time an acquisition of tree_lock is associated with a disk
read, or a page-size memset, or a page-size memcpy.  And often an
acquisition of tree_lock is associated with multiple pages, not just a
single page.

So although the graph looks good, I wouldn't view this as a super-strong
argument in favour of lockless pagecache.

> We can speed up the lookups with find_get_pages(). The test does 64k max,
> so with luck we should be able to pull 16 pages in at a time. I'll try
> and run such a test.

OK.

> But boy I wish find_get_pages_contig() was there
> for that. I think I'd prefer adding that instead of coding that logic in
> splice, it can get a little tricky.

I guess it'd make sense - we haven't had a need for such a thing before.

umm, something like...

unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
			    unsigned int nr_pages, struct page **pages)
{
	unsigned int i;
	unsigned int ret;
	pgoff_t index = start;

	read_lock_irq(&mapping->tree_lock);
	ret = radix_tree_gang_lookup(&mapping->page_tree,
				(void **)pages, start, nr_pages);
	for (i = 0; i < ret; i++) {
		if (pages[i]->mapping == NULL || pages[i]->index != index)
			break;
		page_cache_get(pages[i]);
		index++;
	}
	read_unlock_irq(&mapping->tree_lock);
	return i;
}
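
As an illustration only, a splice-style fill loop could then batch its
lookups with the above along these lines (slow path for missing pages and
all error handling omitted; "mapping" and "index" stand for the file's
address_space and the starting page offset):

	struct page *pages[16];
	unsigned int i, nr;

	nr = find_get_pages_contig(mapping, index, 16, pages);
	for (i = 0; i < nr; i++) {
		/* hand pages[i] to a pipe buffer; the reference taken
		 * above is dropped when that buffer is released */
	}
	/* pages from index + nr onward were not contiguously present,
	 * so fall back to find_get_page()/readpage() for those */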


* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 18:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, linux-mm

On Wed, Apr 26 2006, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > On Wed, Apr 26 2006, Andrew Morton wrote:
> > > Jens Axboe <axboe@suse.de> wrote:
> > > >
> > > > Running a splice benchmark on a 4-way IPF box, I decided to give the
> > > >  lockless page cache patches from Nick a spin. I've attached the results
> > > >  as a png, it pretty much speaks for itself.
> > > 
> > > It does.
> > > 
> > > What does the test do?
> > >
> > > In particular, does it cause the kernel to take tree_lock once per
> > > page, or once per batch-of-pages?
> > 
> > Once per page, it's basically exercising the generic_file_splice_read()
> > path. Basically X number of "clients" open the same file, and fill those
> > pages into a pipe using splice. The output end of the pipe is then
> > spliced to /dev/null to toss it away again.
> 
> OK.  That doesn't sound like something which a real application is likely
> to do ;)

I don't think it's totally unlikely. Could be streaming a large media
file to many clients, for instance. Of course you are not going to push
gigabytes of data per second like this test, but it's still the same
type of workload.

> > The top of the 4-client
> > vanilla run profile looks like this:
> > 
> > samples  %        symbol name
> > 65328    47.8972  find_get_page
> > 
> > Basically the machine is fully pegged, about 7% idle time.
> 
> Most of the time an acquisition of tree_lock is associated with a disk
> read, or a page-size memset, or a page-size memcpy.  And often an
> acquisition of tree_lock is associated with multiple pages, not just a
> single page.

Yeah, with mostly I/O I'd be hard pressed to show a difference.

> So although the graph looks good, I wouldn't view this as a super-strong
> argument in favour of lockless pagecache.

I didn't claim it was, just trying to show some data on at least one
case where the lockless patches perform well and the stock kernel does
not :-)

Are there cases where the lockless page cache performs worse than the
current one?

> > But boy I wish find_get_pages_contig() was there
> > for that. I think I'd prefer adding that instead of coding that logic in
> > splice, it can get a little tricky.
> 
> I guess it'd make sense - we haven't had a need for such a thing before.
> 
> umm, something like...
> 
> unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
> 			    unsigned int nr_pages, struct page **pages)
> {
> 	unsigned int i;
> 	unsigned int ret;
> 	pgoff_t index = start;
> 
> 	read_lock_irq(&mapping->tree_lock);
> 	ret = radix_tree_gang_lookup(&mapping->page_tree,
> 				(void **)pages, start, nr_pages);
> 	for (i = 0; i < ret; i++) {
> 		if (pages[i]->mapping == NULL || pages[i]->index != index)
> 			break;
> 		page_cache_get(pages[i]);
> 		index++;
> 	}
> 	read_unlock_irq(&mapping->tree_lock);
> 	return i;
> }

Ah clever, I didn't think of stopping on the first hole. Works well
since you need to manually get a reference on the page anyways.

Let me redo the numbers with this splice updated.

-- 
Jens Axboe


* Re: Lockless page cache test results
From: Christoph Lameter @ 2006-04-26 18:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, npiggin, linux-mm

On Wed, 26 Apr 2006, Andrew Morton wrote:

> OK.  That doesn't sound like something which a real application is likely
> to do ;)

A real application scenario may be an application that has lots of threads 
that are streaming data through multiple different disk channels (that 
are able to transfer data simultaneously, e.g. connected to different 
nodes in a NUMA system) into the same address space.

Something like the above is fairly typical for multimedia filters 
processing large amounts of data.


* Re: Lockless page cache test results
From: Andrew Morton @ 2006-04-26 18:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, npiggin, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
> Are there cases where the lockless page cache performs worse than the
> current one?

Yeah - when human beings try to understand and maintain it.

The usual tradeoffs apply ;)

* Re: Lockless page cache test results
From: Andrew Morton @ 2006-04-26 18:47 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: axboe, linux-kernel, npiggin, linux-mm

Christoph Lameter <clameter@sgi.com> wrote:
>
> On Wed, 26 Apr 2006, Andrew Morton wrote:
> 
> > OK.  That doesn't sound like something which a real application is likely
> > to do ;)
> 
> A real application scenario may be an application that has lots of threads 
> that are streaming data through multiple different disk channels (that 
> are able to transfer data simultaneously, e.g. connected to different 
> nodes in a NUMA system) into the same address space.
> 
> Something like the above is fairly typical for multimedia filters 
> processing large amounts of data.

From the same file?

To /dev/null?

* Re: Lockless page cache test results
From: Christoph Lameter @ 2006-04-26 18:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: axboe, linux-kernel, npiggin, linux-mm

On Wed, 26 Apr 2006, Andrew Morton wrote:

> > A real application scenario may be an application that has lots of threads 
> > that are streaming data through multiple different disk channels (that 
> > are able to transfer data simultanouesly. e.g. connected to different 
> > nodes in a NUMA system) into the same address space.
> > 
> > Something like the above is fairly typical for multimedia filters 
> > processing large amounts of data.
> 
> From the same file?

Reading sections of the same file on multiple processors. This is done in 
order to obtain higher read performance than possible with single threaded
reading.


* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 18:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, npiggin, linux-mm

On Wed, Apr 26 2006, Andrew Morton wrote:
> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > On Wed, 26 Apr 2006, Andrew Morton wrote:
> > 
> > > OK.  That doesn't sound like something which a real application is likely
> > > to do ;)
> > 
> > A real application scenario may be an application that has lots of threads 
> > that are streaming data through multiple different disk channels (that 
> > are able to transfer data simultaneously, e.g. connected to different 
> > nodes in a NUMA system) into the same address space.
> > 
> > Something like the above is fairly typical for multimedia filters 
> > processing large amounts of data.
> 
> From the same file?
> 
> To /dev/null?

/dev/null doesn't have much to do with it, other than the fact that it
basically stresses only the input side of things. Same file is the
interesting bit of course, as that's the granularity of the
tree_lock.

I haven't tested much else, I'll ask the tool to bench more files :)

-- 
Jens Axboe


* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 18:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, linux-mm


On Wed, Apr 26 2006, Jens Axboe wrote:
> We can speed up the lookups with find_get_pages(). The test does 64k max,
> so with luck we should be able to pull 16 pages in at a time. I'll try
> and run such a test. But boy I wish find_get_pages_contig() was there
> for that. I think I'd prefer adding that instead of coding that logic in
> splice, it can get a little tricky.

Here's such a run, graphed with the other two. I'll redo the lockless
side as well now; it's only fair to compare with that batching, too.

-- 
Jens Axboe


[-- Attachment #2: lockless-2.png --]
[-- Type: image/png, Size: 4511 bytes --]


* Re: Lockless page cache test results
From: Christoph Hellwig @ 2006-04-26 18:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, npiggin, linux-mm

On Wed, Apr 26, 2006 at 11:10:54AM -0700, Andrew Morton wrote:
> > But boy I wish find_get_pages_contig() was there
> > for that. I think I'd prefer adding that instead of coding that logic in
> > splice, it can get a little tricky.
> 
> I guess it'd make sense - we haven't had a need for such a thing before.
> 
> umm, something like...

XFS would have a use for it, too.  In fact XFS would prefer a
find_or_create_pages-like thing which is the thing splice wants in
the end as well.


* Re: Lockless page cache test results
From: Linus Torvalds @ 2006-04-26 19:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, npiggin, linux-mm



On Wed, 26 Apr 2006, Andrew Morton wrote:

> Jens Axboe <axboe@suse.de> wrote:
> > 
> > Once per page, it's basically exercising the generic_file_splice_read()
> > path. Basically X number of "clients" open the same file, and fill those
> > pages into a pipe using splice. The output end of the pipe is then
> > spliced to /dev/null to toss it away again.
> 
> OK.  That doesn't sound like something which a real application is likely
> to do ;)

True, but on the other hand, it does kind of "distill" one (small) part of 
something that real apps _are_ likely to do.

The whole 'splice to /dev/null' part can be seen as totally irrelevant, 
but at the same time a way to ignore all the other parts of normal page 
cache usage (ie the other parts of page cache usage tend to be the "map it 
into user space" or the actual "memcpy_to/from_user()" or the "TCP send" 
part).

The question, of course, is whether the part that remains (the actual page 
lookup) is important enough to matter, once it is part of a bigger chain 
in a real application.

In other words, the splice() thing is just a way to isolate one part of a 
chain that is usually much more involved, and micro-benchmark just that 
one part.

Splice itself can be optimized to do the lookup locking only once per N 
pages (where N currently is on the order of ~16), but that may not be as 
easy for some other paths (ie the normal read path).

And the "reading from the same file in multiple threads" _is_ a real load. 
It may sound stupid, but it would happen for any server that has a lot of 
locality across clients (and that's very much true for web-servers, for 
example).

That said, under most real loads, the page cache lookup is obviously 
always going to be just a tiny tiny part (as shown by the fact that Jens 
quotes 35 GB/s throughput - possible only because splice to /dev/null 
doesn't need to actually ever even _touch_ the data).

The fact that it drops to "just" 3GB/s for four clients is somewhat 
interesting, though, since that does put a limit on how well we can serve 
the same file (of course, 3GB/s is still a lot faster than any modern 
network will ever be able to push things around, but it's getting closer 
to the possibilities for real hardware (ie IB over PCI-X should be able to 
do about 1GB/s in "real life")).

So the fact that basically just lookup/locking overhead can limit things 
to 3GB/s is absolutely not totally uninteresting. Even if in practice 
there are other limits that would probably hit us much earlier.

It would be interesting to see where doing gang-lookup moves the target, 
but on the other hand, with smaller files (and small files are still 
common), gang lookup isn't going to help as much.

Of course, with small files, the actual filename lookup is likely to be 
the real limiter.

			Linus

* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 19:02 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, linux-kernel, npiggin, linux-mm

On Wed, Apr 26 2006, Christoph Hellwig wrote:
> On Wed, Apr 26, 2006 at 11:10:54AM -0700, Andrew Morton wrote:
> > > But boy I wish find_get_pages_contig() was there
> > > for that. I think I'd prefer adding that instead of coding that logic in
> > > splice, it can get a little tricky.
> > 
> > I guess it'd make sense - we haven't had a need for such a thing before.
> > 
> > umm, something like...
> 
> XFS would have a use for it, too.  In fact XFS would prefer a
> find_or_create_pages-like thing which is the thing splice wants in
> the end as well.

Yes, but preferably without locking the page. So splice really wants a
find_get_or_create_pages(). But it wouldn't simplify splice very much in
the end, since the worst part of that function is trying to ascertain if
the page is good, needs to be read in, truncated, etc.

-- 
Jens Axboe


* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 19:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel, npiggin, linux-mm

On Wed, Apr 26 2006, Linus Torvalds wrote:
> 
> 
> On Wed, 26 Apr 2006, Andrew Morton wrote:
> 
> > Jens Axboe <axboe@suse.de> wrote:
> > > 
> > > Once per page, it's basically exercising the generic_file_splice_read()
> > > path. Basically X number of "clients" open the same file, and fill those
> > > pages into a pipe using splice. The output end of the pipe is then
> > > spliced to /dev/null to toss it away again.
> > 
> > OK.  That doesn't sound like something which a real application is likely
> > to do ;)
> 
> True, but on the other hand, it does kind of "distill" one (small) part of 
> something that real apps _are_ likely to do.
> 
> The whole 'splice to /dev/null' part can be seen as totally irrelevant, 
> but at the same time a way to ignore all the other parts of normal page 
> cache usage (ie the other parts of page cache usage tend to be the "map it 
> into user space" or the actual "memcpy_to/from_user()" or the "TCP send" 
> part).
> 
> The question, of course, is whether the part that remains (the actual page 
> lookup) is important enough to matter, once it is part of a bigger chain 
> in a real application.
> 
> In other words, the splice() thing is just a way to isolate one part of a 
> chain that is usually much more involved, and micro-benchmark just that 
> one part.

Nick called it a find_get_page() microbenchmark, which is pretty much
spot on. So naturally it shows the absolute best side of the lockless
page cache, but that is also very interesting. The /dev/null output can
just be seen as an "infinitely" fast output method, both from a
throughput and a lightweight POV.

> It would be interesting to see where doing gang-lookup moves the target, 
> but on the other hand, with smaller files (and small files are still 
> common), gang lookup isn't going to help as much.

With a 16-page gang lookup in splice, the top profile for the 4-client
case (which is now at 4GiB/sec instead of 3) are:

samples  %        symbol name
30396    36.7217  __do_page_cache_readahead
25843    31.2212  find_get_pages_contig
9699     11.7174  default_idle

Even disregarding that readahead contender, which could probably be made
a little more clever, we are still spending an awful lot of time in the
page lookup. I didn't mention this before, but the get_page/put_page
overhead is also a lot smaller with the lockless patches.

-- 
Jens Axboe


* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 19:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, linux-mm

On Wed, Apr 26 2006, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > Are there cases where the lockless page cache performs worse than the
> > current one?
> 
> Yeah - when human beings try to understand and maintain it.
> 
> The usual tradeoffs apply ;)

Ah ok, thanks for clearing that up :-)

-- 
Jens Axboe


* Re: Lockless page cache test results
From: Jens Axboe @ 2006-04-26 19:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, Nick Piggin, Andrew Morton, linux-mm

On Thu, Apr 27 2006, Nick Piggin wrote:
> Jens Axboe wrote:
> >Hi,
> >
> >Running a splice benchmark on a 4-way IPF box, I decided to give the
> >lockless page cache patches from Nick a spin. I've attached the results
> >as a png, it pretty much speaks for itself.
> >
> >The test in question splices a 1GiB file to a pipe and then splices that
> >to some output. Normally that output would be something interesting, in
> >this case it's simply /dev/null. So it tests the input side of things
> >only, which is what I wanted to do here. To get adequate runtime, the
> >operation is repeated a number of times (120 in this example). The
> >benchmark does that number of loops with 1, 2, 3, and 4 clients each
> >pinned to a private CPU. The pinning is mainly done for more stable
> >results.
> 
> Thanks Jens!
> 
> It's interesting, single threaded performance is down a little. Is
> this significant? In some other results you showed me with 3 splices
> each running on their own file (ie. no tree_lock contention), lockless
> looked slightly faster on the same machine.

I can't say for sure, as I haven't done enough of these runs to know for
a fact if it's just a little fluctuation or actually statistically
significant. The tests are quick to run, I'll do a series of single
thread runs tomorrow to tell you.

> It could well be that the speculative get_page operation is naturally
> a bit slower on Itanium CPUs -- there is a different mix of barriers,
> reads, writes, etc. If only someone gave me an IPF system... ;)

I'll gladly trade the heat and noise generation of that beast with you
:-)

I can do the same numbers on a 2-way em64t for comparison, that should
get us a little better coverage.

> As you said, it would be nice to see how this goes when the other end
> are 4 gigabit pipes or so... And then things like specweb and file
> serving workloads.

Yes, for now I just consider the /dev/null splicing an extremely fast
and extremely lightweight interconnect :-)

-- 
Jens Axboe


* Re: Lockless page cache test results
  2006-04-26 19:15           ` Jens Axboe
@ 2006-04-26 20:12             ` Andrew Morton
  -1 siblings, 0 replies; 99+ messages in thread
From: Andrew Morton @ 2006-04-26 20:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: torvalds, linux-kernel, npiggin, linux-mm

Jens Axboe <axboe@suse.de> wrote:
>
> With a 16-page gang lookup in splice, the top profile for the 4-client
> case (which is now at 4GiB/sec instead of 3) are:
> 
> samples  %        symbol name
> 30396    36.7217  __do_page_cache_readahead
> 25843    31.2212  find_get_pages_contig
> 9699     11.7174  default_idle

__do_page_cache_readahead() should use gang lookup.  We never got around to
that, mainly because nothing really demonstrated a need.

It's a problem that __do_page_cache_readahead() is being called at all -
with everything in pagecache we should be auto-turning-off readahead.  This
happens because splice is calling the low-level do_page_cache_readahead().
If you convert it to use page_cache_readahead(), that will all vanish if
readahead is working right.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 18:49             ` Jens Axboe
@ 2006-04-26 20:31               ` Christoph Lameter
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Lameter @ 2006-04-26 20:31 UTC (permalink / raw)
  To: dgc; +Cc: Jens Axboe, Andrew Morton, linux-kernel, npiggin, linux-mm

Dave: Can you tell us more about the tree_lock contentions on I/O that you 
have seen?

On Wed, 26 Apr 2006, Jens Axboe wrote:

> On Wed, Apr 26 2006, Andrew Morton wrote:
> > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > > On Wed, 26 Apr 2006, Andrew Morton wrote:
> > > 
> > > > OK.  That doesn't sound like something which a real application is likely
> > > > to do ;)
> > > 
> > > A real application scenario may be an application that has lots of threads 
> > > that are streaming data through multiple different disk channels (that 
> > > are able to transfer data simultaneously, e.g. connected to different 
> > > nodes in a NUMA system) into the same address space.
> > > 
> > > Something like the above is fairly typical for multimedia filters 
> > > processing large amounts of data.
> > 
> > From the same file?
> > 
> > To /dev/null?
> 
> /dev/null doesn't have much to do with it, other than the fact that it
> basically stresses only the input side of things. Same file is the
> interesting bit of course, as that's the granularity of the
> tree_lock.
> 
> I haven't tested much else, I'll ask the tool to bench more files :)
> 
> -- 
> Jens Axboe
> 
> 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 18:57     ` Jens Axboe
@ 2006-04-27  2:19         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-04-27  2:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: akpm, linux-kernel, npiggin, linux-mm

On Wed, 26 Apr 2006 20:57:50 +0200
Jens Axboe <axboe@suse.de> wrote:

> On Wed, Apr 26 2006, Jens Axboe wrote:
> > We can speedup the lookups with find_get_pages(). The test does 64k max,
> > so with luck we should be able to pull 16 pages in at the time. I'll try
> > and run such a test. But boy I wish find_get_pages_contig() was there
> > for that. I think I'd prefer adding that instead of coding that logic in
> > splice, it can get a little tricky.
> 
> Here's such a run, graphed with the other two. I'll redo the lockless
> side as well now, it's only fair to compare with that batching as well.
> 

Hi, thank you for the interesting tests.

From a user's view, I want to see the comparison among 
- splice(file,/dev/null),
- mmap+madvise(file,WILLNEED)/write(/dev/null),
- read(file)/write(/dev/null)
in this 1-4 thread test. 

This will show when splice() can be used effectively.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 18:10       ` Andrew Morton
@ 2006-04-27  5:22         ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-27  5:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, npiggin, linux-mm

Andrew Morton wrote:

>>The top of the 4-client
>>vanilla run profile looks like this:
>>
>>samples  %        symbol name
>>65328    47.8972  find_get_page
>>
>>Basically the machine is fully pegged, about 7% idle time.
> 
> 
> Most of the time an acquisition of tree_lock is associated with a disk
> read, or a page-size memset, or a page-size memcpy.  And often an
> acquisition of tree_lock is associated with multiple pages, not just a
> single page.

Still, most of the time it is acquired once per page, for read, write,
and nopage.

For read and write it will often be a full-page memcpy, but even such
a memcpy operation can quickly become insignificant compared to
tree_lock contention.

Anyway, whatever. What needs to be demonstrated are real world
improvements at the end of the day.

> 
> So although the graph looks good, I wouldn't view this as a super-strong
> argument in favour of lockless pagecache.

No. Cool numbers though ;)

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: Lockless page cache test results
  2006-04-26 19:46     ` Jens Axboe
@ 2006-04-27  5:39       ` Chen, Kenneth W
  -1 siblings, 0 replies; 99+ messages in thread
From: Chen, Kenneth W @ 2006-04-27  5:39 UTC (permalink / raw)
  To: 'Jens Axboe', 'Nick Piggin'
  Cc: linux-kernel, 'Nick Piggin', 'Andrew Morton', linux-mm

Jens Axboe wrote on Wednesday, April 26, 2006 12:46 PM
> > It's interesting, single threaded performance is down a little. Is
> > this significant? In some other results you showed me with 3 splices
> > each running on their own file (ie. no tree_lock contention), lockless
> > looked slightly faster on the same machine.
> 
> I can do the same numbers on a 2-way em64t for comparison, that should
> get us a little better coverage.


I threw the lockless patch and Jens' splice-bench into our benchmark harness;
here are the numbers I collected on the following hardware:

(1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2
(2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3
(3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core)

Here are the graphs:

(1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2
http://kernel-perf.sourceforge.net/splice/2P-3.4Ghz.png

(2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3
http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz.png

(3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core)
http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz-DCHT.png

(4) everything on one graph:
http://kernel-perf.sourceforge.net/splice/splice.png

- Ken

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 19:00         ` Linus Torvalds
@ 2006-04-27  5:49           ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-27  5:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Jens Axboe, linux-kernel, npiggin, linux-mm

Linus Torvalds wrote:

> The question, of course, is whether the part that remains (the actual page 
> lookup) is important enough to matter, once it is part of a bigger chain 
> in a real application.

It can be. As a rough estimate, reading only 512MB/s from the same
file's pagecache on each CPU simply cannot scale past about 16 CPUs
before it becomes tree_lock bound (depending on what sort of lock
transfer latency numbers one plugs into the model).
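
As a back-of-envelope illustration of that estimate (the ~0.5 usec of
serialized cache-line time per tree_lock acquisition is an assumed number,
exactly the lock transfer latency knob mentioned above; 4KiB pages assumed):

/* Toy model only: each CPU streams 512MiB/s in 4KiB pages, and every
 * find_get_page() owns the tree_lock cache line exclusively for an
 * assumed ~0.5 usec while it bounces between CPUs. */
#include <stdio.h>

int main(void)
{
	const double pages_per_sec = 512.0 * 1024 * 1024 / 4096;	/* per CPU */
	const double lock_cost = 0.5e-6;				/* seconds, assumed */
	int cpus;

	for (cpus = 2; cpus <= 32; cpus *= 2)
		printf("%2d CPUs: %.2f s of serialized tree_lock time per second\n",
		       cpus, cpus * pages_per_sec * lock_cost);
	return 0;
}

The model crosses 1.0 (the lock alone consumes a full wall-clock second per
second) right around 16 CPUs; plug in a different lock cost and the knee
moves accordingly.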

> And the "reading from the same file in multiple threads" _is_ a real load. 
> It may sound stupid, but it would happen for any server that has a lot of 
> locality across clients (and that's very much true for web-servers, for 
> example).

I think of something like postgresql, which uses a WAL via the pagecache,
or things that use shared memory (which is more important now that fork
got lazier).


And it isn't even limited to reading from the same file in multiple
threads (although that's obviously my poster child).

If one has F files in cache, and N CPUs each accessing a random file, the
number of files whose tree_lock a CPU last touched is F/N. The chance of a
CPU next hitting one of these is (F/N) / F = 1/N -> 0 pretty quickly.
include/ perhaps.

Now this won't be as bad as everyone piling up on a single cacheline, but
it will still hurt, especially on systems with a shared front-side bus or
broadcast snooping (Opteron, POWER), etc.
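
To put a number on the 1/N figure, a tiny simulation (illustrative only;
the file count and access count are made up): F files, N CPUs, each access
picking a file at random, counting how often the accessing CPU was also the
last CPU to touch that file's tree_lock cache line:

#include <stdio.h>
#include <stdlib.h>

#define F		1000		/* files in cache (made up) */
#define ACCESSES	1000000		/* lookups per run (made up) */

int main(void)
{
	static int last_cpu[F];
	int n, i;
	long a;

	srand(1);
	for (n = 2; n <= 16; n *= 2) {
		long warm = 0;

		for (i = 0; i < F; i++)
			last_cpu[i] = -1;
		for (a = 0; a < ACCESSES; a++) {
			int cpu = rand() % n;
			int file = rand() % F;

			if (last_cpu[file] == cpu)
				warm++;		/* lock line still local */
			last_cpu[file] = cpu;
		}
		printf("%2d CPUs: %.3f of lookups hit a lock this CPU touched last (1/N = %.3f)\n",
		       n, (double)warm / ACCESSES, 1.0 / n);
	}
	return 0;
}

The measured fraction tracks 1/N, so with more CPUs nearly every lookup
drags the lock cache line over from another CPU.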

Nor will it only be the read side which is significant. Consider a single
CPU reading a very large file, and each node's kswapd running on other CPUs
to reclaim pagecache (which needs to take the write lock).

> 
> That said, under most real loads, the page cache lookup is obviously 
> always going to be just a tiny tiny part (as shown by the fact that Jens 
> quotes 35 GB/s throughput - possible only because splice to /dev/null 
> doesn't need to actually ever even _touch_ the data).
> 
> The fact that it drops to "just" 3GB/s for four clients is somewhat 
> interesting, though, since that does put a limit on how well we can serve 
> the same file (of course, 3GB/s is still a lot faster than any modern 
> network will ever be able to push things around, but it's getting closer 
> to the possibilities for real hardware (ie IB over PCI-X should be able to 
> do about 1GB/s in "real life")
> 
> So the fact that basically just lookup/locking overhead can limit things 
> to 3GB/s is absolutely not totally uninteresting. Even if in practice 
> there are other limits that would probably hit us much earlier.
> 
> It would be interesting to see where doing gang-lookup moves the target, 
> but on the other hand, with smaller files (and small files are still 
> common), gang lookup isn't going to help as much.
> 
> Of course, with small files, the actual filename lookup is likely to be 
> the real limiter.

Although that's lockless, so it scales; find_get_page will overtake it
at some point.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 18:46           ` Andrew Morton
@ 2006-04-27  5:58             ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-27  5:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, npiggin, linux-mm

Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> 
>>Are there cases where the lockless page cache performs worse than the
>>current one?
> 
> 
> Yeah - when human beings try to understand and maintain it.

Have any tried yet? ;)

I won't deny it is complex (because I don't like it when I make the
same point and people then go to great trouble to convince me how
simple it is!).

But I hope it isn't _too_ bad. It is basically a dozen-line
function at the core, which gets used to implement
find_get_page and find_lock_page. Their semantics remain the same,
so that's where the line is drawn (plus minor things, like an
addition for reclaim's remove-from-pagecache protocol).

IMO the rcu radix tree is probably the most complex bit... but
that pales in comparison to things like our prio tree, or RCU
trie.

> 
> The usual tradeoffs apply ;)

Definitely. It isn't fun if you just take the patch and merge it.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  5:39       ` Chen, Kenneth W
@ 2006-04-27  6:07         ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-27  6:07 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Jens Axboe', linux-kernel, 'Nick Piggin',
	'Andrew Morton',
	linux-mm

Chen, Kenneth W wrote:
> Jens Axboe wrote on Wednesday, April 26, 2006 12:46 PM
> 
>>>It's interesting, single threaded performance is down a little. Is
>>>this significant? In some other results you showed me with 3 splices
>>>each running on their own file (ie. no tree_lock contention), lockless
>>>looked slightly faster on the same machine.
>>
>>I can do the same numbers on a 2-way em64t for comparison, that should
>>get us a little better coverage.
> 
> 
> 
> I throw the lockless patch and Jens splice-bench into our benchmark harness,
> here are the numbers I collected, on the following hardware:
> 
> (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2
> (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3
> (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core)
> 
> Here are the graph:

Thanks a lot Ken.

So pagecache lookup performance goes up about 15-25% in single threaded
tests on your P4s. Phew, I wasn't dreaming it.

It is a pity that ipf hasn't improved similarly (and has even slowed down
a bit, if Jens' numbers are significant in that range). Next time I spend
some cycles on lockless pagecache, I'll try to scrounge an ipf and see
if I can't improve it (I don't expect miracles).

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  5:39       ` Chen, Kenneth W
@ 2006-04-27  6:15         ` Andi Kleen
  -1 siblings, 0 replies; 99+ messages in thread
From: Andi Kleen @ 2006-04-27  6:15 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Jens Axboe', 'Nick Piggin',
	linux-kernel, 'Nick Piggin', 'Andrew Morton',
	linux-mm

On Thursday 27 April 2006 07:39, Chen, Kenneth W wrote:
 
> (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2
> http://kernel-perf.sourceforge.net/splice/2P-3.4Ghz.png
> 
> (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3
> http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz.png
> 
> (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core)
> http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz-DCHT.png
> 
> (4) everything on one graph:
> http://kernel-perf.sourceforge.net/splice/splice.png

Looks like a clear improvement for lockless unless I'm misreading the graphs.
(Can you please use different colors next time?)

-Andi


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 20:12             ` Andrew Morton
  (?)
@ 2006-04-27  7:45             ` Jens Axboe
  2006-04-27  7:47                 ` Jens Axboe
  2006-04-27  7:57                 ` Nick Piggin
  -1 siblings, 2 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27  7:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, linux-kernel, npiggin, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2311 bytes --]

On Wed, Apr 26 2006, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > With a 16-page gang lookup in splice, the top profile for the 4-client
> > case (which is now at 4GiB/sec instead of 3) are:
> > 
> > samples  %        symbol name
> > 30396    36.7217  __do_page_cache_readahead
> > 25843    31.2212  find_get_pages_contig
> > 9699     11.7174  default_idle
> 
> __do_page_cache_readahead() should use gang lookup.  We never got around to
> that, mainly because nothing really demonstrated a need.
> 
> It's a problem that __do_page_cache_readahead() is being called at all -
> with everything in pagecache we should be auto-turning-off readahead.  This
> happens because splice is calling the low-level do_pagecache_readahead(). 
> If you convert it to use page_cache_readahead(), that will all vanish if
> readahead is working right.

You are right, I modified it to use page_cache_readahead() and that
looks a lot better for the vanilla kernel. Here's a new graph of
2.6.17-rc3 and 2.6.17-rc3-lockless. Both base kernels are the splice
branch, so it contains the contig lookup change as well.

I'm still doing only up to nr_cpus clients, as numbers start to
fluctuate above that.

Things look pretty bad for the lockless kernel though, Nick any idea
what is going on there? The splice change is pretty simple, see the top
three patches here:

http://brick.kernel.dk/git/?p=linux-2.6-block.git;a=shortlog;h=splice

The top profile for the 4-client case looks like this:

samples  %        symbol name
77955    77.7141  find_get_pages_contig
8034      8.0092  splice_to_pipe
5092      5.0763  default_idle
4330      4.3166  do_splice_to
1323      1.3189  sys_splice

where vanilla is nowhere near that bad. I added a third graph, which is
lockless with the top patch backed out again so it's using
find_get_page() for each page. The top profile then looks like this for
the 4-client case:

samples  %        symbol name
10685    39.2730  default_idle
4120     15.1432  find_get_page
2608      9.5858  sys_splice
1729      6.3550  radix_tree_lookup_slot
1708      6.2778  splice_from_pipe
1071      3.9365  splice_to_pipe

Finally, I'll just note that find_get_pages_contig() was modified for
the lockless kernel (read_lock -> spin_lock), so it isn't something
silly :-)

-- 
Jens Axboe


[-- Attachment #2: lockless-3.png --]
[-- Type: image/png, Size: 4089 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  7:45             ` Jens Axboe
@ 2006-04-27  7:47                 ` Jens Axboe
  2006-04-27  7:57                 ` Nick Piggin
  1 sibling, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27  7:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: torvalds, linux-kernel, npiggin, linux-mm

On Thu, Apr 27 2006, Jens Axboe wrote:
> Things look pretty bad for the lockless kernel though, Nick any idea
> what is going on there? The splice change is pretty simple, see the top
> three patches here:

Oh, I think I spotted something silly after all, the gang splice patch
is buggy. Let me fix that up and retest.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: Lockless page cache test results
  2006-04-27  6:15         ` Andi Kleen
@ 2006-04-27  7:51           ` Chen, Kenneth W
  -1 siblings, 0 replies; 99+ messages in thread
From: Chen, Kenneth W @ 2006-04-27  7:51 UTC (permalink / raw)
  To: 'Andi Kleen'
  Cc: 'Jens Axboe', 'Nick Piggin',
	linux-kernel, 'Nick Piggin', 'Andrew Morton',
	linux-mm

Andi Kleen wrote on Wednesday, April 26, 2006 11:15 PM
> On Thursday 27 April 2006 07:39, Chen, Kenneth W wrote:
> > (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2
> > http://kernel-perf.sourceforge.net/splice/2P-3.4Ghz.png
> > 
> > (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3
> > http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz.png
> > 
> > (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core)
> > http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz-DCHT.png
> > 
> > (4) everything on one graph:
> > http://kernel-perf.sourceforge.net/splice/splice.png
> 
> Looks like a clear improvement for lockless unless I'm misreading the
> graphs. (Can you please use different colors next time?)


Sorry, I'm a bit rusty with gnuplot. The color charts are updated at the
same URLs.  On the last one, I was trying to plot the same CPU type with
the same color but a different line weight for each kernel:

plot "data" using 1:2 title "2P Xeon 3.4 GHz - vanilla" with linespoints lt 1 lw 10, \
     "data" using 1:3 title "2P Xeon 3.4 GHz - lockless" with linespoints lt 1 lw 1

gnuplot gives me the same color on both plotted lines, but the line weight
argument doesn't have any effect.  I looked for examples everywhere on the
web to no avail.  I must be missing some argument somewhere that I can't
figure out right now :-(

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  7:45             ` Jens Axboe
@ 2006-04-27  7:57                 ` Nick Piggin
  2006-04-27  7:57                 ` Nick Piggin
  1 sibling, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-27  7:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, torvalds, linux-kernel, npiggin, linux-mm

Jens Axboe wrote:

> Things look pretty bad for the lockless kernel though, Nick any idea
> what is going on there? The splice change is pretty simple, see the top
> three patches here:

Could just be the use of a spinlock instead of a read lock.

I don't think it would be hard to convert find_get_pages_contig
to be lockless.

Patched vanilla numbers look nicer, but I'm curious as to why
__do_page_cache was so bad before, if the file was in cache.
Presumably it should not more than double tree_lock acquisition...
it isn't getting called multiple times for each page, is it?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  7:57                 ` Nick Piggin
@ 2006-04-27  8:02                   ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-27  8:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, torvalds, linux-kernel, npiggin, linux-mm

Nick Piggin wrote:
> Jens Axboe wrote:
> 
>> Things look pretty bad for the lockless kernel though, Nick any idea
>> what is going on there? The splice change is pretty simple, see the top
>> three patches here:
> 
> 
> Could just be the use of spin lock instead of read lock.
> 
> I don't think it would be hard to convert find_get_pages_contig
> to be lockless.
> 
> Patched vanilla numbers look nicer, but I'm curious as to why
> __do_page_cache was so bad before, if the file was in cache.
> Presumably it should not more than double tree_lock acquisition...
> it isn't getting called multiple times for each page, is it?

Hmm, what's more, find_get_pages_contig shouldn't result in any
fewer tree_lock acquires than the open coded thing there now
(for the densely populated pagecache case).

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  2:19         ` KAMEZAWA Hiroyuki
@ 2006-04-27  8:03           ` Jens Axboe
  -1 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27  8:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: akpm, linux-kernel, npiggin, linux-mm

On Thu, Apr 27 2006, KAMEZAWA Hiroyuki wrote:
> On Wed, 26 Apr 2006 20:57:50 +0200
> Jens Axboe <axboe@suse.de> wrote:
> 
> > On Wed, Apr 26 2006, Jens Axboe wrote:
> > > We can speedup the lookups with find_get_pages(). The test does 64k max,
> > > so with luck we should be able to pull 16 pages in at the time. I'll try
> > > and run such a test. But boy I wish find_get_pages_contig() was there
> > > for that. I think I'd prefer adding that instead of coding that logic in
> > > splice, it can get a little tricky.
> > 
> > Here's such a run, graphed with the other two. I'll redo the lockless
> > side as well now, it's only fair to compare with that batching as well.
> > 
> 
> Hi, thank you for interesting tests.
> 
> From user's view, I want to see the comparison among 
> - splice(file,/dev/null),
> - mmap+madvise(file,WILLNEED)/write(/dev/null),
> - read(file)/write(/dev/null)
> in this 1-4 threads test. 
> 
> This will show when splice() can be used effectively.

Sure, should be easy enough to do.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  7:57                 ` Nick Piggin
@ 2006-04-27  8:36                   ` Jens Axboe
  -1 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27  8:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, torvalds, linux-kernel, npiggin, linux-mm

On Thu, Apr 27 2006, Nick Piggin wrote:
> Jens Axboe wrote:
> 
> >Things look pretty bad for the lockless kernel though, Nick any idea
> >what is going on there? The splice change is pretty simple, see the top
> >three patches here:
> 
> Could just be the use of spin lock instead of read lock.
> 
> I don't think it would be hard to convert find_get_pages_contig
> to be lockless.

Ah, certainly, it's not lockless like find_get_page(). Care to do such a
patch?

> Patched vanilla numbers look nicer, but I'm curious as to why
> __do_page_cache was so bad before, if the file was in cache.
> Presumably it should not more than double tree_lock acquisition...
> it isn't getting called multiple times for each page, is it?

It still does a lot of extra work that's completely wasted. With
page_cache_readahead(), we should hit the RA_FLAG_INCACHE flag and be
done with it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  8:02                   ` Nick Piggin
@ 2006-04-27  9:00                     ` Jens Axboe
  -1 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27  9:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, torvalds, linux-kernel, npiggin, linux-mm

On Thu, Apr 27 2006, Nick Piggin wrote:
> Nick Piggin wrote:
> >Jens Axboe wrote:
> >
> >>Things look pretty bad for the lockless kernel though, Nick any idea
> >>what is going on there? The splice change is pretty simple, see the top
> >>three patches here:
> >
> >
> >Could just be the use of spin lock instead of read lock.
> >
> >I don't think it would be hard to convert find_get_pages_contig
> >to be lockless.
> >
> >Patched vanilla numbers look nicer, but I'm curious as to why
> >__do_page_cache was so bad before, if the file was in cache.
> >Presumably it should not more than double tree_lock acquisition...
> >it isn't getting called multiple times for each page, is it?
> 
> Hmm, what's more, find_get_pages_contig shouldn't result in any
> fewer tree_lock acquires than the open coded thing there now
> (for the densely populated pagecache case).

How do you figure? The open-coded one does a find_get_page() on each
page in that range, so for x pages we'll grab and release ->tree_lock
x times.

For the fully populated pagecache case, find_get_pages_contig() should
return the full range of x pages with just one grab/release of
->tree_lock.
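
As a userspace analogy of that difference (not kernel code: a toy table
stands in for the radix tree, a pthread mutex for ->tree_lock, and the
looked-up range is assumed to fit in the table; build with -pthread), the
point being only the number of lock round-trips per 16-page batch:

#include <pthread.h>
#include <stdio.h>

#define NR_SLOTS	1024
#define BATCH		16

static void *slots[NR_SLOTS];			/* toy "pagecache" */
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long acquisitions;

/* one lock round-trip per page, like open coding find_get_page() */
static unsigned one_by_one(unsigned idx, unsigned nr, void **out)
{
	unsigned i;

	for (i = 0; i < nr; i++) {
		pthread_mutex_lock(&tree_lock);
		acquisitions++;
		out[i] = slots[idx + i];
		pthread_mutex_unlock(&tree_lock);
		if (!out[i])
			break;
	}
	return i;
}

/* one lock round-trip for the whole contiguous range */
static unsigned gang(unsigned idx, unsigned nr, void **out)
{
	unsigned i;

	pthread_mutex_lock(&tree_lock);
	acquisitions++;
	for (i = 0; i < nr; i++) {
		out[i] = slots[idx + i];
		if (!out[i])
			break;
	}
	pthread_mutex_unlock(&tree_lock);
	return i;
}

int main(void)
{
	void *pages[BATCH];
	unsigned i;

	for (i = 0; i < NR_SLOTS; i++)
		slots[i] = &slots[i];		/* every "page" present */

	acquisitions = 0;
	one_by_one(0, BATCH, pages);
	printf("per-page lookup: %lu lock acquisitions for %d pages\n",
	       acquisitions, BATCH);

	acquisitions = 0;
	gang(0, BATCH, pages);
	printf("gang lookup:     %lu lock acquisition for %d pages\n",
	       acquisitions, BATCH);
	return 0;
}

For a fully cached 64k request that is 16 round-trips versus one; batching
amortizes the lock (and its cache-line bouncing) across the whole request.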

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 19:00         ` Linus Torvalds
                           ` (2 preceding siblings ...)
  (?)
@ 2006-04-27  9:35         ` Jens Axboe
  -1 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27  9:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel, npiggin, linux-mm

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

On Wed, Apr 26 2006, Linus Torvalds wrote:
> It would be interesting to see where doing gang-lookup moves the target, 
> but on the other hand, with smaller files (and small files are still 
> common), gang lookup isn't going to help as much.

Here is the graph for:

- 2.6.17-rc3 + splice reada fix (vanilla-no-gang)

vs

- 2.6.17-rc3 + splice reada fix + gang lookup (vanilla)

vs

- 2.6.17-rc3 + splice reada fix + lockless page cache (lockless-no-gang)

Average of 3 runs graphed. The lockless run didn't use the batched lookup,
as that is a lot slower; it basically defeats the purpose of the lockless
page cache until Nick patches it up :-)

Conclusion: the gang lookup makes things faster for the vanilla kernel,
but it only makes it scale marginally (if at all) better.

-- 
Jens Axboe


[-- Attachment #2: lockless-4.png --]
[-- Type: image/png, Size: 4494 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  8:03           ` Jens Axboe
@ 2006-04-27 11:16             ` Jens Axboe
  -1 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27 11:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: akpm, linux-kernel, npiggin, linux-mm

On Thu, Apr 27 2006, Jens Axboe wrote:
> On Thu, Apr 27 2006, KAMEZAWA Hiroyuki wrote:
> > On Wed, 26 Apr 2006 20:57:50 +0200
> > Jens Axboe <axboe@suse.de> wrote:
> > 
> > > On Wed, Apr 26 2006, Jens Axboe wrote:
> > > > We can speedup the lookups with find_get_pages(). The test does 64k max,
> > > > so with luck we should be able to pull 16 pages in at the time. I'll try
> > > > and run such a test. But boy I wish find_get_pages_contig() was there
> > > > for that. I think I'd prefer adding that instead of coding that logic in
> > > > splice, it can get a little tricky.
> > > 
> > > Here's such a run, graphed with the other two. I'll redo the lockless
> > > side as well now, it's only fair to compare with that batching as well.
> > > 
> > 
> > Hi, thank you for interesting tests.
> > 
> > From user's view, I want to see the comparison among 
> > - splice(file,/dev/null),
> > - mmap+madvise(file,WILLNEED)/write(/dev/null),
> > - read(file)/write(/dev/null)
> > in this 1-4 threads test. 
> > 
> > This will show when splice() can be used effectively.
> 
> Sure, should be easy enough to do.

Added. 1 vs 2/3/4 clients isn't very interesting, so to keep it short,
here are numbers for 2 clients to /dev/null and to localhost.

Sending to /dev/null

ml370:/data # ./splice-bench -n2 -l10 -a -s -z file
Waiting for clients
Client1 (splice): 19030 MiB/sec (10240MiB in 551 msecs)
Client0 (splice): 18961 MiB/sec (10240MiB in 553 msecs)
Client1 (mmap): 158875 MiB/sec (10240MiB in 66 msecs)
Client0 (mmap): 158875 MiB/sec (10240MiB in 66 msecs)
Client1 (rw): 1691 MiB/sec (10240MiB in 6200 msecs)
Client0 (rw): 1690 MiB/sec (10240MiB in 6201 msecs)

Sending/receiving over lo

ml370:/data # ./splice-bench -n2 -l10 -a -s file
Waiting for clients
Client0 (splice): 3007 MiB/sec (10240MiB in 3486 msecs)
Client1 (splice): 3003 MiB/sec (10240MiB in 3491 msecs)
Client0 (mmap): 555 MiB/sec (8192MiB in 15094 msecs)
Client1 (mmap): 580 MiB/sec (9216MiB in 16257 msecs)
Client0 (rw): 538 MiB/sec (8192MiB in 15573 msecs)
Client1 (rw): 541 MiB/sec (8192MiB in 15498 msecs)
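
For reference, a rough sketch of what the three client methods presumably
do per pass. The actual splice-bench source isn't shown in this thread, so
all names below are illustrative, error handling is omitted, and at the
time splice() would typically have been a raw syscall rather than a glibc
wrapper. For the /dev/null runs out_fd is simply an open /dev/null; for
the lo runs it is a connected socket.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BUFSZ	(64 * 1024)	/* matches the 64k chunk size used in the test */

/* "splice": file -> pipe -> output fd, data never copied to user space */
static void client_splice(int in_fd, int out_fd, loff_t size)
{
	int pfd[2];
	loff_t off = 0;

	pipe(pfd);
	while (off < size) {
		ssize_t n = splice(in_fd, &off, pfd[1], NULL, BUFSZ,
				   SPLICE_F_MOVE);
		if (n <= 0)
			break;
		splice(pfd[0], NULL, out_fd, NULL, n, SPLICE_F_MOVE);
	}
	close(pfd[0]);
	close(pfd[1]);
}

/* "mmap": map the file and write() the mapping to the output fd */
static void client_mmap(int in_fd, int out_fd, size_t size)
{
	char *p = mmap(NULL, size, PROT_READ, MAP_SHARED, in_fd, 0);
	size_t off;

	for (off = 0; off < size; off += BUFSZ)
		write(out_fd, p + off,
		      size - off < BUFSZ ? size - off : BUFSZ);
	munmap(p, size);
}

/* "rw": plain read()/write() through a 64k user-space buffer */
static void client_rw(int in_fd, int out_fd)
{
	static char buf[BUFSZ];
	ssize_t n;

	while ((n = read(in_fd, buf, BUFSZ)) > 0)
		write(out_fd, buf, n);
}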

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27 11:16             ` Jens Axboe
@ 2006-04-27 11:41               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 99+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-04-27 11:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: akpm, linux-kernel, npiggin, linux-mm

On Thu, 27 Apr 2006 13:16:25 +0200
Jens Axboe <axboe@suse.de> wrote:

> Added, 1 vs 2/3/4 clients isn't very interesting, so to keep it short
> here are numbers for 2 clients to /dev/null and localhost.
> 
Thank you! Looks like splice has a significant advantage :)

> Sending to /dev/null
> 
> ml370:/data # ./splice-bench -n2 -l10 -a -s -z file
> Waiting for clients
> Client1 (splice): 19030 MiB/sec (10240MiB in 551 msecs)
> Client0 (splice): 18961 MiB/sec (10240MiB in 553 msecs)
This presumably shows the cost of gathering the page-cache pages.

> Client1 (mmap): 158875 MiB/sec (10240MiB in 66 msecs)
> Client0 (mmap): 158875 MiB/sec (10240MiB in 66 msecs)
This shows the read/write system-call and user-program cost, right?

> Client1 (rw): 1691 MiB/sec (10240MiB in 6200 msecs)
> Client0 (rw): 1690 MiB/sec (10240MiB in 6201 msecs)
> 
This shows the copy_to_user() cost for 10240MiB.
BTW, how big are the CPU cache and the read/write buffer size in this test?

> Sending/receiving over lo
> 
Reading from a file and writing to lo?

> ml370:/data # ./splice-bench -n2 -l10 -a -s file
> Waiting for clients
> Client0 (splice): 3007 MiB/sec (10240MiB in 3486 msecs)
> Client1 (splice): 3003 MiB/sec (10240MiB in 3491 msecs)
> Client0 (mmap): 555 MiB/sec (8192MiB in 15094 msecs)
> Client1 (mmap): 580 MiB/sec (9216MiB in 16257 msecs)
> Client0 (rw): 538 MiB/sec (8192MiB in 15573 msecs)
> Client1 (rw): 541 MiB/sec (8192MiB in 15498 msecs)
> 

Thank you.
-Kame


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27 11:41               ` KAMEZAWA Hiroyuki
@ 2006-04-27 11:45                 ` Jens Axboe
  -1 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-27 11:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: akpm, linux-kernel, npiggin, linux-mm

On Thu, Apr 27 2006, KAMEZAWA Hiroyuki wrote:
> On Thu, 27 Apr 2006 13:16:25 +0200
> Jens Axboe <axboe@suse.de> wrote:
> 
> > Added, 1 vs 2/3/4 clients isn't very interesting, so to keep it short
> > here are numbers for 2 clients to /dev/null and localhost.
> > 
> Thank you! Looks like splice has a significant advantage :)
> 
> > Sending to /dev/null
> > 
> > ml370:/data # ./splice-bench -n2 -l10 -a -s -z file
> > Waiting for clients
> > Client1 (splice): 19030 MiB/sec (10240MiB in 551 msecs)
> > Client0 (splice): 18961 MiB/sec (10240MiB in 553 msecs)
> This presumably shows the cost of gathering the page-cache pages.

Precisely, it's basically the cost of looking up the pages and adding
them to the pipe.

> > Client1 (mmap): 158875 MiB/sec (10240MiB in 66 msecs)
> > Client0 (mmap): 158875 MiB/sec (10240MiB in 66 msecs)
> This shows the read/write system-call and user-program cost, right?

It shows the cost of write()'ing the mmap'ed file area to /dev/null.

> > Client1 (rw): 1691 MiB/sec (10240MiB in 6200 msecs)
> > Client0 (rw): 1690 MiB/sec (10240MiB in 6201 msecs)
> > 
> This shows the copy_to_user() cost for 10240MiB.
> BTW, how big are the CPU cache and the read/write buffer size in this test?

This was done on a Xeon with 2MB of L2 cache. The buffer size used was 64k
in all cases.

> > Sending/receiving over lo
> > 
> Reading from a file and writing to lo?

I'd rather say the input is a file and the output is a socket going to lo;
that is a little more precise, given the differing methods the clients use.
But I suspect this is what you meant.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  9:00                     ` Jens Axboe
@ 2006-04-27 13:36                       ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-27 13:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, torvalds, linux-kernel, npiggin, linux-mm

Jens Axboe wrote:
> On Thu, Apr 27 2006, Nick Piggin wrote:

>>Hmm, what's more, find_get_pages_contig shouldn't result in any
>>fewer tree_lock acquires than the open coded thing there now
>>(for the densely populated pagecache case).
> 
> 
> How do you figure? The open coded one does a find_get_page() on each
> page in that range, so for x number of pages we'll grab and release
> ->tree_lock x times.

Yeah you're right. I had in mind that you were using
find_get_pages_contig in readahead, rather than in splice.
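
To make the difference concrete, roughly (a fragment, not the actual
splice code; the batched call holds tree_lock once and stops at the first
hole so the result stays contiguous):

	/* open-coded: one tree_lock acquire/release per page */
	for (i = 0; i < nr; i++) {
		pages[i] = find_get_page(mapping, index + i);
		if (!pages[i])
			break;
	}

	/* batched: a single tree_lock hold for the whole run */
	nr_found = find_get_pages_contig(mapping, index, nr, pages);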

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27  5:49           ` Nick Piggin
@ 2006-04-27 15:12             ` Linus Torvalds
  -1 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2006-04-27 15:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Jens Axboe, linux-kernel, npiggin, linux-mm



On Thu, 27 Apr 2006, Nick Piggin wrote:
> > 
> > Of course, with small files, the actual filename lookup is likely to be the
> > real limiter.
> 
> Although that's lockless so it scales. find_get_page will overtake it
> at some point.

filename lookup is only lockless for independent files. You end up getting 
the "dentry->d_lock" for a successful lookup in the lookup path, so if you 
have multiple threads looking up the same files (or - MUCH more commonly - 
directories), you're not going to be lockless.

I don't know how we could improve it. I've several times thought that we 
_should_ be able to do the directory lookups under the rcu read lock and 
never touch their d_count or d_lock at all, but the locking against 
directory renaming depends very intimately on d_lock.

It is _possible_ that we should be able to handle it purely with just 
memory ordering rather than depending on d_lock. That would be wonderful.

Of course, we do actually scale pretty damn well already. I'm just saying 
that it's not perfect.

See __d_lookup() for details.
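
The shape of that hot path, as a heavily simplified sketch (not the real
__d_lookup() code; hash_lookup_rcu() and still_hashed() are made-up
stand-ins for the hash walk and the recheck done under the lock):

	rcu_read_lock();
	dentry = hash_lookup_rcu(parent, name);	/* lockless hash chain walk */
	if (dentry) {
		spin_lock(&dentry->d_lock);	/* serializes against rename */
		if (still_hashed(dentry, parent, name))
			atomic_inc(&dentry->d_count);	/* shared-cacheline hit */
		else
			dentry = NULL;
		spin_unlock(&dentry->d_lock);
	}
	rcu_read_unlock();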

			Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-27 15:12             ` Linus Torvalds
@ 2006-04-28  4:54               ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-28  4:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Jens Axboe, linux-kernel, npiggin, linux-mm

Linus Torvalds wrote:
> 
> On Thu, 27 Apr 2006, Nick Piggin wrote:
> 
>>>Of course, with small files, the actual filename lookup is likely to be the
>>>real limiter.
>>
>>Although that's lockless so it scales. find_get_page will overtake it
>>at some point.
> 
> 
> filename lookup is only lockless for independent files. You end up getting 
> the "dentry->d_lock" for a successful lookup in the lookup path, so if you 
> have multiple threads looking up the same files (or - MUCH more commonly - 
> directories), you're not going to be lockless.

Oh that's true, I forgot. So the many small files case will often have
as much d_lock activity as tree_lock.

> 
> I don't know how we could improve it. I've several times thought that we 
> _should_ be able to do the directory lookups under the rcu read lock and 
> never touch their d_count or d_lock at all, but the locking against 
> directory renaming depends very intimately on d_lock.
> 
> It is _possible_ that we should be able to handle it purely with just 
> memory ordering rather than depending on d_lock. That would be wonderful.
> 
> Of course, we do actually scale pretty damn well already. I'm just saying 
> that it's not perfect.
> 
> See __d_lookup() for details.

Yes I see. Perhaps a seqlock could do the trick (hmm, there already is one),
however we still have to increment the refcount, so there'll always be a
shared cacheline.
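
The seqlock read-side pattern in question, purely as an illustration
(rename_lock is the existing seqlock; lockless_lookup() is a made-up
stand-in for the unlocked lookup work):

	unsigned seq;

	do {
		seq = read_seqbegin(&rename_lock);
		dentry = lockless_lookup(parent, name);
	} while (read_seqretry(&rename_lock, seq));	/* retry if a rename raced */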

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-28  4:54               ` Nick Piggin
@ 2006-04-28  5:34                 ` Linus Torvalds
  -1 siblings, 0 replies; 99+ messages in thread
From: Linus Torvalds @ 2006-04-28  5:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Jens Axboe, linux-kernel, npiggin, linux-mm



On Fri, 28 Apr 2006, Nick Piggin wrote:
> > 
> > See __d_lookup() for details.
> 
> Yes I see. Perhaps a seqlock could do the trick (hmm, there already is one),
> however we still have to increment the refcount, so there'll always be a
> shared cacheline.

Actually, the thing I'd really _like_ to see is not even incrementing the 
refcount for intermediate directories (and those are actually the most 
common case).

It should be possible in theory to do a lookup of a long path all using 
the rcu_read_lock, and only do the refcount increment (and then you might 
as well do the d_lock thing) for the final component of the path.

Of course, it's not possible right now. We do each component separately, 
and we very much depend on the d_lock. For some things, we _have_ to do it 
that way (revalidation etc), so the "possible in theory" isn't always even 
true.

And every time I look at it, I decide that it's too damn complex, and the 
end result would look horrible, and that I'd probably get it wrong anyway.

Still, I've _looked_ at it several times.
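
Very roughly, the idea above would look something like this (pure sketch;
every helper named here is hypothetical and this is not how fs/namei.c
works today):

	rcu_read_lock();
	while ((name = next_component(&path)) != NULL) {
		dentry = rcu_lookup(dentry, name);  /* no d_lock, no refcount */
		if (!dentry || needs_revalidate(dentry))
			goto fall_back_to_locked_walk;
	}
	if (!dget_if_still_hashed(dentry))  /* ref only the final component */
		goto fall_back_to_locked_walk;
	rcu_read_unlock();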

		Linus

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 13:53 Lockless page cache test results Jens Axboe
@ 2006-04-28  9:10   ` Pavel Machek
  2006-04-26 16:55   ` Andrew Morton
  2006-04-28  9:10   ` Pavel Machek
  2 siblings, 0 replies; 99+ messages in thread
From: Pavel Machek @ 2006-04-28  9:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, Nick Piggin, Andrew Morton, linux-mm

On St 26-04-06 15:53:10, Jens Axboe wrote:
> Hi,
> 
> Running a splice benchmark on a 4-way IPF box, I decided to give the
> lockless page cache patches from Nick a spin. I've attached the results
> as a png, it pretty much speaks for itself.
> 
> The test in question splices a 1GiB file to a pipe and then splices that
> to some output. Normally that output would be something interesting, in
> this case it's simply /dev/null. So it tests the input side of things
> only, which is what I wanted to do here. To get adequate runtime, the
> operation is repeated a number of times (120 in this example). The
> benchmark does that number of loops with 1, 2, 3, and 4 clients each
> pinned to a private CPU. The pinning is mainly done for more stable
> results.

35GB/sec, AFAICS? Not sure how significant this benchmark is.. even
with 4 clients, you have 2.5GB/sec, and that is better than almost
anything you can splice to...
								Pavel
-- 
Thanks for all the (sleeping) penguins.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-28  9:10   ` Pavel Machek
@ 2006-04-28  9:21     ` Jens Axboe
  -1 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-28  9:21 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel, Nick Piggin, Andrew Morton, linux-mm

On Fri, Apr 28 2006, Pavel Machek wrote:
> On St 26-04-06 15:53:10, Jens Axboe wrote:
> > Hi,
> > 
> > Running a splice benchmark on a 4-way IPF box, I decided to give the
> > lockless page cache patches from Nick a spin. I've attached the results
> > as a png, it pretty much speaks for itself.
> > 
> > The test in question splices a 1GiB file to a pipe and then splices that
> > to some output. Normally that output would be something interesting, in
> > this case it's simply /dev/null. So it tests the input side of things
> > only, which is what I wanted to do here. To get adequate runtime, the
> > operation is repeated a number of times (120 in this example). The
> > benchmark does that number of loops with 1, 2, 3, and 4 clients each
> > pinned to a private CPU. The pinning is mainly done for more stable
> > results.
> 
> 35GB/sec, AFAICS? Not sure how significant this benchmark is.. even
> with 4 clients, you have 2.5GB/sec, and that is better than almost
> anything you can splice to...

2.5GB/sec isn't that much, and remember that is with the system fully
loaded and spending _all_ its time in the kernel. It's the pure page
cache lookup performance; I'd like to think you should have room for
more, with hundreds or even thousands of clients.

The point isn't the numbers themselves, rather the scalability of the
vanilla page cache vs the lockless one. And that is demonstrated aptly.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
@ 2006-04-28 11:28                 ` Wu Fengguang
  0 siblings, 0 replies; 99+ messages in thread
From: Wu Fengguang @ 2006-04-28 11:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, torvalds, linux-kernel, npiggin, linux-mm

On Wed, Apr 26, 2006 at 01:12:00PM -0700, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > With a 16-page gang lookup in splice, the top profile for the 4-client
> > case (which is now at 4GiB/sec instead of 3) are:
> > 
> > samples  %        symbol name
> > 30396    36.7217  __do_page_cache_readahead
> > 25843    31.2212  find_get_pages_contig
> > 9699     11.7174  default_idle
> 
> __do_page_cache_readahead() should use gang lookup.  We never got around to
> that, mainly because nothing really demonstrated a need.

I have been testing a patch for this for a while. The new function
looks like

static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
			pgoff_t offset, unsigned long nr_to_read)
{
	struct inode *inode = mapping->host;
	struct page *page;
	LIST_HEAD(page_pool);
	pgoff_t last_index;	/* The last page we want to read */
	pgoff_t hole_index;
	int ret = 0;
	loff_t isize = i_size_read(inode);

	last_index = ((isize - 1) >> PAGE_CACHE_SHIFT);

	if (unlikely(!isize || !nr_to_read))
		goto out;
	if (unlikely(last_index < offset))
		goto out;

	if (last_index > offset + nr_to_read - 1 &&
		offset < offset + nr_to_read)
		last_index = offset + nr_to_read - 1;

	/*
	 * Go through ranges of holes and preallocate all the absent pages.
	 */
next_hole_range:
	cond_resched();

	read_lock_irq(&mapping->tree_lock);
	hole_index = radix_tree_scan_hole(&mapping->page_tree,
					offset, last_index - offset + 1);

	if (hole_index > last_index) {	/* no more holes? */
		read_unlock_irq(&mapping->tree_lock);
		goto submit_io;
	}

	offset = radix_tree_scan_data(&mapping->page_tree, (void **)&page,
						hole_index, last_index);
	read_unlock_irq(&mapping->tree_lock);

	ddprintk("ra range %lu-%lu(%p)-%lu\n", hole_index, offset, page, last_index);

	for (;;) {
                page = page_cache_alloc_cold(mapping);
		if (!page)
			break;

		page->index = hole_index;
		list_add(&page->lru, &page_pool);
		ret++;
		BUG_ON(ret > nr_to_read);

		if (hole_index >= last_index)
			break;

		if (++hole_index >= offset)
			goto next_hole_range;
	}

submit_io:
	/*
	 * Now start the IO.  We ignore I/O errors - if the page is not
	 * uptodate then the caller will launch readpage again, and
	 * will then handle the error.
	 */
	if (ret)
		read_pages(mapping, filp, &page_pool, ret);
	BUG_ON(!list_empty(&page_pool));
out:
	return ret;
}

The radix_tree_scan_data()/radix_tree_scan_hole() functions called
above are more flexible than the original __lookup(). Perhaps we can
rebase radix_tree_gang_lookup() and find_get_pages_contig() on them.
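
The semantics assumed for the new primitive (inferred purely from the
usage above, so treat this as illustrative only) would let callers do
things like:

	pgoff_t hole;

	read_lock_irq(&mapping->tree_lock);
	/* index of the first empty slot in [index, index + 16), or a value
	 * past that range if it is fully populated */
	hole = radix_tree_scan_hole(&mapping->page_tree, index, 16);
	read_unlock_irq(&mapping->tree_lock);

	if (hole >= index + 16) {
		/* pages index .. index+15 all present; nothing to read ahead */
	}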

If it is deemed ok, I'll clean it up and submit the patch asap.

Thanks,
Wu

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-26 20:31               ` Christoph Lameter
@ 2006-04-28 14:01                 ` David Chinner
  -1 siblings, 0 replies; 99+ messages in thread
From: David Chinner @ 2006-04-28 14:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: dgc, Jens Axboe, Andrew Morton, linux-kernel, npiggin, linux-mm

On Wed, Apr 26, 2006 at 01:31:14PM -0700, Christoph Lameter wrote:
> Dave: Can you tell us more about the tree_lock contentions on I/O that you 
> have seen?

Sorry to be slow responding - I've been sick the last couple of days.

Take a large file - say Size = 5x RAM or so - and then start
N threads running at offset n * (Size / N), where n = the thread
number. They each read (Size / N) and so typically don't overlap.
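
A minimal stand-in for that access pattern (not the harness actually
used; NTHREADS, BLKSZ and the pthread/pread details are illustrative,
and error handling is omitted):

#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 8
#define BLKSZ	(256 * 1024)

static int fd;
static off_t size;

static void *reader(void *arg)	/* thread n reads slice n of the file */
{
	long n = (long)arg;
	off_t start = n * (size / NTHREADS), end = start + size / NTHREADS;
	char *buf = malloc(BLKSZ);
	off_t off;

	for (off = start; off < end; off += BLKSZ)
		pread(fd, buf, BLKSZ, off);	/* data is simply discarded */
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	long n;

	fd = open(argv[1], O_RDONLY);
	size = lseek(fd, 0, SEEK_END);
	for (n = 0; n < NTHREADS; n++)
		pthread_create(&tid[n], NULL, reader, (void *)n);
	for (n = 0; n < NTHREADS; n++)
		pthread_join(tid[n], NULL);
	return 0;
}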

Throughput with increasing numbers of threads on a 24p altix
on an XFS filesystem on 2.6.15-rc5 looks like:

++++ Local I/O Block size 262144 ++++ Thu Dec 22 03:41:42 PST 2005


Loads   Type    blksize count   av_time    tput    usr%   sys%   intr%
-----   ----    ------- -----   ------- -------    ----   ----   -----
  1      read   256.00K 256.00K   82.92  789.59    1.80  215.40   18.40
  2      read   256.00K 256.00K   53.97 1191.56    2.10  389.40   22.60
  4      read   256.00K 256.00K   37.83 1724.63    2.20  776.00   29.30
  8      read   256.00K 256.00K   52.57 1213.63    2.20 1423.60   24.30
  16     read   256.00K 256.00K   60.05 1057.03    1.90 1951.10   24.30
  32     read   256.00K 256.00K   82.13  744.73    2.00 2277.50   18.60
                                        ^^^^^^^         ^^^^^^^

Basically, we hit a scaling limitation between 4 and 8 threads. This was
consistent across I/O sizes from 4KB to 4MB. I took a simple 30s PC sample
profile:

user ticks:             0               0 %
kernel ticks:           2982            99.97 %
idle ticks:             4               0.13 %

Using /proc/kallsyms as the kernel map file.
====================================================================
                           Kernel

      Ticks     Percent  Cumulative   Routine
                          Percent
--------------------------------------------------------------------
       1897       63.62    63.62      _write_lock_irqsave
        467       15.66    79.28      _read_unlock_irq
         91        3.05    82.33      established_get_next
         74        2.48    84.81      generic__raw_read_trylock
         59        1.98    86.79      xfs_iunlock
         47        1.58    88.36      _write_unlock_irq
         46        1.54    89.91      xfs_bmapi
         40        1.34    91.25      do_generic_mapping_read
         35        1.17    92.42      xfs_ilock_map_shared
         26        0.87    93.29      __copy_user
         23        0.77    94.06      __do_page_cache_readahead
         16        0.54    94.60      unlock_page
         15        0.50    95.10      xfs_ilock
         15        0.50    95.61      shrink_cache
         15        0.50    96.11      _spin_unlock_irqrestore
         13        0.44    96.55      sub_preempt_count
         11        0.37    96.91      mpage_end_io_read
         10        0.34    97.25      add_preempt_count
         10        0.34    97.59      xfs_iomap
          9        0.30    97.89      _read_unlock


So read_unlock_irq looks to be triggered by the mapping->tree_lock.

I think that the write_lock_irqsave() contention is from memory
reclaim (shrink_list()->try_to_release_page()-> ->releasepage()->
xfs_vm_releasepage()-> try_to_free_buffers()->clear_page_dirty()->
test_clear_page_dirty()-> write_lock_irqsave(&mapping->tree_lock...))
because page cache memory was full of this one file and demand is
causing them to be constantly recycled.

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-28 14:01                 ` David Chinner
@ 2006-04-28 14:10                   ` David Chinner
  -1 siblings, 0 replies; 99+ messages in thread
From: David Chinner @ 2006-04-28 14:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Lameter, Jens Axboe, Andrew Morton, npiggin, linux-mm

On Sat, Apr 29, 2006 at 12:01:47AM +1000, David Chinner wrote:
> 
> 
> Loads   Type    blksize count   av_time    tput    usr%   sys%   intr%
> -----   ----    ------- -----   ------- -------    ----   ----   -----

Sorry, I forgot units:              (s)   (MiB/s)      (cpu usage)

>   1      read   256.00K 256.00K   82.92  789.59    1.80  215.40   18.40
>   2      read   256.00K 256.00K   53.97 1191.56    2.10  389.40   22.60
>   4      read   256.00K 256.00K   37.83 1724.63    2.20  776.00   29.30
>   8      read   256.00K 256.00K   52.57 1213.63    2.20 1423.60   24.30
>   16     read   256.00K 256.00K   60.05 1057.03    1.90 1951.10   24.30
>   32     read   256.00K 256.00K   82.13  744.73    2.00 2277.50   18.60
>                                         ^^^^^^^         ^^^^^^^

And the reader is dd to /dev/null.

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-28 14:01                 ` David Chinner
  (?)
  (?)
@ 2006-04-30  9:49                 ` Nick Piggin
  2006-04-30 11:20                     ` Nick Piggin
  2006-04-30 11:39                     ` Jens Axboe
  -1 siblings, 2 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-30  9:49 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, Jens Axboe, Andrew Morton, linux-kernel,
	npiggin, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1926 bytes --]

Hi David,

Forgive my selective quoting...

David Chinner wrote:
> Take a large file - say Size = 5x RAM or so - and then start
> N threads runnnning at offset (n / Size) where n = the thread
> number. They each read (Size / N) and so typically don't overlap. 
> 
> Throughput with increasing numbers of threads on a 24p altix
> on an XFS filesystem on 2.6.15-rc5 looks like:
> 
> Loads      tput
> -----   -------
>   1      789.59
>   2     1191.56
>   4     1724.63
>   8     1213.63
>   16    1057.03
>   32     744.73
> 
> Basically,  we hit a scaling limitation at b/t 4 and 8 threads. This was
> consistent across I/O sizes from 4KB to 4MB. I took a simple 30s PC sample
> profile:

> Percent  Routine
> --------------------------
>   63.62  _write_lock_irqsave
>   15.66  _read_unlock_irq

> So read_unlock_irq looks to be triggered by the mapping->tree_lock.
> 
> I think that the write_lock_irqsave() contention is from memory
> reclaim (shrink_list()->try_to_release_page()-> ->releasepage()->
> xfs_vm_releasepage()-> try_to_free_buffers()->clear_page_dirty()->
> test_clear_page_dirty()-> write_lock_irqsave(&mapping->tree_lock...))
> because page cache memory was full of this one file and demand is
> causing them to be constantly recycled.

I'd say you're right.

tree_lock contention will be coming from a number of sources. Reclaim,
as you say, will be a big one. mpage_readpages (from readahead) will
be another.

Then the read lock in find_get_page in generic_mapping_read will start
contending heavily with the writers and not get much concurrency.

I'm sure lockless (read-side) pagecache will help... not only will it
eliminate read_lock costs, but the reduced read contention should also
decrease write_lock contention and bouncing.

As well as lockless pagecache, I think we can batch tree_lock operations
in readahead. Would be interesting to see how much this patch helps.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-batch-ra-pagecache-add.patch --]
[-- Type: text/plain, Size: 6816 bytes --]

Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c	2006-04-30 19:19:14.000000000 +1000
+++ linux-2.6/fs/mpage.c	2006-04-30 19:23:08.000000000 +1000
@@ -26,6 +26,7 @@
 #include <linux/writeback.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * I/O completion handler for multipage BIOs.
@@ -389,31 +390,57 @@
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	sector_t last_block_in_bio = 0;
-	struct pagevec lru_pvec;
+	struct pagevec pvec;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
 
 	clear_buffer_mapped(&map_bh);
-	pagevec_init(&lru_pvec, 0);
+	pagevec_init(&pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
-					page->index, GFP_KERNEL)) {
-			bio = do_mpage_readpage(bio, page,
-					nr_pages - page_idx,
-					&last_block_in_bio, &map_bh,
-					&first_logical_block,
-					get_block);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else {
-			page_cache_release(page);
+
+		if (!pagevec_add(&pvec, page) || page_idx == nr_pages-1) {
+			int i = 0, in_cache;
+
+			if (radix_tree_preload(GFP_KERNEL))
+				goto pagevec_error;
+
+			write_lock_irq(&mapping->tree_lock);
+			for (; i < pagevec_count(&pvec); i++) {
+				struct page *page = pvec.pages[i];
+				unsigned long offset = page->index;
+
+				if (__add_to_page_cache(page, mapping, offset))
+					break; /* error */
+			}
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+
+			in_cache = i;
+			for (i = 0; i < in_cache; i++) {
+				struct page *page = pvec.pages[i];
+
+				bio = do_mpage_readpage(bio, page,
+						nr_pages - page_idx,
+						&last_block_in_bio, &map_bh,
+						&first_logical_block,
+						get_block);
+				lru_cache_add(page);
+			}
+
+pagevec_error:
+			for (; i < pagevec_count(&pvec); i++) {
+				struct page *page = pvec.pages[i];
+				page_cache_release(page);
+			}
+
+			pagevec_reinit(&pvec);
 		}
 	}
-	pagevec_lru_add(&lru_pvec);
+
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2006-04-30 19:19:14.000000000 +1000
+++ linux-2.6/mm/filemap.c	2006-04-30 19:20:01.000000000 +1000
@@ -394,6 +394,21 @@
 	return err;
 }
 
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+		pgoff_t offset)
+{
+	int error = radix_tree_insert(&mapping->page_tree, offset, page);
+	if (!error) {
+		page_cache_get(page);
+		SetPageLocked(page);
+		page->mapping = mapping;
+		page->index = offset;
+		mapping->nrpages++;
+		pagecache_acct(1);
+	}
+	return error;
+}
+
 /*
  * This function is used to add newly allocated pagecache pages:
  * the page is new, so we can just run SetPageLocked() against it.
@@ -407,19 +422,10 @@
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
-		write_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
-			page_cache_get(page);
-			SetPageLocked(page);
-			page->mapping = mapping;
-			page->index = offset;
-			mapping->nrpages++;
-			pagecache_acct(1);
-		}
-		write_unlock_irq(&mapping->tree_lock);
+		error = __add_to_page_cache(page, mapping, offset);
 		radix_tree_preload_end();
 	}
+
 	return error;
 }
 
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c	2006-04-30 19:19:14.000000000 +1000
+++ linux-2.6/mm/readahead.c	2006-04-30 19:23:19.000000000 +1000
@@ -14,6 +14,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -164,37 +165,60 @@
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static int read_pages(struct address_space *mapping, struct file *filp,
+static void __pagevec_read_pages(struct file *filp,
+		struct address_space *mapping, struct pagevec *pvec)
+{
+	int i = 0, in_cache;
+
+	if (radix_tree_preload(GFP_KERNEL))
+		goto out_error;
+
+	write_lock_irq(&mapping->tree_lock);
+	for (; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		unsigned long offset = page->index;
+
+		if (__add_to_page_cache(page, mapping, offset))
+			break; /* error */
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+
+	in_cache = i;
+	for (i = 0; i < in_cache; i++) {
+		struct page *page = pvec->pages[i];
+		mapping->a_ops->readpage(filp, page);
+		lru_cache_add(page);
+	}
+
+out_error:
+	for (; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		page_cache_release(page);
+	}
+
+	pagevec_reinit(pvec);
+}
+
+static void read_pages(struct address_space *mapping, struct file *filp,
 		struct list_head *pages, unsigned nr_pages)
 {
-	unsigned page_idx;
-	struct pagevec lru_pvec;
-	int ret;
+	unsigned i;
+	struct pagevec pvec;
 
 	if (mapping->a_ops->readpages) {
-		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
-		goto out;
+		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
+		return;
 	}
 
-	pagevec_init(&lru_pvec, 0);
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
+	pagevec_init(&pvec, 0);
+	for (i = 0; i < nr_pages; i++) {
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
-					page->index, GFP_KERNEL)) {
-			ret = mapping->a_ops->readpage(filp, page);
-			if (ret != AOP_TRUNCATED_PAGE) {
-				if (!pagevec_add(&lru_pvec, page))
-					__pagevec_lru_add(&lru_pvec);
-				continue;
-			} /* else fall through to release */
-		}
-		page_cache_release(page);
+
+		if (!pagevec_add(&pvec, page) || i == nr_pages-1)
+			__pagevec_read_pages(filp, mapping, &pvec);
 	}
-	pagevec_lru_add(&lru_pvec);
-	ret = 0;
-out:
-	return ret;
 }
 
 /*
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2006-04-30 19:19:14.000000000 +1000
+++ linux-2.6/include/linux/pagemap.h	2006-04-30 19:20:01.000000000 +1000
@@ -97,6 +97,8 @@
 extern int read_cache_pages(struct address_space *mapping,
 		struct list_head *pages, filler_t *filler, void *data);
 
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+				unsigned long index);
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 				unsigned long index, gfp_t gfp_mask);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-30  9:49                 ` Nick Piggin
@ 2006-04-30 11:20                     ` Nick Piggin
  2006-04-30 11:39                     ` Jens Axboe
  1 sibling, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-30 11:20 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Lameter, Jens Axboe, Andrew Morton, linux-kernel,
	npiggin, linux-mm

Nick Piggin wrote:

> As well as lockless pagecache, I think we can batch tree_lock operations
> in readahead. Would be interesting to see how much this patch helps.

Btw. the patch introduces multiple locked pages in pagecache from a single
thread; however, there should be no new deadlocks or lock orderings
introduced. The page locks are always acquired on brand-new pages, so they
will all be released. Visibility from other threads is no different to the
case where multiple pages are locked by multiple threads.
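
For illustration, here is the core of that pattern condensed from
__pagevec_read_pages() in the patch (error handling trimmed; not meant
to compile on its own): every page lock is taken by __add_to_page_cache()
on a freshly allocated page, so no other thread can already be holding it.

	if (radix_tree_preload(GFP_KERNEL))
		goto out_error;			/* nothing inserted yet */

	write_lock_irq(&mapping->tree_lock);	/* one lock round-trip per pagevec */
	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* new page: nobody else can hold its page lock yet */
		if (__add_to_page_cache(page, mapping, page->index))
			break;
	}
	write_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();

	in_cache = i;
	for (i = 0; i < in_cache; i++) {
		/* the I/O path unlocks each page again on completion */
		mapping->a_ops->readpage(filp, pvec->pages[i]);
		lru_cache_add(pvec->pages[i]);
	}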

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-30  9:49                 ` Nick Piggin
@ 2006-04-30 11:39                     ` Jens Axboe
  2006-04-30 11:39                     ` Jens Axboe
  1 sibling, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2006-04-30 11:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Chinner, Christoph Lameter, Andrew Morton, linux-kernel,
	npiggin, linux-mm

On Sun, Apr 30 2006, Nick Piggin wrote:
> @@ -407,19 +422,10 @@
>  	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
>  
>  	if (error == 0) {
> -		write_lock_irq(&mapping->tree_lock);
> -		error = radix_tree_insert(&mapping->page_tree, offset, page);
> -		if (!error) {
> -			page_cache_get(page);
> -			SetPageLocked(page);
> -			page->mapping = mapping;
> -			page->index = offset;
> -			mapping->nrpages++;
> -			pagecache_acct(1);
> -		}
> -		write_unlock_irq(&mapping->tree_lock);
> +		error = __add_to_page_cache(page, mapping, offset);
>  		radix_tree_preload_end();
>  	}
> +
>  	return error;

You killed a lock too many there. I'm sure it'd help scalability,
though :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
  2006-04-30 11:39                     ` Jens Axboe
  (?)
@ 2006-04-30 11:44                     ` Nick Piggin
  -1 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2006-04-30 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Chinner, Christoph Lameter, Andrew Morton, linux-kernel,
	npiggin, linux-mm

[-- Attachment #1: Type: text/plain, Size: 257 bytes --]

Jens Axboe wrote:

> You killed a lock too many there. I'm sure it'd help scalability,
> though :-)

Ssshh! That was my secret plan :)

Good catch though, thanks. I'll attach a new patch in case David
has a chance to try it out.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-batch-ra-pagecache-add.patch --]
[-- Type: text/plain, Size: 6960 bytes --]

Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c	2006-04-30 19:36:18.000000000 +1000
+++ linux-2.6/fs/mpage.c	2006-04-30 21:42:15.000000000 +1000
@@ -26,6 +26,7 @@
 #include <linux/writeback.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * I/O completion handler for multipage BIOs.
@@ -389,31 +390,57 @@ mpage_readpages(struct address_space *ma
 	struct bio *bio = NULL;
 	unsigned page_idx;
 	sector_t last_block_in_bio = 0;
-	struct pagevec lru_pvec;
+	struct pagevec pvec;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
 
 	clear_buffer_mapped(&map_bh);
-	pagevec_init(&lru_pvec, 0);
+	pagevec_init(&pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_entry(pages->prev, struct page, lru);
 
 		prefetchw(&page->flags);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
-					page->index, GFP_KERNEL)) {
-			bio = do_mpage_readpage(bio, page,
-					nr_pages - page_idx,
-					&last_block_in_bio, &map_bh,
-					&first_logical_block,
-					get_block);
-			if (!pagevec_add(&lru_pvec, page))
-				__pagevec_lru_add(&lru_pvec);
-		} else {
-			page_cache_release(page);
+
+		if (!pagevec_add(&pvec, page) || page_idx == nr_pages-1) {
+			int i = 0, in_cache;
+
+			if (radix_tree_preload(GFP_KERNEL))
+				goto pagevec_error;
+
+			write_lock_irq(&mapping->tree_lock);
+			for (; i < pagevec_count(&pvec); i++) {
+				struct page *page = pvec.pages[i];
+				unsigned long offset = page->index;
+
+				if (__add_to_page_cache(page, mapping, offset))
+					break; /* error */
+			}
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+
+			in_cache = i;
+			for (i = 0; i < in_cache; i++) {
+				struct page *page = pvec.pages[i];
+
+				bio = do_mpage_readpage(bio, page,
+						nr_pages - page_idx,
+						&last_block_in_bio, &map_bh,
+						&first_logical_block,
+						get_block);
+				lru_cache_add(page);
+			}
+
+pagevec_error:
+			for (; i < pagevec_count(&pvec); i++) {
+				struct page *page = pvec.pages[i];
+				page_cache_release(page);
+			}
+
+			pagevec_reinit(&pvec);
 		}
 	}
-	pagevec_lru_add(&lru_pvec);
+
 	BUG_ON(!list_empty(pages));
 	if (bio)
 		mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2006-04-30 19:36:18.000000000 +1000
+++ linux-2.6/mm/filemap.c	2006-04-30 21:42:42.000000000 +1000
@@ -394,6 +394,21 @@ int filemap_write_and_wait_range(struct 
 	return err;
 }
 
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+		pgoff_t offset)
+{
+	int error = radix_tree_insert(&mapping->page_tree, offset, page);
+	if (!error) {
+		page_cache_get(page);
+		SetPageLocked(page);
+		page->mapping = mapping;
+		page->index = offset;
+		mapping->nrpages++;
+		pagecache_acct(1);
+	}
+	return error;
+}
+
 /*
  * This function is used to add newly allocated pagecache pages:
  * the page is new, so we can just run SetPageLocked() against it.
@@ -408,18 +423,11 @@ int add_to_page_cache(struct page *page,
 
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
-			page_cache_get(page);
-			SetPageLocked(page);
-			page->mapping = mapping;
-			page->index = offset;
-			mapping->nrpages++;
-			pagecache_acct(1);
-		}
+		error = __add_to_page_cache(page, mapping, offset);
 		write_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
 	}
+
 	return error;
 }
 
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c	2006-04-30 19:36:18.000000000 +1000
+++ linux-2.6/mm/readahead.c	2006-04-30 21:42:15.000000000 +1000
@@ -14,6 +14,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -164,37 +165,60 @@ int read_cache_pages(struct address_spac
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static int read_pages(struct address_space *mapping, struct file *filp,
+static void __pagevec_read_pages(struct file *filp,
+		struct address_space *mapping, struct pagevec *pvec)
+{
+	int i = 0, in_cache;
+
+	if (radix_tree_preload(GFP_KERNEL))
+		goto out_error;
+
+	write_lock_irq(&mapping->tree_lock);
+	for (; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		unsigned long offset = page->index;
+
+		if (__add_to_page_cache(page, mapping, offset))
+			break; /* error */
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+
+	in_cache = i;
+	for (i = 0; i < in_cache; i++) {
+		struct page *page = pvec->pages[i];
+		mapping->a_ops->readpage(filp, page);
+		lru_cache_add(page);
+	}
+
+out_error:
+	for (; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		page_cache_release(page);
+	}
+
+	pagevec_reinit(pvec);
+}
+
+static void read_pages(struct address_space *mapping, struct file *filp,
 		struct list_head *pages, unsigned nr_pages)
 {
-	unsigned page_idx;
-	struct pagevec lru_pvec;
-	int ret;
+	unsigned i;
+	struct pagevec pvec;
 
 	if (mapping->a_ops->readpages) {
-		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
-		goto out;
+		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
+		return;
 	}
 
-	pagevec_init(&lru_pvec, 0);
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
+	pagevec_init(&pvec, 0);
+	for (i = 0; i < nr_pages; i++) {
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache(page, mapping,
-					page->index, GFP_KERNEL)) {
-			ret = mapping->a_ops->readpage(filp, page);
-			if (ret != AOP_TRUNCATED_PAGE) {
-				if (!pagevec_add(&lru_pvec, page))
-					__pagevec_lru_add(&lru_pvec);
-				continue;
-			} /* else fall through to release */
-		}
-		page_cache_release(page);
+
+		if (!pagevec_add(&pvec, page) || i == nr_pages-1)
+			__pagevec_read_pages(filp, mapping, &pvec);
 	}
-	pagevec_lru_add(&lru_pvec);
-	ret = 0;
-out:
-	return ret;
 }
 
 /*
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2006-04-30 19:36:18.000000000 +1000
+++ linux-2.6/include/linux/pagemap.h	2006-04-30 21:42:16.000000000 +1000
@@ -97,6 +97,8 @@ extern struct page * read_cache_page(str
 extern int read_cache_pages(struct address_space *mapping,
 		struct list_head *pages, filler_t *filler, void *data);
 
+int __add_to_page_cache(struct page *page, struct address_space *mapping,
+				unsigned long index);
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 				unsigned long index, gfp_t gfp_mask);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
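
For reference, the contract of the new __add_to_page_cache() helper as
this patch uses it, restated from the corrected add_to_page_cache() hunk
above (illustrative only, not part of the patch): the caller does the
radix-tree preload and holds mapping->tree_lock around the call, which
is exactly the lock that had gone missing in the earlier version.

	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
	if (error == 0) {
		/* the helper itself does not take tree_lock */
		write_lock_irq(&mapping->tree_lock);
		error = __add_to_page_cache(page, mapping, offset);
		write_unlock_irq(&mapping->tree_lock);
		radix_tree_preload_end();
	}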

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Lockless page cache test results
@ 2006-04-28 16:58 Al Boldi
  0 siblings, 0 replies; 99+ messages in thread
From: Al Boldi @ 2006-04-28 16:58 UTC (permalink / raw)
  To: linux-kernel

Wu Fengguang wrote:
> On Wed, Apr 26, 2006 at 01:12:00PM -0700, Andrew Morton wrote:
> > Jens Axboe <axboe@suse.de> wrote:
> > > With a 16-page gang lookup in splice, the top profile for the 4-client
> > > case (which is now at 4GiB/sec instead of 3) are:
> > >
> > > samples  %        symbol name
> > > 30396    36.7217  __do_page_cache_readahead
> > > 25843    31.2212  find_get_pages_contig
> > > 9699     11.7174  default_idle
> >
> > __do_page_cache_readahead() should use gang lookup.  We never got around
> > to that, mainly because nothing really demonstrated a need.
>
> I have been testing a patch for this for a while. The new function
> looks like
>
> static int
> __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>                         pgoff_t offset, unsigned long nr_to_read)
> {
>         struct inode *inode = mapping->host;
>         struct page *page;
>         LIST_HEAD(page_pool);
>         pgoff_t last_index;     /* The last page we want to read */
>         pgoff_t hole_index;
>         int ret = 0;
>         loff_t isize = i_size_read(inode);
>         last_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
>
>         if (unlikely(!isize || !nr_to_read))
>                 goto out;
>         if (unlikely(last_index < offset))
>                 goto out;
>         if (last_index > offset + nr_to_read - 1 &&
>                 offset < offset + nr_to_read)
>                 last_index = offset + nr_to_read - 1;
>         /*
>          * Go through ranges of holes and preallocate all the absent pages.
>          */
> next_hole_range:
>         cond_resched();
>         read_lock_irq(&mapping->tree_lock);
>         hole_index = radix_tree_scan_hole(&mapping->page_tree,
>                                         offset, last_index - offset + 1);
>
>         if (hole_index > last_index) {  /* no more holes? */
>                 read_unlock_irq(&mapping->tree_lock);
>                 goto submit_io;
>         }
>         offset = radix_tree_scan_data(&mapping->page_tree, (void **)&page,
>                                                 hole_index, last_index);
>         read_unlock_irq(&mapping->tree_lock);
>
>         ddprintk("ra range %lu-%lu(%p)-%lu\n", hole_index, offset, page,
>                 last_index);
>
>         for (;;) {
>                 page = page_cache_alloc_cold(mapping);
>                 if (!page)
>                         break;
>                 page->index = hole_index;
>                 list_add(&page->lru, &page_pool);
>                 ret++;
>                 BUG_ON(ret > nr_to_read);
>                 if (hole_index >= last_index)
>                         break;
>                 if (++hole_index >= offset)
>                         goto next_hole_range;
>         }
> submit_io:
>         /*
>          * Now start the IO.  We ignore I/O errors - if the page is not
>          * uptodate then the caller will launch readpage again, and
>          * will then handle the error.
>          */
>         if (ret)
>                 read_pages(mapping, filp, &page_pool, ret);
>         BUG_ON(!list_empty(&page_pool));
> out:
>         return ret;
> }
> The radix_tree_scan_data()/radix_tree_scan_hole() functions called
> above are more flexible than the original __lookup(). Perhaps we can
> rebase radix_tree_gang_lookup() and find_get_pages_contig() on them.
>
> If it is deemed ok, I'll clean it up and submit the patch asap.

Can you patch it up for 2.6.16 asap?

Thanks!

--
Al
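
For context, the calls in the quoted code imply roughly the following
semantics for radix_tree_scan_hole(). This is only a naive illustrative
sketch of that behaviour, assuming <linux/radix-tree.h>; the helper
comes from Wu's out-of-tree patch, and its real implementation
presumably walks the tree directly instead of probing one index at a
time.

/*
 * Illustrative only: return the first index in [start, start + nr)
 * with no item present, or start + nr if the range is fully populated.
 */
static unsigned long radix_tree_scan_hole_naive(struct radix_tree_root *root,
				unsigned long start, unsigned long nr)
{
	unsigned long index;

	/* probe each index in turn; real code would walk the tree */
	for (index = start; index < start + nr; index++)
		if (!radix_tree_lookup(root, index))
			break;
	return index;
}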


^ permalink raw reply	[flat|nested] 99+ messages in thread

end of thread, other threads:[~2006-04-30 13:38 UTC | newest]

Thread overview: 99+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-04-26 13:53 Lockless page cache test results Jens Axboe
2006-04-26 14:43 ` Nick Piggin
2006-04-26 14:43   ` Nick Piggin
2006-04-26 19:46   ` Jens Axboe
2006-04-26 19:46     ` Jens Axboe
2006-04-27  5:39     ` Chen, Kenneth W
2006-04-27  5:39       ` Chen, Kenneth W
2006-04-27  6:07       ` Nick Piggin
2006-04-27  6:07         ` Nick Piggin
2006-04-27  6:15       ` Andi Kleen
2006-04-27  6:15         ` Andi Kleen
2006-04-27  7:51         ` Chen, Kenneth W
2006-04-27  7:51           ` Chen, Kenneth W
2006-04-26 16:55 ` Andrew Morton
2006-04-26 16:55   ` Andrew Morton
2006-04-26 17:42   ` Jens Axboe
2006-04-26 17:42     ` Jens Axboe
2006-04-26 18:10     ` Andrew Morton
2006-04-26 18:10       ` Andrew Morton
2006-04-26 18:23       ` Jens Axboe
2006-04-26 18:23         ` Jens Axboe
2006-04-26 18:46         ` Andrew Morton
2006-04-26 18:46           ` Andrew Morton
2006-04-26 19:21           ` Jens Axboe
2006-04-26 19:21             ` Jens Axboe
2006-04-27  5:58           ` Nick Piggin
2006-04-27  5:58             ` Nick Piggin
2006-04-26 18:34       ` Christoph Lameter
2006-04-26 18:34         ` Christoph Lameter
2006-04-26 18:47         ` Andrew Morton
2006-04-26 18:47           ` Andrew Morton
2006-04-26 18:48           ` Christoph Lameter
2006-04-26 18:48             ` Christoph Lameter
2006-04-26 18:49           ` Jens Axboe
2006-04-26 18:49             ` Jens Axboe
2006-04-26 20:31             ` Christoph Lameter
2006-04-26 20:31               ` Christoph Lameter
2006-04-28 14:01               ` David Chinner
2006-04-28 14:01                 ` David Chinner
2006-04-28 14:10                 ` David Chinner
2006-04-28 14:10                   ` David Chinner
2006-04-30  9:49                 ` Nick Piggin
2006-04-30 11:20                   ` Nick Piggin
2006-04-30 11:20                     ` Nick Piggin
2006-04-30 11:39                   ` Jens Axboe
2006-04-30 11:39                     ` Jens Axboe
2006-04-30 11:44                     ` Nick Piggin
2006-04-26 18:58       ` Christoph Hellwig
2006-04-26 18:58         ` Christoph Hellwig
2006-04-26 19:02         ` Jens Axboe
2006-04-26 19:02           ` Jens Axboe
2006-04-26 19:00       ` Linus Torvalds
2006-04-26 19:00         ` Linus Torvalds
2006-04-26 19:15         ` Jens Axboe
2006-04-26 19:15           ` Jens Axboe
2006-04-26 20:12           ` Andrew Morton
2006-04-26 20:12             ` Andrew Morton
2006-04-27  7:45             ` Jens Axboe
2006-04-27  7:47               ` Jens Axboe
2006-04-27  7:47                 ` Jens Axboe
2006-04-27  7:57               ` Nick Piggin
2006-04-27  7:57                 ` Nick Piggin
2006-04-27  8:02                 ` Nick Piggin
2006-04-27  8:02                   ` Nick Piggin
2006-04-27  9:00                   ` Jens Axboe
2006-04-27  9:00                     ` Jens Axboe
2006-04-27 13:36                     ` Nick Piggin
2006-04-27 13:36                       ` Nick Piggin
2006-04-27  8:36                 ` Jens Axboe
2006-04-27  8:36                   ` Jens Axboe
2006-04-28 11:28             ` Wu Fengguang
2006-04-28 11:28               ` Wu Fengguang
2006-04-28 11:28                 ` Wu Fengguang
2006-04-27  5:49         ` Nick Piggin
2006-04-27  5:49           ` Nick Piggin
2006-04-27 15:12           ` Linus Torvalds
2006-04-27 15:12             ` Linus Torvalds
2006-04-28  4:54             ` Nick Piggin
2006-04-28  4:54               ` Nick Piggin
2006-04-28  5:34               ` Linus Torvalds
2006-04-28  5:34                 ` Linus Torvalds
2006-04-27  9:35         ` Jens Axboe
2006-04-27  5:22       ` Nick Piggin
2006-04-27  5:22         ` Nick Piggin
2006-04-26 18:57     ` Jens Axboe
2006-04-27  2:19       ` KAMEZAWA Hiroyuki
2006-04-27  2:19         ` KAMEZAWA Hiroyuki
2006-04-27  8:03         ` Jens Axboe
2006-04-27  8:03           ` Jens Axboe
2006-04-27 11:16           ` Jens Axboe
2006-04-27 11:16             ` Jens Axboe
2006-04-27 11:41             ` KAMEZAWA Hiroyuki
2006-04-27 11:41               ` KAMEZAWA Hiroyuki
2006-04-27 11:45               ` Jens Axboe
2006-04-27 11:45                 ` Jens Axboe
2006-04-28  9:10 ` Pavel Machek
2006-04-28  9:10   ` Pavel Machek
2006-04-28  9:21   ` Jens Axboe
2006-04-28  9:21     ` Jens Axboe
2006-04-28 16:58 Al Boldi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.