linux-kernel.vger.kernel.org archive mirror
* Re: process creation time increases linearly with shmem
@ 2005-12-14 14:07 Brice Oliver
  2005-12-14 16:21 ` Hugh Dickins
  0 siblings, 1 reply; 36+ messages in thread
From: Brice Oliver @ 2005-12-14 14:07 UTC (permalink / raw)
  To: linux-kernel

Sorry if this is a bit of an unusual request, but I am just trying to
get more information.

I was working with Ray on this issue, and in order to get the
appropriate patch, I need to get the bugzilla number for this
particular patch so that I can turn that in to Red Hat so they will
include this fix in their
release (as they will not allow the kernel patch to be applied unless
they apply it to their source and then distribute to their customers).

Is that something that can be provided to me here?

Thanks,
Brice Oliver


* Re: process creation time increases linearly with shmem
  2005-12-14 14:07 process creation time increases linearly with shmem Brice Oliver
@ 2005-12-14 16:21 ` Hugh Dickins
  0 siblings, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2005-12-14 16:21 UTC (permalink / raw)
  To: Brice Oliver; +Cc: linux-kernel

On Wed, 14 Dec 2005, Brice Oliver wrote:

> Sorry if this is a bit of an unusual request, but I am just trying to
> get more information.
> 
> I was working with Ray on this issue, and in order to get the
> appropriate patch, I need to get the bugzilla number for this
> particular patch so that I can turn that in to Red Hat so they will
> include this fix in their
> release (as they will not allow the kernel patch to be applied unless
> they apply it to their source and then distribute to their customers).
> 
> Is that something that can be provided to me here?

I didn't realize there ever was a bugzilla number for it.
This is the actual patch, which Linus put into his tree on 30th August:

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Sun, 28 Aug 2005 06:49:11 +0000 (+1000)
Subject: [PATCH] Lazy page table copies in fork()
X-Git-Tag: v2.6.14-rc1
X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d992895ba2b27cf5adf1ba0ad6d27662adc54c5e

[PATCH] Lazy page table copies in fork()

Defer copying of ptes until fault time when it is possible to reconstruct
the pte from backing store. Idea from Andi Kleen and Nick Piggin.

Thanks to input from Rik van Riel and Linus and to Hugh for correcting
my blundering.

Ray Fucillo <fucillo@intersystems.com> reports:

  "I applied this latest patch to a 2.6.12 kernel and found that it does
   resolve the problem.  Prior to the patch on this machine, I was
   seeing about 23ms spent in fork for every 100MB of shared memory
   segment.

   After applying the patch, fork is taking about 1ms regardless of the
   shared memory size."

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -498,6 +498,17 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Don't copy ptes where a page fault will fill them correctly.
+	 * Fork becomes much lighter when there are big shared or private
+	 * readonly mappings. The tradeoff is that copy_page_range is more
+	 * efficient than faulting.
+	 */
+	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+		if (!vma->anon_vma)
+			return 0;
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 


* Re: process creation time increases linearly with shmem
  2005-08-30  0:29                           ` Nick Piggin
@ 2005-08-30  1:03                             ` Linus Torvalds
  0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2005-08-30  1:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ray Fucillo, Hugh Dickins, Rik van Riel, Andi Kleen,
	Andrew Morton, linux-kernel



On Tue, 30 Aug 2005, Nick Piggin wrote:
> 
> Andrew, did you pick up the patch or should I resend to someone?

I picked it up. If it causes performance regressions, we can fix them, and
if it causes other problems then that will be interesting in itself.

		Linus


* Re: process creation time increases linearly with shmem
  2005-08-29 23:33                         ` Ray Fucillo
  2005-08-30  0:29                           ` Nick Piggin
@ 2005-08-30  0:34                           ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2005-08-30  0:34 UTC (permalink / raw)
  To: Ray Fucillo
  Cc: Nick Piggin, Hugh Dickins, Rik van Riel, Andi Kleen,
	Andrew Morton, linux-kernel



On Mon, 29 Aug 2005, Ray Fucillo wrote:
> 
> FWIW, an interesting side effect of this occurs when I run the database 
> with this patch internally on a Linux server that uses NIS.  It's an
> unrelated problem and not a kernel problem.  It's due to the children
> calling initgroups()...  apparently when you have many processes making 
> simultaneous initgroups() calls something starts imposing very long 
> waits in increments of 3 seconds

Sounds like something is backing off by waiting for three seconds whenever
some lock failure occurs. I don't see what locking the code might want to
do (it should just do the NIS equivalent of reading /etc/group and do a
"setgroups()" system call), but I assume that the NIS server ends up
having some strange locking.

You might do an "ltrace testcase" (and, probably, the nis server) to see
if you can see where it happens, and bug the appropriate maintainers.
Especially if you have a repeatable test-case (where "repeatable" isn't
just for your particular machine: it's probably timing-related), somebody
might even fix it ;)

		Linus
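
As a rough sketch of that description - a hypothetical my_initgroups(),
not glibc's actual implementation - the whole operation boils down to:

#include <grp.h>
#include <limits.h>
#include <unistd.h>

/* Hypothetical sketch: gather the user's supplementary groups from
 * the group database (files, NIS, ...), then make one setgroups()
 * call.  Not glibc's real code, which differs in detail. */
int my_initgroups(const char *user, gid_t basegid)
{
	gid_t groups[NGROUPS_MAX];
	int n = NGROUPS_MAX;

	if (getgrouplist(user, basegid, groups, &n) < 0)
		return -1;		/* more groups than NGROUPS_MAX */
	return setgroups(n, groups);	/* needs CAP_SETGID */
}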


* Re: process creation time increases linearly with shmem
  2005-08-29 23:33                         ` Ray Fucillo
@ 2005-08-30  0:29                           ` Nick Piggin
  2005-08-30  1:03                             ` Linus Torvalds
  2005-08-30  0:34                           ` Linus Torvalds
  1 sibling, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2005-08-30  0:29 UTC (permalink / raw)
  To: Ray Fucillo
  Cc: Hugh Dickins, Linus Torvalds, Rik van Riel, Andi Kleen,
	Andrew Morton, linux-kernel

Ray Fucillo wrote:

> Nick Piggin wrote:
>
>> How does the following look? (I changed the comment a bit). Andrew,
>> please apply if nobody objects.
>
>
> Nick, I applied this latest patch to a 2.6.12 kernel and found that it 
> does resolve the problem.  Prior to the patch on this machine, I was 
> seeing about 23ms spent in fork for every 100MB of shared memory
> segment.  After applying the patch, fork is taking about 1ms 
> regardless of the shared memory size.
>

Hi Ray,
That's good news. I think we should probably consider putting the patch in
2.6.14 or if not, then definitely 2.6.15.

Andrew, did you pick up the patch or should I resend to someone?

I think the fork latency alone is enough to justify inclusion...
however, did you actually see increased aggregate throughput of your
database (or at least not a _decreased_ throughput)?

> Many thanks to everyone for your help on this.
>

Well thank you very much for breaking the kernel and telling us about it! :)

Nick



* Re: process creation time increases linearly with shmem
  2005-08-28  6:49                       ` Nick Piggin
@ 2005-08-29 23:33                         ` Ray Fucillo
  2005-08-30  0:29                           ` Nick Piggin
  2005-08-30  0:34                           ` Linus Torvalds
  0 siblings, 2 replies; 36+ messages in thread
From: Ray Fucillo @ 2005-08-29 23:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Linus Torvalds, Rik van Riel, Andi Kleen,
	Andrew Morton, linux-kernel

Nick Piggin wrote:
> How does the following look? (I changed the comment a bit). Andrew, please
> apply if nobody objects.

Nick, I applied this latest patch to a 2.6.12 kernel and found that it 
does resolve the problem.  Prior to the patch on this machine, I was 
seeing about 23ms spent in fork for every 100MB of shared memory segment.
After applying the patch, fork is taking about 1ms regardless of the
shared memory size.

Many thanks to everyone for your help on this.

FWIW, an interesting side effect of this occurs when I run the database
with this patch internally on a Linux server that uses NIS.  It's an
unrelated problem and not a kernel problem.  It's due to the children
calling initgroups()...  apparently when you have many processes making
simultaneous initgroups() calls, something starts imposing very long
waits in increments of 3 seconds, so some processes return from
initgroups() in a few ms and other processes complete in 3, 6, 9, up to
21 seconds (plus a few ms).  I'm not sure what the story is with that,
though it's clearly not a kernel issue.  If someone happens to have the
answer or a suggestion, great; otherwise I'll pursue that elsewhere as
necessary.  (I can reproduce this by simply adding a call to
initgroups() in the child of the forktest program that I sent earlier.)
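
A minimal sketch of that reproduction - hypothetical, not the original
forktest.c; the user name "nobody" is a placeholder, and initgroups()
needs root to succeed:

#include <grp.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork many children that call initgroups() simultaneously and
 * time each call, to look for the 3-second-increment waits. */
int main(void)
{
	int i;

	for (i = 0; i < 50; i++) {
		if (fork() == 0) {
			struct timeval t0, t1;

			gettimeofday(&t0, NULL);
			if (initgroups("nobody", getgid()) != 0)
				perror("initgroups");
			gettimeofday(&t1, NULL);
			printf("initgroups: %ld ms\n",
			       (t1.tv_sec - t0.tv_sec) * 1000L +
			       (t1.tv_usec - t0.tv_usec) / 1000);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}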




* Re: process creation time increases linearly with shmem
  2005-08-28  4:26                     ` Hugh Dickins
@ 2005-08-28  6:49                       ` Nick Piggin
  2005-08-29 23:33                         ` Ray Fucillo
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2005-08-28  6:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Rik van Riel, Ray Fucillo, Andi Kleen,
	Andrew Morton, linux-kernel


Hugh Dickins wrote:
> On Sun, 28 Aug 2005, Nick Piggin wrote:
> 
>>This is the condition I ended up with. Any good?
>>
>>if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
>>	if (vma->vm_flags & VM_MAYSHARE)
>>		return 0;
>>	if (vma->vm_file && !vma->anon_vma)
>>		return 0;
>>}
> 
> 
> It's not bad, and practical timings are unlikely to differ, but your
> VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *),
> and your vm_file test is unnecessary, excluding pure anonymous areas
> which haven't yet taken a fault.
> 

Haven't taken a _write_ fault? Hmm, OK, that would seem to be a good
optimisation as well: we don't need to copy anon memory with only
ZERO_PAGE mappings... well, good as in "nice and logical" if not so
much "will make a difference"!

> Please do send Andrew the patch for -mm, Nick: you were one of the
> creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus),
> much better that it go in your name (heh, heh, heh, can you trust me?)
> 

Well, Andi and I seem to have had the idea independently, Linus thought
private would be a good idea (I agree), you came up with the complete
patch with others contributing bits and pieces, and most importantly
Ray brought the possible deficiency in our mm to our attention.

> Hugh
> 
> * That's ignoring, as we do everywhere else, the case which came up
> a couple of weeks back in discussions with Linus, ptrace writing to
> an area the process does not have write access to, creating an anon
> page within a shared vma: that's an awkward case currently mishandled,
> but the patch below does it no harm.
> 

And in that case maybe your patch works better anyway, because the child
will inherit that page from the parent.

How does the following look? (I changed the comment a bit). Andrew, please
apply if nobody objects.

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: vm-lazy-fork.patch --]

Defer copying of ptes until fault time when it is possible to reconstruct
the pte from backing store. Idea from Andi Kleen and Nick Piggin.

Thanks to input from Rik van Riel and Linus and to Hugh for correcting
my blundering.

[ Note to akpm: Ray Fucillo <fucillo@intersystems.com>'s results go here ]

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2005-08-13 11:16:34.000000000 +1000
+++ linux-2.6/mm/memory.c	2005-08-28 16:41:32.000000000 +1000
@@ -498,6 +498,17 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Don't copy ptes where a page fault will fill them correctly.
+	 * Fork becomes much lighter when there are big shared or private
+	 * readonly mappings. The tradeoff is that copy_page_range is more
+	 * efficient than faulting.
+	 */
+	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+		if (!vma->anon_vma)
+			return 0;
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 


* Re: process creation time increases linearly with shmem
  2005-08-27 15:05                   ` Nick Piggin
@ 2005-08-28  4:26                     ` Hugh Dickins
  2005-08-28  6:49                       ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Hugh Dickins @ 2005-08-28  4:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linus Torvalds, Rik van Riel, Ray Fucillo, linux-kernel

On Sun, 28 Aug 2005, Nick Piggin wrote:
> 
> This is the condition I ended up with. Any good?
> 
> if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
> 	if (vma->vm_flags & VM_MAYSHARE)
> 		return 0;
> 	if (vma->vm_file && !vma->anon_vma)
> 		return 0;
> }

It's not bad, and practical timings are unlikely to differ, but your
VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *),
and your vm_file test is unnecessary, excluding pure anonymous areas
which haven't yet taken a fault.

Please do send Andrew the patch for -mm, Nick: you were one of the
creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus),
much better that it go in your name (heh, heh, heh, can you trust me?)

Hugh

* That's ignoring, as we do everywhere else, the case which came up
a couple of weeks back in discussions with Linus, ptrace writing to
an area the process does not have write access to, creating an anon
page within a shared vma: that's an awkward case currently mishandled,
but the patch below does it no harm.

--- 2.6.13-rc7/mm/memory.c	2005-08-24 11:13:41.000000000 +0100
+++ linux/mm/memory.c	2005-08-28 04:48:34.000000000 +0100
@@ -498,6 +498,15 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Assume the fork will probably exec: don't waste time copying
+	 * ptes where a page fault will fill them correctly afterwards.
+	 */
+	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+		if (!vma->anon_vma)
+			return 0;
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 


* Re: process creation time increases linearly with shmem
  2005-08-26 23:23                 ` Linus Torvalds
@ 2005-08-27 15:05                   ` Nick Piggin
  2005-08-28  4:26                     ` Hugh Dickins
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2005-08-27 15:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Hugh Dickins, Ray Fucillo, linux-kernel

Linus Torvalds wrote:
> 
> On Fri, 26 Aug 2005, Rik van Riel wrote:
> 
>>On Fri, 26 Aug 2005, Hugh Dickins wrote:
>>
>>
>>>Well, I still don't think we need to test vm_file.  We can add an
>>>anon_vma test if you like, if we really want to minimize the fork
>>>overhead, in favour of later faults.  Do we?
>>
>>When you consider NUMA placement (the child process may
>>end up running elsewhere), allocating things like page
>>tables lazily may well end up being a performance win.
> 
> 
> It should be easy enough to benchmark something like kernel compiles etc, 
> which are reasonably fork-rich and should show a good mix for something 
> like this. Or even just something like "time to restart an X session" after
> you've brought it into memory once.
> 

2.6.13-rc7-git2
kbuild (make -j4) on a dual G5.

plain
228.85user 19.90system 2:06.50elapsed 196%CPU (3725666minor)
228.91user 19.90system 2:06.07elapsed 197%CPU (3721353minor)
229.00user 19.78system 2:06.20elapsed 197%CPU (3721345minor)
228.81user 19.94system 2:06.05elapsed 197%CPU (3723791minor)

nocopy shared
229.28user 19.76system 2:06.24elapsed 197%CPU (3725661minor)
229.04user 19.91system 2:06.92elapsed 196%CPU (3718904minor)
228.97user 20.06system 2:06.46elapsed 196%CPU (3723807minor)
229.24user 19.84system 2:06.13elapsed 197%CPU (3723793minor)

nocopy all
228.74user 19.87system 2:06.27elapsed 196%CPU (3819927minor)
228.89user 19.81system 2:05.89elapsed 197%CPU (3822943minor)
228.77user 19.73system 2:06.23elapsed 196%CPU (3820517minor)
228.93user 19.70system 2:05.84elapsed 197%CPU (3822935minor)

I'd say the full test (including anon_vma) is maybe slightly
faster on this test, though maybe it isn't significant.

It is doing around 2.5% more minor faults, though the profiles
say copy_page_range time is reduced as one would expect.

I think that if all else (ie. final performance) is equal, then
faulting is better than copying because the work is being
deferred until it is needed, and we dodge some pathological
cases like Ray's database taking 100s of ms to fork (we hope!)

However it will always depend on workload.

This is the condition I ended up with. Any good?

   if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
     if (vma->vm_flags & VM_MAYSHARE)
       return 0;
     if (vma->vm_file && !vma->anon_vma)
       return 0;
   }

-- 
SUSE Labs, Novell Inc.


* Re: process creation time increases linearly with shmem
  2005-08-26 23:10               ` Rik van Riel
@ 2005-08-26 23:23                 ` Linus Torvalds
  2005-08-27 15:05                   ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2005-08-26 23:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Hugh Dickins, Nick Piggin, Ray Fucillo, linux-kernel



On Fri, 26 Aug 2005, Rik van Riel wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> 
> > Well, I still don't think we need to test vm_file.  We can add an
> > anon_vma test if you like, if we really want to minimize the fork
> > overhead, in favour of later faults.  Do we?
> 
> When you consider NUMA placement (the child process may
> end up running elsewhere), allocating things like page
> tables lazily may well end up being a performance win.

It should be easy enough to benchmark something like kernel compiles etc, 
which are reasonably fork-rich and should show a good mix for something 
like this. Or even just something like "time to restart an X session" after
you've brought it into memory once.

		Linus


* Re: process creation time increases linearly with shmem
  2005-08-26 18:41             ` Hugh Dickins
  2005-08-26 22:55               ` Linus Torvalds
@ 2005-08-26 23:10               ` Rik van Riel
  2005-08-26 23:23                 ` Linus Torvalds
  1 sibling, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2005-08-26 23:10 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, Nick Piggin, Ray Fucillo, linux-kernel

On Fri, 26 Aug 2005, Hugh Dickins wrote:

> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

When you consider NUMA placement (the child process may
end up running elsewhere), allocating things like page
tables lazily may well end up being a performance win.

-- 
All Rights Reversed


* Re: process creation time increases linearly with shmem
  2005-08-26 18:41             ` Hugh Dickins
@ 2005-08-26 22:55               ` Linus Torvalds
  2005-08-26 23:10               ` Rik van Riel
  1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2005-08-26 22:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, Rik van Riel, Ray Fucillo, linux-kernel



On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> Well, I still don't think we need to test vm_file.  We can add an
> anon_vma test if you like, if we really want to minimize the fork
> overhead, in favour of later faults.  Do we?

I think we might want to do it in -mm for testing. Because quite frankly, 
otherwise the new fork() logic won't get a lot of testing. Shared memory 
isn't that common.

		Linus


* Re: process creation time increases linearly with shmem
  2005-08-26 18:20                 ` Ross Biro
@ 2005-08-26 18:56                   ` Hugh Dickins
  0 siblings, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2005-08-26 18:56 UTC (permalink / raw)
  To: Ross Biro
  Cc: Rik van Riel, Ray Fucillo, Nick Piggin, Linus Torvalds, linux-kernel

On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Rik van Riel <riel@redhat.com> wrote:
> > 
> > Filling in all the page table entries at the first fault to
> > a VMA doesn't make much sense, IMHO.
> > 
> > I suspect we would be better off without that extra complexity,
> > unless there is a demonstrated benefit to it.
> 
> You are probably right, but do you want to put in a patch that might
> have a big performance impact in either direction with out verifying
> it?
> 
> My suggestion is safe, but most likely sub-optimal.  What everyone
> else is suggesting may be far better, but needs to be verified first.

It all has to be verified, and the problem will be that some things
fare well and others badly: how to reach a balanced decision?
Following your suggestion is no more safe than not following it.

> I'm suggesting that we change the code to do the same work fork would
> have done on the first page fault immediately, since it's easy to
> argue that it's not much worse than we have now and much better in
> many cases, and then try to experiment and figure out what the
> correct solution is.

We don't know what work fork would have done; that information was in
the ptes we decided not to bother to copy.  Perhaps every pte of the
vma was set, perhaps none, perhaps only one.

Also, doing it at fault time has significantly more work to do than
just zipping along the ptes incrementing page counts and clearing bits,
I think; but probably much less extra work than I originally imagined,
since Andrew gave us the gang lookup of the page cache.

All the same, I'm with Rik: one of the great virtues of the original
idea was its simplicity; I'd prefer not to add complexity.

Hugh


* Re: process creation time increases linearly with shmem
  2005-08-26 18:07           ` Linus Torvalds
@ 2005-08-26 18:41             ` Hugh Dickins
  2005-08-26 22:55               ` Linus Torvalds
  2005-08-26 23:10               ` Rik van Riel
  0 siblings, 2 replies; 36+ messages in thread
From: Hugh Dickins @ 2005-08-26 18:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, Rik van Riel, Ray Fucillo, linux-kernel

On Fri, 26 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> > 
> > I see some flaws in the various patches posted, including Rik's.
> > Here's another version - doing it inside copy_page_range, so this
> > kind of vma special-casing is over in mm/ rather than kernel/.
> 
> I like this approach better, but I don't understand your particular 
> choice of bits.
> 
> > +	 * Assume the fork will probably exec: don't waste time copying
> > +	 * ptes where a page fault will fill them correctly afterwards.
> > +	 */
> > +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> > +								== VM_MAYSHARE)
> > +		return 0;
> > +
> >  	if (is_vm_hugetlb_page(vma))
> >  		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> 
> First off, if you just did it below the hugetlb check, you'd not need to
> check hugetlb again.

Yes: I wanted to include VM_HUGETLB in the list as documentation really;
and it costs nothing to test it along with the other flags - or are there
architectures where the more bits you test, the costlier?

> And while I understand VM_NONLINEAR and VM_RESERVED,
> can you please comment on why VM_MAYSHARE is so important, and why no
> other information matters.

The VM_MAYSHARE one isn't terribly important; there's no correctness
reason to replace VM_SHARED there.  It's just that do_mmap_pgoff takes
VM_SHARED and VM_MAYWRITE off a MAP_SHARED mapping of a file which was
not opened for writing.  We can safely avoid copying the ptes of such a
vma, just as with the writable ones, but the VM_MAYSHARE test catches
them where the VM_SHARED test does not.

> Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair 
> enough. But afaik, a shared anonymous mapping absolutely needs its page 
> tables copied, because those page tables contains either the pointers to 
> the shared pages, or the swap entries.
> 
> So I really think you need to verify that it's a file mapping too.

Either I'm misunderstanding, or you're remembering back to how shared
anonymous was done in 2.2 (perhaps).  In 2.4 and 2.6, shared anonymous
is "backed" by a shared memory object, created by shmem_zero_setup,
which sets vm_file even though we came into do_mmap_pgoff with no file.

> Also, arguably, there are other cases that may or may not be worth 
> worrying about. What about non-shared non-writable file mappings? What 
> about private mappings that haven't been COW'ed? 

Non-shared non-currently-writable file mappings might have been writable
and modified in the past, so we cannot necessarily skip those.

We could, and I did, consider testing whether the vma has an anon_vma:
we always allocate a vma's anon_vma just before first allocating it a
private page, and it's a good test which swapoff uses to narrow its
search.

But partly I thought that a little too tricksy, and hard to explain;
and partly I thought it was liable to catch the executable text,
some of which is most likely to be needed in between fork and exec.

> So I think that in addition to your tests, you should test for
> "vma->vm_file", and you could toy with testing for "vma->anon_vma"  being
> NULL (the latter will cause a _lot_ of hits, because any read-only private
> mapping will trigger, but it's a good stress-test and conceptually
> interesting, even if I suspect it will kill any performance gain through
> extra minor faults in the child).

Ah yes, I wrote the paragraph above before reading this one, honest!

Well, I still don't think we need to test vm_file.  We can add an
anon_vma test if you like, if we really want to minimize the fork
overhead, in favour of later faults.  Do we?

Hugh


* Re: process creation time increases linearly with shmem
  2005-08-26 17:53               ` Rik van Riel
@ 2005-08-26 18:20                 ` Ross Biro
  2005-08-26 18:56                   ` Hugh Dickins
  0 siblings, 1 reply; 36+ messages in thread
From: Ross Biro @ 2005-08-26 18:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ray Fucillo, Nick Piggin, Hugh Dickins, Linus Torvalds, linux-kernel

On 8/26/05, Rik van Riel <riel@redhat.com> wrote:
> 
> Filling in all the page table entries at the first fault to
> a VMA doesn't make much sense, IMHO.
> 
> 
> I suspect we would be better off without that extra complexity,
> unless there is a demonstrated benefit to it.

You are probably right, but do you want to put in a patch that might
have a big performance impact in either direction without verifying
it?

My suggestion is safe, but most likely sub-optimal.  What everyone
else is suggesting may be far better, but needs to be verified first.

I'm suggesting that we change the code to do the same work fork would
have done on the first page fault immediately, since it's easy to
argue that it's not much worse than we have now and much better in
many cases, and then try to experiment and figure out what the
correct solution is.

    Ross


* Re: process creation time increases linearly with shmem
  2005-08-26 11:49         ` Hugh Dickins
  2005-08-26 14:26           ` Nick Piggin
       [not found]           ` <8783be660508260915524e2b1e@mail.gmail.com>
@ 2005-08-26 18:07           ` Linus Torvalds
  2005-08-26 18:41             ` Hugh Dickins
  2 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2005-08-26 18:07 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Nick Piggin, Rik van Riel, Ray Fucillo, linux-kernel



On Fri, 26 Aug 2005, Hugh Dickins wrote:
> 
> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.

I like this approach better, but I don't understand your particular 
choice of bits.

> +	 * Assume the fork will probably exec: don't waste time copying
> +	 * ptes where a page fault will fill them correctly afterwards.
> +	 */
> +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> +								== VM_MAYSHARE)
> +		return 0;
> +
>  	if (is_vm_hugetlb_page(vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

First off, if you just did it below the hugetlb check, you'd not need to
check hugetlb again. And while I understand VM_NONLINEAR and VM_RESERVED,
can you please comment on why VM_MAYSHARE is so important, and why no
other information matters.

Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair 
enough. But afaik, a shared anonymous mapping absolutely needs its page 
tables copied, because those page tables contains either the pointers to 
the shared pages, or the swap entries.

So I really think you need to verify that it's a file mapping too.

Also, arguably, there are other cases that may or may not be worth 
worrying about. What about non-shared non-writable file mappings? What 
about private mappings that haven't been COW'ed? 

So I think that in addition to your tests, you should test for
"vma->vm_file", and you could toy with testing for "vma->anon_vma"  being
NULL (the latter will cause a _lot_ of hits, because any read-only private
mapping will trigger, but it's a good stress-test and conceptually
interesting, even if I suspect it will kill any performance gain through
extra minor faults in the child).

			Linus


* Re: process creation time increases linearly with shmem
  2005-08-26 17:00             ` Ray Fucillo
@ 2005-08-26 17:53               ` Rik van Riel
  2005-08-26 18:20                 ` Ross Biro
  0 siblings, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2005-08-26 17:53 UTC (permalink / raw)
  To: Ray Fucillo; +Cc: Nick Piggin, Hugh Dickins, Linus Torvalds, linux-kernel

On Fri, 26 Aug 2005, Ray Fucillo wrote:

> However, there is still a need that the child, once successfully forked, is
> operational reasonably quickly.  I suspect that Ross's idea of paging in
> everything after the first fault would not be optimal for us, because we'd
> still be talking about hundreds of ms of work done before the child does
> anything useful. 

Simply skipping the page table setup of MAP_SHARED regions
should be enough to fix this issue.

> It would still be far better than the behavior we have today because 
> that time would no longer be synchronous with the fork().

Filling in all the page table entries at the first fault to
a VMA doesn't make much sense, IMHO.

The reason I think this is that people have experimented
with prefaulting already resident pages at page fault time,
and those experiments have never shown a conclusive benefit.

Now, if doing such prefaulting for normal processes does not
show a benefit - why would it be beneficial to recently forked
processes with a huge SHM area?

I suspect we would be better off without that extra complexity,
unless there is a demonstrated benefit to it.

-- 
All Rights Reversed


* Re: process creation time increases linearly with shmem
  2005-08-26 14:26           ` Nick Piggin
@ 2005-08-26 17:00             ` Ray Fucillo
  2005-08-26 17:53               ` Rik van Riel
  0 siblings, 1 reply; 36+ messages in thread
From: Ray Fucillo @ 2005-08-26 17:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Hugh Dickins, Linus Torvalds, Rik van Riel, linux-kernel

Nick Piggin wrote:
> OK let's see how Ray goes, and try it when 2.6.14 opens...

Working on that now - I'll let you know.

> Yeah, I guess that's a good idea. Patch looks pretty good.
> Just a minor issue with the comment: it is not strictly
> just assuming the child will exec... IMO it is worthwhile
> in Ray's case; even if his forked process _eventually_ ends
> up touching all the shared memory pages, it is better to
> avoid many ms of fork overhead.

Yes, in our database system the child will immediately touch some shmem 
pages, and may eventually touch most of them (and would almost never 
exec()).  Fork performance is critical in usage scenarios where an 
end-user database request forks a new server process from one master 
server process.

However, there is still a need that the child, once successfully forked, 
is operational reasonably quickly.  I suspect that Ross's idea of paging 
in everything after the first fault would not be optimal for us, because 
we'd still be talking about hundreds of ms of work done before the child 
does anything useful.  It would still be far better than the behavior we 
have today because that time would no longer be synchronous with the 
fork().  Of course, it sounds like our app might be able to make use of 
the hugetlb stuff, which can mitigate this problem in the future...


* Re: process creation time increases linearly with shmem
  2005-08-26 16:38             ` Hugh Dickins
@ 2005-08-26 16:43               ` Ross Biro
  0 siblings, 0 replies; 36+ messages in thread
From: Ross Biro @ 2005-08-26 16:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Nick Piggin, Rik van Riel, Ray Fucillo, linux-kernel

On 8/26/05, Hugh Dickins <hugh@veritas.com> wrote:
> On Fri, 26 Aug 2005, Ross Biro wrote:
> > On 8/26/05, Hugh Dickins <hugh@veritas.com> wrote:
> > >
> > > The refaulting will hurt the performance of something: let's
> > > just hope that something doesn't turn out to be a show-stopper.
> >
> > Why not just fault in all the pages on the first fault? Then the performance
> > loss is a single page fault (the page table copy that would have happened at
> > fork time now happens at fault time) and you get the big win for processes
> > that do fork/exec.
> 
> "all" might be very many more pages than were ever mapped in the parent,
> and not be a win.  Some faultahead might work better.  Might, might, ...

If you reduce "all" to whatever would have been done in fork
originally, then you've got a big win in some cases and a minimal
loss in others, and it's easy to argue you've got something better.

Now changing "all" to something even less might be an even bigger win,
but that requires a lot of benchmarking to justify.

    Ross


* Re: process creation time increases linearly with shmem
       [not found]           ` <8783be660508260915524e2b1e@mail.gmail.com>
@ 2005-08-26 16:38             ` Hugh Dickins
  2005-08-26 16:43               ` Ross Biro
  0 siblings, 1 reply; 36+ messages in thread
From: Hugh Dickins @ 2005-08-26 16:38 UTC (permalink / raw)
  To: Ross Biro
  Cc: Linus Torvalds, Nick Piggin, Rik van Riel, Ray Fucillo, linux-kernel

On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Hugh Dickins <hugh@veritas.com> wrote:
> > 
> > The refaulting will hurt the performance of something: let's
> > just hope that something doesn't turn out to be a show-stopper.
> 
> Why not just fault in all the pages on the first fault? Then the performance
> loss is a single page fault (the page table copy that would have happened at
> fork time now happens at fault time) and you get the big win for processes
> that do fork/exec.

"all" might be very many more pages than were ever mapped in the parent,
and not be a win.  Some faultahead might work better.  Might, might, ...

Hugh


* Re: process creation time increases linearly with shmem
  2005-08-26 11:49         ` Hugh Dickins
@ 2005-08-26 14:26           ` Nick Piggin
  2005-08-26 17:00             ` Ray Fucillo
       [not found]           ` <8783be660508260915524e2b1e@mail.gmail.com>
  2005-08-26 18:07           ` Linus Torvalds
  2 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2005-08-26 14:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linus Torvalds, Rik van Riel, Ray Fucillo, linux-kernel

Hugh Dickins wrote:
> On Thu, 25 Aug 2005, Linus Torvalds wrote:

>>That said, I think it's a valid optimization. Especially as the child 
>>_probably_ doesn't need it (ie there's at least some likelihood of an 
>>execve() or similar).
> 
> 
> I agree, seems a great idea to me (sulking because I was too dumb
> to get it, even when Nick and Andi first posted their patches).
> 
> It won't just save on the copying at fork time, it'll save on
> undoing it all again when the child mm is torn down for exec.
> 
> The refaulting will hurt the performance of something: let's
> just hope that something doesn't turn out to be a show-stopper.
> 

OK let's see how Ray goes, and try it when 2.6.14 opens...

> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.
> 

Yeah, I guess that's a good idea. Patch looks pretty good.
Just a minor issue with the comment: it is not strictly
just assuming the child will exec... IMO it is worthwhile
in Ray's case; even if his forked process _eventually_ ends
up touching all the shared memory pages, it is better to
avoid many ms of fork overhead.

Also, on NUMA systems this will help get page tables allocated
on the right nodes, which is not an insignificant problem for
big HPC jobs.

-- 
SUSE Labs, Novell Inc.


* Re: process creation time increases linearly with shmem
  2005-08-26  3:56       ` Linus Torvalds
@ 2005-08-26 11:49         ` Hugh Dickins
  2005-08-26 14:26           ` Nick Piggin
                             ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Hugh Dickins @ 2005-08-26 11:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, Rik van Riel, Ray Fucillo, linux-kernel

On Thu, 25 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Nick Piggin wrote:
> > 
> > > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> > 
> > Indeed. Linus, can you remember why we haven't done this before?
> 
> Hmm. Historical reasons. Also, if the child ends up needing it, it will 
> now have to fault them in.
> 
> That said, I think it's a valid optimization. Especially as the child 
> _probably_ doesn't need it (ie there's at least some likelihood of an 
> execve() or similar).

I agree, seems a great idea to me (sulking because I was too dumb
to get it, even when Nick and Andi first posted their patches).

It won't just save on the copying at fork time, it'll save on
undoing it all again when the child mm is torn down for exec.

The refaulting will hurt the performance of something: let's
just hope that something doesn't turn out to be a show-stopper.

I see some flaws in the various patches posted, including Rik's.
Here's another version - doing it inside copy_page_range, so this
kind of vma special-casing is over in mm/ rather than kernel/.

No point in testing vm_file; the vm_flags cover the cases.
Test VM_MAYSHARE rather than VM_SHARED to include the never-can-be-
written MAP_SHARED cases too.  Must exclude VM_NONLINEAR: their ptes
are essential for defining the file offsets.  Must exclude VM_RESERVED:
faults on remap_pfn_range areas would usually put in anon zeroed pages
instead of the driver pages - or perhaps would be better as a test
against VM_IO, or vma->vm_ops->nopage?

Having to exclude the VM_NONLINEAR seems rather a shame, since those
are always shared and likely enormous.  The InfiniBand people's idea 
of a way for the app to set VM_DONTCOPY (to avoid rdma get_user_pages
problems) becomes attractive as a way for apps to speed their forks.

Hugh

--- 2.6.13-rc7/mm/memory.c	2005-08-24 11:13:41.000000000 +0100
+++ linux/mm/memory.c	2005-08-26 10:09:50.000000000 +0100
@@ -498,6 +498,14 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Assume the fork will probably exec: don't waste time copying
+	 * ptes where a page fault will fill them correctly afterwards.
+	 */
+	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
+								== VM_MAYSHARE)
+		return 0;
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 

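As an aside, the app-set-VM_DONTCOPY idea mentioned above can be
sketched from userspace with madvise(MADV_DONTFORK), the interface that
later appeared for exactly this use; it does not exist in kernels of
this era, so the sketch below is illustrative only:

#include <stdio.h>
#include <sys/mman.h>

/* Sketch: mark a mapping so fork() skips it entirely.  The child
 * will not inherit the mapping at all (MADV_DONTFORK's semantic),
 * so fork spends no time on its ptes.  Size is a placeholder. */
int main(void)
{
	size_t len = 64UL << 20;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(p, len, MADV_DONTFORK) != 0)
		perror("madvise(MADV_DONTFORK)");
	return 0;
}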

* Re: process creation time increases linearly with shmem
  2005-08-26  1:26     ` Nick Piggin
  2005-08-26  1:50       ` Rik van Riel
@ 2005-08-26  3:56       ` Linus Torvalds
  2005-08-26 11:49         ` Hugh Dickins
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2005-08-26  3:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rik van Riel, Ray Fucillo, linux-kernel, Hugh Dickins



On Fri, 26 Aug 2005, Nick Piggin wrote:
> 
> > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> > 
> 
> Indeed. Linus, can you remember why we haven't done this before?

Hmm. Historical reasons. Also, if the child ends up needing it, it will 
now have to fault them in.

That said, I think it's a valid optimization. Especially as the child 
_probably_ doesn't need it (ie there's at least some likelihood of an 
execve() or similar).

		Linus


* Re: process creation time increases linearly with shmem
  2005-08-26  1:26     ` Nick Piggin
@ 2005-08-26  1:50       ` Rik van Riel
  2005-08-26  3:56       ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2005-08-26  1:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ray Fucillo, linux-kernel, Hugh Dickins,
	Douglas Shakshober

On Fri, 26 Aug 2005, Nick Piggin wrote:

> > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> 
> Indeed. Linus, can you remember why we haven't done this before?

Where "this" looks something like the patch below, shamelessly
merging Nick's and Andi's patches and adding the initialization
of retval.

I suspect this may be a measurable win on database servers with
a web frontend, where the connections to the database server are
set up basically for each individual query, and don't stick around
for a long time.

No, I haven't actually tested this patch - but feel free to go
wild while I sign off for the night.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- linux-2.6.12/kernel/fork.c.mapshared	2005-08-25 18:40:44.000000000 -0400
+++ linux-2.6.12/kernel/fork.c	2005-08-25 18:47:16.000000000 -0400
@@ -184,7 +184,7 @@
 {
 	struct vm_area_struct * mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
-	int retval;
+	int retval = 0;
 	unsigned long charge;
 	struct mempolicy *pol;
 
@@ -265,7 +265,10 @@
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
+		/* Skip pte copying if page faults can take care of things. */
+		if (!file || !(tmp->vm_flags & VM_SHARED) ||
 +						is_vm_hugetlb_page(tmp))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)


* Re: process creation time increases linearly with shmem
  2005-08-25 17:31   ` Rik van Riel
@ 2005-08-26  1:26     ` Nick Piggin
  2005-08-26  1:50       ` Rik van Riel
  2005-08-26  3:56       ` Linus Torvalds
  0 siblings, 2 replies; 36+ messages in thread
From: Nick Piggin @ 2005-08-26  1:26 UTC (permalink / raw)
  To: Rik van Riel, Linus Torvalds; +Cc: Ray Fucillo, linux-kernel, Hugh Dickins

Rik van Riel wrote:
> On Thu, 25 Aug 2005, Nick Piggin wrote:
> 
> 
>>fork() can be changed so as not to set up page tables for
>>MAP_SHARED mappings. I think that has other tradeoffs like
>>initially causing several unavoidable faults reading
>>libraries and program text.
> 
> 
> Actually, libraries and program text are usually mapped
> MAP_PRIVATE, so those would still be copied.
> 

Yep, that seems to be the case here.

> Skipping MAP_SHARED in fork() sounds like a good idea to me...
> 

Indeed. Linus, can you remember why we haven't done this before?

-- 
SUSE Labs, Novell Inc.


* Re: process creation time increases linearly with shmem
  2005-08-25  0:14 ` Nick Piggin
  2005-08-25 13:07   ` Ray Fucillo
@ 2005-08-25 17:31   ` Rik van Riel
  2005-08-26  1:26     ` Nick Piggin
  1 sibling, 1 reply; 36+ messages in thread
From: Rik van Riel @ 2005-08-25 17:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ray Fucillo, linux-kernel

On Thu, 25 Aug 2005, Nick Piggin wrote:

> fork() can be changed so as not to set up page tables for
> MAP_SHARED mappings. I think that has other tradeoffs like
> initially causing several unavoidable faults reading
> libraries and program text.

Actually, libraries and program text are usually mapped
MAP_PRIVATE, so those would still be copied.

Skipping MAP_SHARED in fork() sounds like a good idea to me...

-- 
All Rights Reversed


* Re: process creation time increases linearly with shmem
  2005-08-25 14:47   ` Parag Warudkar
@ 2005-08-25 15:56     ` Andi Kleen
  0 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2005-08-25 15:56 UTC (permalink / raw)
  To: Parag Warudkar; +Cc: Ray Fucillo, linux-kernel

On Thursday 25 August 2005 16:47, Parag Warudkar wrote:

> Exactly - one problem is that this forces all of the hugetlb users to go
> the lazy faulting way. 
Actually I disabled it for hugetlbfs (... !is_huge...vma). The reason 
is that lazy faulting for huge pages is still not in mainline.

-Andi


* Re: process creation time increases linearly with shmem
  2005-08-25 14:22 ` Andi Kleen
  2005-08-25 14:35   ` Nick Piggin
@ 2005-08-25 14:47   ` Parag Warudkar
  2005-08-25 15:56     ` Andi Kleen
  1 sibling, 1 reply; 36+ messages in thread
From: Parag Warudkar @ 2005-08-25 14:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ray Fucillo, linux-kernel

On Thu, 2005-08-25 at 16:22 +0200, Andi Kleen wrote:
> But I'm not sure it's a good idea in all cases. Would need a lot of 
> benchmarking, at least.
> 
> -Andi
> 

Exactly - one problem is that this forces all of the hugetlb users to go
the lazy faulting way. This is more or less similar to the original
problem, where fork() forces everything to be mapped and some apps don't
like it. In the same way, some apps may not want hugetlb pages to be all
pre-mapped.

That's why I was alluding to having the user specify MAP_SHARED|
MAP_LAZY or something to that effect and then have fork() honor it. So
people who want all things pre-mapped will not specify MAP_LAZY, just
MAP_SHARED. 

Now I don't even know if the above is possible and workable for all
scenarios but that's why I was asking.. :)

Parag



* Re: process creation time increases linearly with shmem
  2005-08-25 14:22 ` Andi Kleen
@ 2005-08-25 14:35   ` Nick Piggin
  2005-08-25 14:47   ` Parag Warudkar
  1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2005-08-25 14:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Parag Warudkar, Ray Fucillo, linux-kernel

Andi Kleen wrote:
>>Would it be worth trying to do something like this?
> 
> 
> Maybe. Shouldn't be very hard though - you just need to check if the VMA is 
> backed by an object and if yes don't call copy_page_range for it.
> 
> I think it just needs (untested) 
> 

I think you need to check for MAP_SHARED as well, because a
MAP_PRIVATE mapping of a file could be modified in the parent.

See patch I posted just now.

Also, do you need any special case for hugetlb?


> Index: linux-2.6.13-rc5-misc/kernel/fork.c
> ===================================================================
> --- linux-2.6.13-rc5-misc.orig/kernel/fork.c
> +++ linux-2.6.13-rc5-misc/kernel/fork.c
> @@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, current->mm, tmp);
> +		if (!file && !is_vm_hugetlb_page(tmp))
> +			retval = copy_page_range(mm, current->mm, tmp);
>  		spin_unlock(&mm->page_table_lock);
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
> 
> But I'm not sure it's a good idea in all cases. Would need a lot of 
> benchmarking, at least.
> 

Yep. I'm sure it must have come up in the past, and Linus
must have said something about best-for-most.

-- 
SUSE Labs, Novell Inc.


* Re: process creation time increases linearly with shmem
  2005-08-25 13:07   ` Ray Fucillo
  2005-08-25 13:13     ` Andi Kleen
@ 2005-08-25 14:28     ` Nick Piggin
  1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2005-08-25 14:28 UTC (permalink / raw)
  To: Ray Fucillo; +Cc: linux-kernel


Ray Fucillo wrote:
> Nick Piggin wrote:
> 
>> fork() can be changed so as not to set up page tables for
>> MAP_SHARED mappings. I think that has other tradeoffs like
>> initially causing several unavoidable faults reading
>> libraries and program text.
>>
>> What kind of application are you using?
> 
> 
> The application is a database system called Caché.  We allocate a large 
> shared memory segment for database cache, which in a large production 
> environment may realistically be 1+GB on 32-bit platforms and much 
> larger on 64-bit.  At these sizes fork() is taking hundreds of 
> miliseconds, which can become a noticeable bottleneck for us.  This 
> performance characteristic seems to be unique to Linux vs other Unix 
> implementations.
> 
> 

As Andi said, hugepages might be a very nice feature for you guys
to look into and might potentially give a performance increase with
reduced TLB pressure, beyond just your immediate fork problem.

Anyway, the attached patch is something you could try testing. If
you do so, then I would be very interested to see performance results.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: vm-dontcopy-shared.patch --]

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2005-08-04 15:24:36.000000000 +1000
+++ linux-2.6/kernel/fork.c	2005-08-26 00:20:50.000000000 +1000
@@ -256,7 +256,6 @@ static inline int dup_mmap(struct mm_str
 		 * Note that, exceptionally, here the vma is inserted
 		 * without holding mm->mmap_sem.
 		 */
-		spin_lock(&mm->page_table_lock);
 		*pprev = tmp;
 		pprev = &tmp->vm_next;
 
@@ -265,8 +264,11 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
-		spin_unlock(&mm->page_table_lock);
+		if (!(file && (tmp->vm_flags & VM_SHARED))) {
+			spin_lock(&mm->page_table_lock);
+			retval = copy_page_range(mm, current->mm, tmp);
+			spin_unlock(&mm->page_table_lock);
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);


* Re: process creation time increases linearly with shmem
  2005-08-25 14:05 Parag Warudkar
@ 2005-08-25 14:22 ` Andi Kleen
  2005-08-25 14:35   ` Nick Piggin
  2005-08-25 14:47   ` Parag Warudkar
  0 siblings, 2 replies; 36+ messages in thread
From: Andi Kleen @ 2005-08-25 14:22 UTC (permalink / raw)
  To: Parag Warudkar; +Cc: Ray Fucillo, linux-kernel


> Would it be worth trying to do something like this?

Maybe. Shouldn't be very hard though - you just need to check if the VMA is 
backed by an object and, if so, not call copy_page_range for it.

I think it just needs (untested) 

Index: linux-2.6.13-rc5-misc/kernel/fork.c
===================================================================
--- linux-2.6.13-rc5-misc.orig/kernel/fork.c
+++ linux-2.6.13-rc5-misc/kernel/fork.c
@@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
 +		if (!file && !is_vm_hugetlb_page(tmp))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)

But I'm not sure it's a good idea in all cases. Would need a lot of 
benchmarking, at least.

-Andi


* Re: process creation time increases linearly with shmem
@ 2005-08-25 14:05 Parag Warudkar
  2005-08-25 14:22 ` Andi Kleen
  0 siblings, 1 reply; 36+ messages in thread
From: Parag Warudkar @ 2005-08-25 14:05 UTC (permalink / raw)
  To: Andi Kleen, Ray Fucillo; +Cc: linux-kernel


> Ray Fucillo <fucillo@intersystems.com> writes:
> > 
> > The application is a database system called Caché.  We allocate a
> > large shared memory segment for database cache, which in a large
> > production environment may realistically be 1+GB on 32-bit platforms
> > and much larger on 64-bit.  At these sizes fork() is taking hundreds
> > of milliseconds, which can become a noticeable bottleneck for us.  This
> > performance characteristic seems to be unique to Linux vs other Unix
> > implementations.
> 
> You could set up hugetlbfs and use large pages for the SHM (with SHM_HUGETLB);
> then the overhead of walking the pages of it at fork would be much lower.
> 
> -Andi
> -

Why isn't the page walk for the shared memory done lazily though? It is better in that applications most likely do not want to page in all of the shared memory at once. Program logic/requirements should dictate this instead of fork making it compulsory. I think this is because we don't distinguish between shared libraries, program text and explicitly shared memory as the above application does - everything is MAP_SHARED.

As someone mentioned this causes unavoidable faults for reading in shared libraries and program text. But if there was a MAP_SHARED|MAP_LAZY - can fork() then be set up not to set up page tables for such mappings and still continue to map the MAP_SHARED ones so program text and libraries don't cause faults? Applications can then specify MAP_SHARED|MAP_LAZY and not incur the overhead of the page table walk for the shared memory all at once.

Would it be worth trying to do something like this?

Parag





* Re: process creation time increases linearly with shmem
  2005-08-25 13:07   ` Ray Fucillo
@ 2005-08-25 13:13     ` Andi Kleen
  2005-08-25 14:28     ` Nick Piggin
  1 sibling, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2005-08-25 13:13 UTC (permalink / raw)
  To: Ray Fucillo; +Cc: linux-kernel

Ray Fucillo <fucillo@intersystems.com> writes:
> 
> The application is a database system called Caché.  We allocate a
> large shared memory segment for database cache, which in a large
> production environment may realistically be 1+GB on 32-bit platforms
> and much larger on 64-bit.  At these sizes fork() is taking hundreds
of milliseconds, which can become a noticeable bottleneck for us.  This
> performance characteristic seems to be unique to Linux vs other Unix
> implementations.

You could set up hugetlbfs and use large pages for the SHM (with SHM_HUGETLB);
then the overhead of walking the pages of it at fork would be much lower.
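
As a minimal sketch of that, assuming hugetlbfs is mounted and huge pages
have been reserved (e.g. via /proc/sys/vm/nr_hugepages), with error
handling omitted; the segment size must be a multiple of the huge page size:

    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* SHM_HUGETLB backs the segment with huge pages, so fork() has far
       fewer PTEs to walk for it.  Depending on your libc headers you
       may need <linux/shm.h> for the SHM_HUGETLB definition. */
    int id = shmget(IPC_PRIVATE, 1024UL * 1024 * 1024,
                    IPC_CREAT | SHM_HUGETLB | SHM_R | SHM_W);
    char *p = shmat(id, NULL, 0);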

-Andi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: process creation time increases linearly with shmem
  2005-08-25  0:14 ` Nick Piggin
@ 2005-08-25 13:07   ` Ray Fucillo
  2005-08-25 13:13     ` Andi Kleen
  2005-08-25 14:28     ` Nick Piggin
  2005-08-25 17:31   ` Rik van Riel
  1 sibling, 2 replies; 36+ messages in thread
From: Ray Fucillo @ 2005-08-25 13:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel

Nick Piggin wrote:
> fork() can be changed so as not to set up page tables for
> MAP_SHARED mappings. I think that has other tradeoffs like
> initially causing several unavoidable faults reading
> libraries and program text.
> 
> What kind of application are you using?

The application is a database system called Caché.  We allocate a large 
shared memory segment for database cache, which in a large production 
environment may realistically be 1+GB on 32-bit platforms and much 
larger on 64-bit.  At these sizes fork() is taking hundreds of 
milliseconds, which can become a noticeable bottleneck for us.  This 
performance characteristic seems to be unique to Linux vs other Unix 
implementations.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: process creation time increases linearly with shmem
  2005-08-24 18:43 Ray Fucillo
@ 2005-08-25  0:14 ` Nick Piggin
  2005-08-25 13:07   ` Ray Fucillo
  2005-08-25 17:31   ` Rik van Riel
  0 siblings, 2 replies; 36+ messages in thread
From: Nick Piggin @ 2005-08-25  0:14 UTC (permalink / raw)
  To: Ray Fucillo; +Cc: linux-kernel

Ray Fucillo wrote:
> I am seeing process creation time increase linearly with the size of the 
> shared memory segment that the parent touches.  The attached forktest.c 
> is a very simple user program that illustrates this behavior, which I 
> have tested on various kernel versions from 2.4 through 2.6.  Is this a 
> known issue, and is it solvable?
> 

fork() can be changed so as not to set up page tables for
MAP_SHARED mappings. I think that has other tradeoffs like
initially causing several unavoidable faults reading
libraries and program text.

What kind of application are you using?

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* process creation time increases linearly with shmem
@ 2005-08-24 18:43 Ray Fucillo
  2005-08-25  0:14 ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Ray Fucillo @ 2005-08-24 18:43 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 334 bytes --]

I am seeing process creation time increase linearly with the size of the 
shared memory segment that the parent touches.  The attached forktest.c 
is a very simple user program that illustrates this behavior, which I 
have tested on various kernel versions from 2.4 through 2.6.  Is this a 
known issue, and is it solvable?

TIA,
Ray

[-- Attachment #2: forktest.c --]
[-- Type: text/plain, Size: 3803 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/time.h>
#include <errno.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <signal.h>

#define MAXJOBS 50
#define MAXMALLOC 1024

#define USESIGCHLDHND
/* USESIGCHLDHND feature code changes how the parent waits
   for the children.  When this feature code is on we define
   a signal handler for SIGCHLD and call waitpid to clean up
   the child process.  If this feature code is off, we wait
   until all children are forked and then loop through the 
   array of child pids and call waitpid() on each.  The
   purpose of this feature code was to see if there is any
   difference in timing based on cleaning up zombies faster.
   Tests have shown no appreciable difference.  */

/* Return a floating point number of seconds since the start
   time in the timeval structure pointed to by starttv */
float elapsedtime(struct timeval *starttv) {
	struct timeval currenttime;
	gettimeofday(&currenttime,NULL);
	return ((currenttime.tv_sec - starttv->tv_sec) +
	       ((float)(currenttime.tv_usec - starttv->tv_usec)/1000000));
}

#ifdef USESIGCHLDHND
int childexitcnt = 0;
void sigchldhnd(int signum) {
	/* count only reaped children: -1 (error) and 0 (none exited) must not count */
	if (waitpid(-1,NULL,WNOHANG) > 0) ++childexitcnt;
	return;
}
#endif

int main(void) {
	pid_t childpid[MAXJOBS];
	int x,i;
	int childcnt = 0;
	float endfork, endwait;
	struct shmid_ds myshmid_ds;
	unsigned int mb;
	int myshmid;
	key_t mykey = 0xf00df00d;
	char *mymem = 0;
	struct timeval starttime;
#ifdef USESIGCHLDHND
	struct sigaction sa;
	sa.sa_handler = sigchldhnd;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	if (sigaction(SIGCHLD, &sa, NULL) == -1) {
	   printf("sigaction() failed, errno %d - exiting\n",errno);
	   exit(1);
	}
#endif
        printf("\nNumber of jobs to fork (max %d):  ",MAXJOBS);
        scanf("%d",&x);
        if ((x < 1) || (x > MAXJOBS)) {
           printf("\ninvalid input - exiting\n");
           exit(1);
        }
        printf("\nNumber of MB to allocate (0-%d):  ",MAXMALLOC);
        scanf("%d",&mb);
        if (mb > MAXMALLOC) {
           printf("\ninvalid input - exiting\n");
           exit(1);
        }
	/* allocate and initialize shared memory if number
	   of MB is not zero */
	if (mb) {
	   myshmid = shmget(mykey,mb*1024*1024,IPC_CREAT|0777);
	   if (myshmid == -1) {
	      printf("\nshmget() failed, errno %d. - exiting\n",errno);
	      exit(1);
	   }
	   mymem = (char *) shmat(myshmid,0,0);
	   if (mymem == (char *) -1) {
	      printf("\nshmat() failed, errno %d. - exiting\n",errno);
	      exit(1);
	   }
	   if (shmctl(myshmid,IPC_STAT,&myshmid_ds)) {
	      printf("\nshmctl() failed, errno %d. - exiting\n",errno);
	      exit(1);
	   }
	   /* write a pattern in the new shmem segment*/
	   for (i=0; i < (mb*1024*1024); i+=32) mymem[i]='R';
	}	
	printf("\nStarting %d jobs.  time:0.0", x);
	fflush(stdout);
	gettimeofday(&starttime,NULL); 
	for (i=0; i<x; i++) {
	   childpid[i] = fork();
	   if (!childpid[i]) {
	      /* child process */
	      printf("\n - Child %d         time:%f",i,elapsedtime(&starttime));
	      exit(1);
	   } else if (childpid[i] == -1) {
	      /* failure */
	      printf("\nfork failed, errno = %d");
	   } else childcnt++;
	}
	endfork = elapsedtime(&starttime);
#ifndef USESIGCHLDHND
	for (i=0; i<x; i++) waitpid(childpid[i],0,0);
#else
	while (childexitcnt < childcnt) {
	   if (waitpid(-1,NULL,0) > 0) ++childexitcnt;
	   else if (errno == ECHILD) break; /* handler already reaped the rest */
	}
#endif	   
	endwait = elapsedtime(&starttime);
	printf("\nTime to fork all processes in seconds: %f", endfork);
        printf("\nTime for all processes to complete: %f\n", endwait);

	/* kill shmem segment */
	if ((mb) && (shmctl(myshmid,IPC_RMID,&myshmid_ds))) {
	   printf("\nshmctl() failed, errno %d. - exiting\n",errno);
	   exit(1);
	}
	return 0;
}



^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2005-12-14 16:22 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-14 14:07 process creation time increases linearly with shmem Brice Oliver
2005-12-14 16:21 ` Hugh Dickins
  -- strict thread matches above, loose matches on Subject: below --
2005-08-25 14:05 Parag Warudkar
2005-08-25 14:22 ` Andi Kleen
2005-08-25 14:35   ` Nick Piggin
2005-08-25 14:47   ` Parag Warudkar
2005-08-25 15:56     ` Andi Kleen
2005-08-24 18:43 Ray Fucillo
2005-08-25  0:14 ` Nick Piggin
2005-08-25 13:07   ` Ray Fucillo
2005-08-25 13:13     ` Andi Kleen
2005-08-25 14:28     ` Nick Piggin
2005-08-25 17:31   ` Rik van Riel
2005-08-26  1:26     ` Nick Piggin
2005-08-26  1:50       ` Rik van Riel
2005-08-26  3:56       ` Linus Torvalds
2005-08-26 11:49         ` Hugh Dickins
2005-08-26 14:26           ` Nick Piggin
2005-08-26 17:00             ` Ray Fucillo
2005-08-26 17:53               ` Rik van Riel
2005-08-26 18:20                 ` Ross Biro
2005-08-26 18:56                   ` Hugh Dickins
     [not found]           ` <8783be660508260915524e2b1e@mail.gmail.com>
2005-08-26 16:38             ` Hugh Dickins
2005-08-26 16:43               ` Ross Biro
2005-08-26 18:07           ` Linus Torvalds
2005-08-26 18:41             ` Hugh Dickins
2005-08-26 22:55               ` Linus Torvalds
2005-08-26 23:10               ` Rik van Riel
2005-08-26 23:23                 ` Linus Torvalds
2005-08-27 15:05                   ` Nick Piggin
2005-08-28  4:26                     ` Hugh Dickins
2005-08-28  6:49                       ` Nick Piggin
2005-08-29 23:33                         ` Ray Fucillo
2005-08-30  0:29                           ` Nick Piggin
2005-08-30  1:03                             ` Linus Torvalds
2005-08-30  0:34                           ` Linus Torvalds
