* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Timothy Miller @ 2003-04-14 13:31 UTC
  To: linux-kernel; +Cc: nicoya

Tony 'Nicoya' Mantler (nicoya@apia.dhs.org) wrote:

> Perhaps the same effect could be obtained by preferentially scheduling
> processes to execute on the "node" (a node being a single cpu in an SMP
> system, or an HT virtual CPU pair, or a NUMA node) that they were last
> running on.
>
> I think the ideal semantics would probably be something along the lines
> of:
>
>  - a newly fork()ed thread executes on the same node as the creating
>    thread
>  - calling exec() sets a "feel free to shuffle me elsewhere" flag
>  - threads are otherwise only shuffled to other nodes when a certain
>    load ratio is exceeded (current-node:idle-node)


This sounds like the most sensible approach.  I like considering the
extremes of performance, but sometimes the computation an optimization
requires can cost more than any benefit you get out of it.  Your
suggestion is simple.  It increases the likelihood of related processes
being run on the same node (a likely 10% gain for little extra effort
beats a possible 10% loss), while not impacting the system's ability to
balance load.  This, as you say, is also very important for NUMA.
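
As a rough sketch, the semantics above amount to something like this
(the struct, the helpers, and the threshold are all hypothetical
illustration, nothing like the real scheduler):

#define LOAD_RATIO 2	/* hypothetical current-node:idle-node trigger */

struct task {
	int node;		/* node this task last ran on */
	int may_shuffle;	/* set by exec(): cheap to move now */
};

int node_load(int node);		/* hypothetical helpers */
int find_least_loaded_node(void);

void on_fork(struct task *parent, struct task *child)
{
	child->node = parent->node;	/* run where the creator ran */
	child->may_shuffle = 0;
}

void on_exec(struct task *t)
{
	t->may_shuffle = 1;	/* the old address space is gone anyway */
}

int pick_node(struct task *t)
{
	int idle = find_least_loaded_node();

	if (t->may_shuffle ||
	    node_load(t->node) > LOAD_RATIO * node_load(idle))
		return idle;	/* shuffle elsewhere */
	return t->node;		/* otherwise stay put */
}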



Does the NUMA support migrate pages to the node on which a process is
running?  Or do processes jump nodes often enough to make that not worth
the effort?



In order for page migration to be worth it, node affinity would have to be
fairly strong.  It's particularly important when a process maps pages which
belong to another node.  Is there any logic there to duplicate pages in
cases where there is enough free memory for it?  We'd have to tag the pages
as duplicates so the VM could reclaim them.







* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Martin J. Bligh @ 2003-04-14 14:55 UTC
  To: Timothy Miller, linux-kernel; +Cc: nicoya

> This sounds like the most sensible approach.  I like considering the
> extremes of performance, but sometimes the computation an optimization
> requires can cost more than any benefit you get out of it.  Your
> suggestion is simple.  It increases the likelihood of related processes
> being run on the same node (a likely 10% gain for little extra effort
> beats a possible 10% loss), while not impacting the system's ability to
> balance load.  This, as you say, is also very important for NUMA.

See my earlier email - rebalance_node() does this, and it's very cheap, as 
we just SMP balance *within* the node -  the cross node rebalancer is a
separate tunable background process.

> Does the NUMA support migrate pages to the node on which a process is
> running?  Or do processes jump nodes often enough to make that not worth
> the effort?

No, we don't do page migration as yet. Andi is playing with a homenode 
concept that makes pages allocate from a predefined "home node" always, 
instead of their current node. Last time I benchmarked that concept it 
sucked, but the advent of the per-cpu, per-zone hot/cold page cache, and 
the fact that he's using hardware with totally different NUMA characteristics 
may well change that conclusion.
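
As a sketch, the policy amounts to roughly this (->home_node is a
stand-in for whatever Andi's patch actually does, and this is not his
code):

/* hedged sketch: always allocate from the task's fixed home node,
 * not from the node it happens to be running on */
static struct page *homenode_alloc(struct task_struct *tsk,
				   unsigned int gfp_mask)
{
	return alloc_pages_node(tsk->home_node, gfp_mask, 0);
}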

We don't normally migrate stuff around much on the higher-ratio NUMA 
machines. With AMD Hammer or whatever, that may change.

> In order for page migration to be worth it, node affinity would have to be
> fairly strong.  It's particularly important when a process maps pages which
> belong to another node.  Is there any logic there to duplicate pages in
> cases where there is enough free memory for it?  We'd have to tag the pages
> as duplicates so the VM could reclaim them.

Right - we're looking at read-only text replication, first for the kernel
(which ia64 has already), then for shared libs and program text. It's a
good concept, provided you have plenty of RAM (which big NUMA boxes tend
to). It probably needs hooking into the address space structure, and the
copies should be thrown away under memory pressure, just like anything
else that's unused, from the per-node LRU lists. Though it'd be nice to
mark them as particularly cheap to retrieve, give them a reference count
(a node bitmap?), and retrieve them from another node, not from disk.

M.



* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Antonio Vargas @ 2003-04-14 15:29 UTC
  To: Martin J. Bligh; +Cc: Timothy Miller, linux-kernel, nicoya

On Mon, Apr 14, 2003 at 07:55:37AM -0700, Martin J. Bligh wrote:

[ snip scheduler and page-migration discussion ]

> Right - we're looking at read-only text replication, first for the kernel
> (which ia64 has already), then for shared libs and program text. It's a
> good concept, provided you have plenty of RAM (which big NUMA boxes tend
> to). It probably needs hooking into the address space structure, and the
> copies should be thrown away under memory pressure, just like anything
> else that's unused, from the per-node LRU lists. Though it'd be nice to
> mark them as particularly cheap to retrieve, give them a reference count
> (a node bitmap?), and retrieve them from another node, not from disk.

Perhaps it would be good to un-COW pages:

1. fork process
2. if the current node is not loaded, continue as usual
3. if the current node is loaded:
3a. pick an unloaded node
3b. don't do COW for data pages, but simply copy them to node-local memory

This way, read-write mappings would be replicated on each node.
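
A hedged sketch of steps 2-3b (every name below is invented glue, not
the real fork path):

struct page;

int node_is_loaded(int node);			/* hypothetical */
int pick_unloaded_node(void);			/* hypothetical */
void mark_cow(struct page *pg);			/* hypothetical */
struct page *copy_page_to_node(struct page *src, int node);

void fork_setup_data_pages(struct page **pages, int npages, int parent_node)
{
	int i, node;

	if (!node_is_loaded(parent_node)) {
		for (i = 0; i < npages; i++)
			mark_cow(pages[i]);	/* step 2: usual lazy COW */
		return;
	}

	node = pick_unloaded_node();		/* step 3a */
	for (i = 0; i < npages; i++)		/* step 3b: eager copies */
		pages[i] = copy_page_to_node(pages[i], node);
}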

Also, it could be good to keep a per-node active-page list and then
forcibly copy a page to a node-local page frame when accessing a page
that is active on another node.

Hmm, the un-COW system could be implemented in terms of the second one,
couldn't it?

Greets, Antonio.


* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Martin J. Bligh @ 2003-04-14 15:39 UTC
  To: Antonio Vargas; +Cc: Timothy Miller, linux-kernel, nicoya

> Perhaps it would be good to un-COW pages:
> 
> 1. fork process
> 2. if the current node is not loaded, continue as usual
> 3. if the current node is loaded:
> 3a. pick an unloaded node
> 3b. don't do COW for data pages, but simply copy them to node-local memory
> 
> This way, read-write mappings would be replicated on each node.

Sharing read-write stuff is a total nightmare - you have to deal with
all the sync stuff, and invalidation. In real-life scenarios, I really
doubt the complexity is worth it - read-only is quite complex enough,
thanks ;-) 

Theoretically, if you had some pages that were predominantly read-only, 
and very occasionally got written to, it *might* be worth it. 
But probably not ;-)

> Also, it could be good to keep a per-node active-page list and then
> forcibly copy a page to a node-local page frame when accessing a page
> that is active on another node.

Not sure what you mean by this. wrt the active-page list there's a per-node
LRU already. Or you mean something on a per-address-space basis?

Yes, faulting the pages in lazily from another node as we touch them is
probably the right thing to do. Giving secondary copies some LRU
disadvantage (perhaps always keeping them on the inactive list, never the
active) would be fun, but then you get into the whole "who is the primary
owner, and what do we do when they ditch the page" complexity. The node
bitmap I suggested earlier might help. But I'd rather keep it simple at
first ;-)

M.


* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Antonio Vargas @ 2003-04-14 15:57 UTC
  To: Martin J. Bligh; +Cc: Antonio Vargas, Timothy Miller, linux-kernel, nicoya

On Mon, Apr 14, 2003 at 08:39:05AM -0700, Martin J. Bligh wrote:
> > Perhaps it would be good to un-COW pages:
> > 
> > 1. fork process
> > 2. if current node is not loaded, continue as usual
> > 3. if current node is loaded:
> > 3a. pick unloaded node
> > 4b. don't do COW for data pages, but simply copy them to node-local memory
> > 
> > This way, read-write sharings would be replicated for each node.
> 
> Sharing read-write stuff is a total nightmare - you have to deal with
> all the sync stuff, and invalidation. In real-life scenarios, I really
> doubt the complexity is worth it - read-only is quite complex enough,
> thanks ;-) 

I mean MAP_PRIVATE stuff, not MAP_SHARED.

> Theoretically, if you had some pages that were predominatly read-only, 
> and very occasionally got written to, it *might* be worth it. 
> But probably not ;-)
> 
> > Also, keeping an per-node active-page-list and then forcefully copying
> > the page to a node-local page-frame when accesing a page which is
> > active on another node could be good.
> 
> Not sure what you mean by this. wrt the active-page list here's a per-node 
> LRU already. Or you mean something on a per-address-space basis?

Yes, I meant a per-node active LRU. I'd better get a closer look
at what's already done ;)
 
> Yes, faulting the pages in lazily from another node as we touch them is 
> probably the right thing to do. Giving secondary copies some LRU disadvantage
> (perhaps always keeping them on the inactive list, never the active),
> would be fun, but then you get into the whole "who is the primary owner,
> and what do we do when they ditch the page" complexity. The node bitmap
> I suggested earlier might help. But I'd rather keep it simple at first ;-)
> 
> M.

Antonio.


* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Martin J. Bligh @ 2003-04-14 16:24 UTC
  To: Antonio Vargas; +Cc: Timothy Miller, linux-kernel, nicoya

[ snip the un-COW proposal and read-write discussion ]

> I mean MAP_PRIVATE stuff, not MAP_SHARED.

OK, unless I misunderstand you, I think that happens naturally for that
kind of thing - when we do the COW split, we'll get a node-local page
by default (unless the local node is out of memory).

M.



* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Martin J. Bligh @ 2003-04-14 16:37 UTC
  To: Antonio Vargas; +Cc: Timothy Miller, linux-kernel, nicoya

>> OK, unless I misunderstand you, I think that happens naturally for that
>> kind of thing - when we do the COW split, we'll get a node-local page
>> by default (unless the local node is out of memory).
>
> Yes, it happens naturally, but it's done when we try to write to it.
> What I meant was: at fork time, if we are forking to a different node,
> instead of COW-marking, do the COW-mark and then immediately do a sort-of
> for_each_page(touch_as_if_written(page)), so that nodes would not have to
> reference the memory from others.

Ah, you probably don't want to do that ... it's very expensive. Moreover,
if you exec 2ns later, all the effort will be wasted ... and it's very hard
to deterministically predict whether you'll exec or not (stupid UNIX 
semantics). Doing it lazily is probably best, and as to "nodes would not 
have to reference the memory from others" - you're still doing that, you're
just batching it on the front end.

> I don't know if it's really useful, and anyway I could not try to code it
> unless there is a sort of NUMA simulator for "normal" machines.

There isn't, but writing one would be very useful (and fairly simple) if
you have a 2x machine or something that you could use. I've thought about
writing this several times ... just haven't got round to it.

M.



* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Antonio Vargas @ 2003-04-14 16:43 UTC
  To: Martin J. Bligh; +Cc: Antonio Vargas, Timothy Miller, linux-kernel, nicoya

On Mon, Apr 14, 2003 at 09:24:45AM -0700, Martin J. Bligh wrote:

[ snip ]

> OK, unless I misunderstand you, I think that happens naturally for that
> kind of thing - when we do the COW split, we'll get a node-local page
> by default (unless the local node is out of memory).

Yes, it happens naturally, but it's done when we try to write to it.
What I meant was: at fork time, if we are forking to a different node,
instead of COW-marking, do the COW-mark and then immediately do a sort-of
for_each_page(touch_as_if_written(page)), so that nodes would not have to
reference the memory from others.

I don't know if it's really useful, and anyway I could not try to code it
unless there is a sort of NUMA simulator for "normal" machines.

Greets, Antonio.



* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Antonio Vargas @ 2003-04-14 17:14 UTC
  To: Martin J. Bligh; +Cc: Antonio Vargas, Timothy Miller, linux-kernel, nicoya

On Mon, Apr 14, 2003 at 09:37:07AM -0700, Martin J. Bligh wrote:
[ snip ]

> Ah, you probably don't want to do that ... it's very expensive. Moreover,
> if you exec 2ns later, all the effort will be wasted ... and it's very hard
> to deterministically predict whether you'll exec or not (stupid UNIX 
> semantics). Doing it lazily is probably best, and as to "nodes would not 
> have to reference the memory from others" - you're still doing that, you're
> just batching it on the front end.

True... What about a vma-level COW-ahead just like we have a file-level
read-ahead, then? I mean batching the COW at unCOW-because-of-write time.

btw, COW-ahead sounds really silly :)

> > I don't know if it's really useful, and anyway I could not try to code it
> > unless there is a sort of NUMA simulator for "normal" machines.
> 
> There isn't, but writing one would be very useful (and fairly simple) if
> you have a 2x machine or something that you could use. I've thought about
> writing this several times ... just haven't got round to it.

Not possible for me since I've got no SMP. But posting a quick note about
your proposed "fake-NUMA-on-SMP.patch" would be good, if only to have an
offsite (offbrain also? ;) backup of your ideas :)


* Re: Quick question about hyper-threading (also some NUMA stuff)
From: Martin J. Bligh @ 2003-04-14 17:22 UTC
  To: Antonio Vargas; +Cc: Timothy Miller, linux-kernel, nicoya

>> Ah, you probably don't want to do that ... it's very expensive. Moreover,
>> if you exec 2ns later, all the effort will be wasted ... and it's very
>> hard to deterministically predict whether you'll exec or not (stupid
>> UNIX  semantics). Doing it lazily is probably best, and as to "nodes
>> would not  have to reference the memory from others" - you're still
>> doing that, you're just batching it on the front end.
> 
> True... What about a vma-level COW-ahead just like we have a file-level
> read-ahead, then? I mean batching the COW at unCOW-because-of-write time.

That'd be interesting ... and you can test that on a UP box, it's not just
NUMA. Depends on the workload quite heavily, I suspect.
 
> btw, COW-ahead sounds really silly :)

Yeah. So be sure to call it that if it works out ... we need more things
like that ;-) Moooooo.
 
> Not possible for me since I've got no SMP. But posting a quick note about
> your proposed "fake-NUMA-on-SMP.patch" would be good, if only to have an
> offsite (offbrain also? ;) backup of your ideas :)

Oh, well basically you just need to split memory in half, and assign one
cpu to each "node" for each cpu_to_node thingy. Would be easy to just do it
as static #defines for sizes at first (most of the work in supporting a new
NUMA arch is just parsing the machine tables). Set MAX_NUMNODES to > 1,
make sure you create a pgdat for each "node", and frig with build_zonelists
and free_area_init_core a bit. 
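
A rough sketch of that static split (the sizes, names, and pfn math
below are made-up illustration, not a real arch port):

#define MAX_NUMNODES	2
#define FAKE_NODE_BYTES	(256UL << 20)	/* assume a 512MB box, half per "node" */

#define cpu_to_node(cpu)	(cpu)	/* one cpu per fake node on a 2-way */

/* which fake node does a physical page frame belong to? */
static inline int pfn_to_fake_nid(unsigned long pfn)
{
	return ((pfn << PAGE_SHIFT) >= FAKE_NODE_BYTES) ? 1 : 0;
}

/* then allocate a pgdat per fake node and teach build_zonelists()
 * and free_area_init_core() about both, as described above */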

M.




* cow-ahead N pages for fault clustering
From: Antonio Vargas @ 2003-04-14 18:32 UTC
  To: Martin J. Bligh; +Cc: Antonio Vargas, linux-kernel, nicoya


On Mon, Apr 14, 2003 at 10:22:46AM -0700, Martin J. Bligh wrote:
[ snip ]

> > True... What about a vma-level COW-ahead just like we have a file-level
> > read-ahead, then? I mean batching the COW at unCOW-because-of-write time.
> 
> That'd be interesting ... and you can test that on a UP box, it's not just
> NUMA. Depends on the workload quite heavily, I suspect.

What about the attached one? I'm compiling it right now to test in UML :)

[ snip fake-NUMA-on-SMP discussion ]

Greets, Antonio.


[-- Attachment #2: cow-ahead.patch --]

 mm/memory.c |   32 ++++++++++++++++++++++++++++----
 1 files changed, 28 insertions(+), 4 deletions(-)

diff -puN mm/memory.c~cow-ahead mm/memory.c
--- 25/mm/memory.c~cow-ahead	Mon Apr 14 20:08:44 2003
+++ 25-wind/mm/memory.c	Mon Apr 14 20:26:17 2003
@@ -1452,7 +1452,7 @@ static int do_file_page(struct mm_struct
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
-	int write_access, pte_t *pte, pmd_t *pmd)
+	int write_access, pte_t *pte, pmd_t *pmd, int *cowahead)
 {
 	pte_t entry;
 
@@ -1471,8 +1471,11 @@ static inline int handle_pte_fault(struc
 	}
 
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			if(!cowahead)
+				*cowahead = 1;
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
+		}
 
 		entry = pte_mkdirty(entry);
 	}
@@ -1492,6 +1495,17 @@ int handle_mm_fault(struct mm_struct *mm
 	pgd_t *pgd;
 	pmd_t *pmd;
 
+	int cowahead, i;
+	int retval, x;
+
+	/*
+	 * Implement cow-ahead: copy-on-write several
+	 * pages when we fault one of them
+	 */
+
+	i = cowahead = 0;
+
+do_cowahead:
 	__set_current_state(TASK_RUNNING);
 	pgd = pgd_offset(mm, address);
 
@@ -1509,8 +1523,18 @@ int handle_mm_fault(struct mm_struct *mm
 
 	if (pmd) {
 		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
-			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+		if (!pte) break;
+
+		x = handle_pte_fault(mm, vma, address, write_access, pte, pmd, cowahead);
+		if(!i) retval = x;
+
+		i++;
+		address += PAGE_SIZE;
+
+		if(!cowahead || i >= 0 || address >= vma->vm_end)
+			return retval;
+
+		goto do_cowahead;
 	}
 	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;

_


* Re: cow-ahead N pages for fault clustering
From: Antonio Vargas @ 2003-04-14 18:47 UTC
  To: Antonio Vargas; +Cc: Martin J. Bligh, linux-kernel, nicoya


On Mon, Apr 14, 2003 at 08:32:51PM +0200, Antonio Vargas wrote:

[ snip ]

> What about the attached one? I'm compiling it right now to test in UML :)

OK, too quick for me... this next one applies, compiles and boots on
2.5.66 + uml. Now I wonder how I can test whether this is useful... ideas?

Greets, Antonio.

[-- Attachment #2: cow-ahead.patch --]

 mm/memory.c |   34 +++++++++++++++++++++++++++++-----
 1 files changed, 29 insertions(+), 5 deletions(-)

diff -puN mm/memory.c~cow-ahead mm/memory.c
--- 25/mm/memory.c~cow-ahead	Mon Apr 14 20:08:44 2003
+++ 25-wind/mm/memory.c	Mon Apr 14 20:37:42 2003
@@ -1452,7 +1452,7 @@ static int do_file_page(struct mm_struct
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 	struct vm_area_struct * vma, unsigned long address,
-	int write_access, pte_t *pte, pmd_t *pmd)
+	int write_access, pte_t *pte, pmd_t *pmd, int *cowahead)
 {
 	pte_t entry;
 
@@ -1471,8 +1471,11 @@ static inline int handle_pte_fault(struc
 	}
 
 	if (write_access) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			if(!*cowahead)
+				*cowahead = 1;
 			return do_wp_page(mm, vma, address, pte, pmd, entry);
+		}
 
 		entry = pte_mkdirty(entry);
 	}
@@ -1492,6 +1495,17 @@ int handle_mm_fault(struct mm_struct *mm
 	pgd_t *pgd;
 	pmd_t *pmd;
 
+	int cowahead, i;
+	int retval, x;
+
+	/*
+	 * Implement cow-ahead: copy-on-write several
+	 * pages when we fault one of them
+	 */
+
+	i = cowahead = 0;
+
+do_cowahead:
 	__set_current_state(TASK_RUNNING);
 	pgd = pgd_offset(mm, address);
 
@@ -1507,10 +1521,20 @@ int handle_mm_fault(struct mm_struct *mm
 	spin_lock(&mm->page_table_lock);
 	pmd = pmd_alloc(mm, pgd, address);
 
-	if (pmd) {
+	while (pmd) {
 		pte_t * pte = pte_alloc_map(mm, pmd, address);
-		if (pte)
-			return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
+		if (!pte) break;
+
+		x = handle_pte_fault(mm, vma, address, write_access, pte, pmd, &cowahead);
+		if(!i) retval = x;
+
+		i++;
+		address += PAGE_SIZE;
+
+		if(!cowahead || i >= 0 || address >= vma->vm_end)
+			return retval;
+
+		goto do_cowahead;
 	}
 	spin_unlock(&mm->page_table_lock);
 	return VM_FAULT_OOM;

_


* Re: cow-ahead N pages for fault clustering
From: Martin J. Bligh @ 2003-04-15  5:49 UTC
  To: Antonio Vargas; +Cc: linux-kernel, nicoya

[ snip ]

> OK, too quick for me... this next one applies, compiles and boots on
> 2.5.66 + uml. Now I wonder how I can test whether this is useful... ideas?

Well, benchmark it ;-) My favourite trick is to just 
"/usr/bin/time make bzImage" on some fixed kernel version & config,
but aim7 / aim9 is pretty easy to set up too, and might be interesting.

M.



* Re: cow-ahead N pages for fault clustering
From: Antonio Vargas @ 2003-04-18 17:35 UTC
  To: Martin J. Bligh; +Cc: Antonio Vargas, linux-kernel, nicoya

On Mon, Apr 14, 2003 at 10:49:03PM -0700, Martin J. Bligh wrote:

[ snip ]

> Well, benchmark it ;-) My favourite trick is to just
> "/usr/bin/time make bzImage" on some fixed kernel version & config,
> but aim7 / aim9 is pretty easy to set up too, and might be interesting.

I've benchmarked my patch (changed to COW 2 pages per fault) with this loop:

make allnoconfig
date >>aaa
make bzImage
date >>aaa

and then manually checked the time difference.

It took the same time on both vanilla 2.5.66 and my 2.5.66+cowahead.

Perhaps it's better for other workloads...

ps. my posted patch had a little bug: it did the cow loop only 1 time,
    so it only cow'ed 1 page... be sure to change the end test if
    you want to benchmark it further.
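
For anyone retrying the benchmark, a hedged fix for that end test
(COWAHEAD_PAGES is a made-up tunable; 2 matches the 2-pages-per-fault
setup benchmarked above; the posted "i >= 0" stops after one page):

#define COWAHEAD_PAGES 2

		if (!cowahead || i >= COWAHEAD_PAGES || address >= vma->vm_end)
			return retval;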

