linux-kernel.vger.kernel.org archive mirror
* Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
@ 2002-04-26 18:27 Russell King
  2002-04-26 22:46 ` Andrea Arcangeli
  2002-04-27 22:10 ` Daniel Phillips
  0 siblings, 2 replies; 165+ messages in thread
From: Russell King @ 2002-04-26 18:27 UTC (permalink / raw)
  To: linux-kernel

Hi,

I've been looking at some of the ARM discontigmem implementations, and
have come across a nasty bug.  To illustrate this, I'm going to take
part of the generic kernel, and use the Alpha implementation to
illustrate the problem we're facing on ARM.

I'm going to argue here that virt_to_page() can, in the discontigmem
case, produce rather a nasty bug when used with non-direct mapped
kernel memory arguments.

In mm/memory.c:remap_pte_range() we have the following code:

                page = virt_to_page(__va(phys_addr));
                if ((!VALID_PAGE(page)) || PageReserved(page))
                        set_pte(pte, mk_pte_phys(phys_addr, prot));

Let's look closely at the first line:

                page = virt_to_page(__va(phys_addr));

Essentially, what we're trying to do here is convert a physical address
to a struct page pointer.

__va() is defined, on Alpha, to be:

	#define __va(x)  ((void *)((unsigned long) (x) + PAGE_OFFSET))

so we produce a unique "va" for any physical address that is passed.  No
problem so far.  Now, let's look at virt_to_page() for the Alpha:

	#define virt_to_page(kaddr) \
	     (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))

Looks innocuous enough.  However, look closer at ADDR_TO_MAPBASE:

	#define ADDR_TO_MAPBASE(kaddr) \
                        NODE_MEM_MAP(KVADDR_TO_NID((unsigned long)(kaddr)))
	#define NODE_MEM_MAP(nid)       (NODE_DATA(nid)->node_mem_map)
	#define NODE_DATA(n)		(&((PLAT_NODE_DATA(n))->gendata))
	#define PLAT_NODE_DATA(n)       (plat_node_data[(n)])

Ok, so here we get the map base via:

	plat_node_data[KVADDR_TO_NID((unsigned long)(kaddr))]->
		gendata.node_mem_map

plat_node_data is declared as:

	plat_pg_data_t *plat_node_data[MAX_NUMNODES];

Let's look closer at KVADDR_TO_NID() and MAX_NUMNODES:

	#define KVADDR_TO_NID(kaddr)    PHYSADDR_TO_NID(__pa(kaddr))
	#define __pa(x)                 ((unsigned long) (x) - PAGE_OFFSET)
	#define PHYSADDR_TO_NID(pa)     ALPHA_PA_TO_NID(pa)
	#define ALPHA_PA_TO_NID(pa)     ((pa) >> 36)    /* 16 nodes max due 43bit kseg */

	#define MAX_NUMNODES            WILDFIRE_MAX_QBB
	#define WILDFIRE_MAX_QBB        8       /* more than 8 requires other mods */

So, we have a maximum of 8 nodes total, and therefore the plat_node_data
array is 8 entries large.

Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has
opened /dev/mem and mapped some random bit of physical memory space)?

__pa returns a large positive number.  We shift this large positive
number right by 36 bits, leaving a 28-bit positive number, which is
larger than our total of 8 nodes.

We use this 28-bit number to index plat_node_data.  Whoops.
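The failure mode is easy to reproduce in a tiny user-space model of the
macros quoted above (a sketch only: the model_* names are mine, the
constants mirror the quoted definitions):

```c
#include <stdint.h>

/* Stand-alone model of the Alpha discontigmem macros quoted above.
 * The model_* names are hypothetical; the constants mirror the kernel's. */
#define MODEL_PAGE_OFFSET  0xfffffc0000000000UL  /* Alpha PAGE_OFFSET (kseg) */
#define MODEL_MAX_NUMNODES 8                     /* WILDFIRE_MAX_QBB */

static inline uint64_t model_pa(uint64_t kaddr)             /* __pa() */
{
	return kaddr - MODEL_PAGE_OFFSET;
}

static inline uint64_t model_kvaddr_to_nid(uint64_t kaddr)  /* KVADDR_TO_NID() */
{
	return model_pa(kaddr) >> 36;                       /* ALPHA_PA_TO_NID() */
}

/* 1 if virt_to_page() would index plat_node_data[] out of bounds */
static inline int model_nid_bogus(uint64_t kaddr)
{
	return model_kvaddr_to_nid(kaddr) >= MODEL_MAX_NUMNODES;
}
```

Feed it a real kseg address and the node index is sane; feed it a
user-space address such as 0x120000000 and __pa wraps, producing a node
index of 64 - well past the 8-entry array.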

And now, for the icing on the cake, take a look at Alpha's pte_page()
implementation:

        unsigned long kvirt;                                            \
        struct page * __xx;                                             \
                                                                        \
        kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT));     \
        __xx = virt_to_page(kvirt);                                     \
                                                                        \
        __xx;                                                           \


Someone *please* tell me where I'm wrong.  I really want to be wrong,
because I can see the same thing happening (in theory, and one report
in practice from a developer) on a certain ARM platform.

On ARM, however, we have a cherry to add here.  __va() may alias certain
physical memory addresses to the same virtual memory address, which
makes:

	VALID_PAGE(virt_to_page(__va(phys)))

completely nonsensical.
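A sketch of that aliasing (the window size and SKETCH_* values here are
made up for illustration, not any particular ARM platform's layout):

```c
#include <stdint.h>

/* Hypothetical ARM-style __va() that folds several discontiguous
 * physical banks onto the same kernel virtual window.  The constants
 * are invented for this sketch. */
#define SKETCH_PAGE_OFFSET 0xc0000000UL
#define SKETCH_BANK_MASK   0x07ffffffUL   /* fold everything into 128MB */

static inline unsigned long sketch_va(unsigned long phys)
{
	return SKETCH_PAGE_OFFSET + (phys & SKETCH_BANK_MASK);
}
```

Two distinct physical addresses, e.g. 0x08000000 and 0x10000000, land on
the same virtual address, so the result of virt_to_page(__va(phys)) is
ambiguous before validity even enters the picture.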

I'll try kicking myself 3 times to see if I wake up from this horrible
dream now. 8)

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-26 18:27 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Russell King
@ 2002-04-26 22:46 ` Andrea Arcangeli
  2002-04-29 17:50   ` Martin J. Bligh
  2002-04-29 22:00   ` Roman Zippel
  2002-04-27 22:10 ` Daniel Phillips
  1 sibling, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-04-26 22:46 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel

On Fri, Apr 26, 2002 at 07:27:11PM +0100, Russell King wrote:
> Hi,
> 
> I've been looking at some of the ARM discontigmem implementations, and
> have come across a nasty bug.  To illustrate this, I'm going to take
> part of the generic kernel, and use the Alpha implementation to
> illustrate the problem we're facing on ARM.
> 
> I'm going to argue here that virt_to_page() can, in the discontigmem
> case, produce rather a nasty bug when used with non-direct mapped
> kernel memory arguments.
> 
> In mm/memory.c:remap_pte_range() we have the following code:
> 
>                 page = virt_to_page(__va(phys_addr));
>                 if ((!VALID_PAGE(page)) || PageReserved(page))
>                         set_pte(pte, mk_pte_phys(phys_addr, prot));
> 
> Let's look closely at the first line:
> 
>                 page = virt_to_page(__va(phys_addr));
> 
> Essentially, what we're trying to do here is convert a physical address
> to a struct page pointer.
> 
> __va() is defined, on Alpha, to be:
> 
> 	#define __va(x)  ((void *)((unsigned long) (x) + PAGE_OFFSET))
> 
> so we produce a unique "va" for any physical address that is passed.  No
> problem so far.  Now, let's look at virt_to_page() for the Alpha:
> 
> 	#define virt_to_page(kaddr) \
> 	     (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
> 
> Looks innocuous enough.  However, look closer at ADDR_TO_MAPBASE:
> 
> 	#define ADDR_TO_MAPBASE(kaddr) \
>                         NODE_MEM_MAP(KVADDR_TO_NID((unsigned long)(kaddr)))
> 	#define NODE_MEM_MAP(nid)       (NODE_DATA(nid)->node_mem_map)
> 	#define NODE_DATA(n)		(&((PLAT_NODE_DATA(n))->gendata))
> 	#define PLAT_NODE_DATA(n)       (plat_node_data[(n)])
> 
> Ok, so here we get the map base via:
> 
> 	plat_node_data[KVADDR_TO_NID((unsigned long)(kaddr))]->
> 		gendata.node_mem_map
> 
> plat_node_data is declared as:
> 
> 	plat_pg_data_t *plat_node_data[MAX_NUMNODES];
> 
> Let's look closer at KVADDR_TO_NID() and MAX_NUMNODES:
> 
> 	#define KVADDR_TO_NID(kaddr)    PHYSADDR_TO_NID(__pa(kaddr))
> 	#define __pa(x)                 ((unsigned long) (x) - PAGE_OFFSET)
> 	#define PHYSADDR_TO_NID(pa)     ALPHA_PA_TO_NID(pa)
> 	#define ALPHA_PA_TO_NID(pa)     ((pa) >> 36)    /* 16 nodes max due 43bit kseg */
> 
> 	#define MAX_NUMNODES            WILDFIRE_MAX_QBB
> 	#define WILDFIRE_MAX_QBB        8       /* more than 8 requires other mods */
> 
> So, we have a maximum of 8 nodes total, and therefore the plat_node_data
> array is 8 entries large.
> 
> Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has
> opened /dev/mem and mapped some random bit of physical memory space)?
> 
> __pa returns a large positive number.  We shift this large positive
> number right by 36 bits, leaving a 28-bit positive number, which is
> larger than our total of 8 nodes.
> 
> We use this 28-bit number to index plat_node_data.  Whoops.
> 
> And now, for the icing on the cake, take a look at Alpha's pte_page()
> implementation:
> 
>         unsigned long kvirt;                                            \
>         struct page * __xx;                                             \
>                                                                         \
>         kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT));     \
>         __xx = virt_to_page(kvirt);                                     \
>                                                                         \
>         __xx;                                                           \
> 
> 
> Someone *please* tell me where I'm wrong.  I really want to be wrong,
> because I can see the same thing happening (in theory, and one report
> in practice from a developer) on a certain ARM platform.
> 
> On ARM, however, we have a cherry to add here.  __va() may alias certain
> physical memory addresses to the same virtual memory address, which
> makes:
> 
> 	VALID_PAGE(virt_to_page(__va(phys)))
> 
> completely nonsensical.

correct. This should fix it:

--- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~	Fri Apr 26 10:28:28 2002
+++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h	Sat Apr 27 00:30:02 2002
@@ -106,8 +106,8 @@
 #define kern_addr_valid(kaddr)	test_bit(LOCAL_MAP_NR(kaddr), \
 					 NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap)
 
-#define virt_to_page(kaddr)	(ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
-#define VALID_PAGE(page)	(((page) - mem_map) < max_mapnr)
+#define virt_to_page(kaddr)	(KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0)
+#define VALID_PAGE(page)	((page) != NULL)
 
 #ifdef CONFIG_NUMA
 #ifdef CONFIG_NUMA_SCHED

It still doesn't cover the ram between the end of a node and the start
of the next node, but at least on alpha-wildfire there can be nothing
mapped there (it's reserved for "more dimm ram" slots) and it would be
even more costly to check if the address is in those intra-node holes.

The invalid pages will now start at phys addr 64G*8, which is the
maximum ram that linux can handle on the wildfire.  If you mmap the
intra-node ram via /dev/mem you risk trouble anyway, because there's no
dimm there and the effect is probably undefined or unpredictable; it's
just a "mustn't do that", and /dev/mem is a "root" thing, so the above
approach looks fine to me.
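In user-space terms the guarded macro behaves like this bounds-checked
lookup (a sketch with made-up node data; the real macro deals in struct
page pointers):

```c
#include <stdint.h>
#include <stddef.h>

#define GUARD_MAX_NUMNODES 8

/* Hypothetical per-node mem_map storage, standing in for
 * plat_node_data[nid]->gendata.node_mem_map. */
static int node_mem_map[GUARD_MAX_NUMNODES][16];

/* Bounds-checked virt_to_page(): a NULL return replaces the old
 * unconditional plat_node_data[nid] dereference. */
static inline int *guarded_virt_to_page(uint64_t nid, uint64_t local_nr)
{
	if (nid >= GUARD_MAX_NUMNODES)
		return NULL;   /* was: wild read off the end of the array */
	return &node_mem_map[nid][local_nr];
}

/* VALID_PAGE() collapses to a simple NULL check */
static inline int guard_valid_page(const int *page)
{
	return page != NULL;
}
```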

Andrea


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-26 18:27 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Russell King
  2002-04-26 22:46 ` Andrea Arcangeli
@ 2002-04-27 22:10 ` Daniel Phillips
  2002-04-29 13:35   ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-04-27 22:10 UTC (permalink / raw)
  To: Russell King, linux-kernel

On Friday 26 April 2002 20:27, Russell King wrote:
> Hi,
> 
> I've been looking at some of the ARM discontigmem implementations, and
> have come across a nasty bug.  To illustrate this, I'm going to take
> part of the generic kernel, and use the Alpha implementation to
> illustrate the problem we're facing on ARM.
> 
> I'm going to argue here that virt_to_page() can, in the discontigmem
> case, produce rather a nasty bug when used with non-direct mapped
> kernel memory arguments.

It's tough to follow, even when you know the code.  While cooking my
config_nonlinear patch I noticed the line you're concerned about and
regarded it with deep suspicion.  My patch does this:

-               page = virt_to_page(__va(phys_addr));
+               page = phys_to_page(phys_addr);

And of course took care that phys_to_page does the right thing in all
cases.
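One way a phys_to_page() can "do the right thing" is an explicit search
over per-node physical ranges; this is a sketch of the idea under
invented constants, not Daniel's actual implementation:

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12

struct sketch_node {
	uint64_t phys_start, phys_end;   /* [start, end) of this node's ram */
	long     map_base;               /* first page index in a flat map */
};

/* Two made-up nodes with a hole between them */
static const struct sketch_node nodes[] = {
	{ 0x00000000, 0x08000000, 0 },
	{ 0x40000000, 0x48000000, 0x08000000 >> SKETCH_PAGE_SHIFT },
};

/* Return the page index for a physical address, or -1 if it falls in a
 * hole or past the end of ram -- never an out-of-bounds access. */
static long sketch_phys_to_page(uint64_t phys)
{
	for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
		if (phys >= nodes[i].phys_start && phys < nodes[i].phys_end)
			return nodes[i].map_base +
			       ((phys - nodes[i].phys_start) >> SKETCH_PAGE_SHIFT);
	}
	return -1;
}
```

Addresses in the hole or beyond the last node get a sentinel back
instead of a garbage struct page pointer.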

<plug>
The new config_nonlinear was designed as a cleaner, more powerful
replacement for all non-numa uses of config_discontigmem.
</plug>

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-27 22:10 ` Daniel Phillips
@ 2002-04-29 13:35   ` Andrea Arcangeli
  2002-04-29 23:02     ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-04-29 13:35 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, linux-kernel

On Sun, Apr 28, 2002 at 12:10:20AM +0200, Daniel Phillips wrote:
> On Friday 26 April 2002 20:27, Russell King wrote:
> > Hi,
> > 
> > I've been looking at some of the ARM discontigmem implementations, and
> > have come across a nasty bug.  To illustrate this, I'm going to take
> > part of the generic kernel, and use the Alpha implementation to
> > illustrate the problem we're facing on ARM.
> > 
> > I'm going to argue here that virt_to_page() can, in the discontigmem
> > case, produce rather a nasty bug when used with non-direct mapped
> > kernel memory arguments.
> 
> It's tough to follow, even when you know the code.  While cooking my
> config_nonlinear patch I noticed the line you're concerned about and
> regarded it with deep suspicion.  My patch does this:
> 
> -               page = virt_to_page(__va(phys_addr));
> +               page = phys_to_page(phys_addr);
> 
> And of course took care that phys_to_page does the right thing in all
> cases.

The problem remains the same also going from phys to page: the problem
is that mem_map isn't contiguous, and it choked when the phys addr was
above the max ram physaddr.  The patch I posted a few days ago will fix
it (modulo unused ram space, but attempting to map unused ram space into
the address space is a bug in the first place).

> 
> <plug>
> The new config_nonlinear was designed as a cleaner, more powerful
> replacement for all non-numa uses of config_discontigmem.
> </plug>

I may be wrong because I've only had a short look at it so far, but the
"non-numa" part is what I noticed too, and that's what renders it not a
very interesting option IMHO.  Most discontigmem users need numa too.
If it cannot handle numa, it isn't worth adding the complexity there;
with numa we must view those chunks differently, not linearly.  Also,
there's nothing magic that says mem_map must have a special meaning; it
isn't worth preserving the mem_map thing.  virt_to_page is a much
cleaner abstraction than doing mem_map + pfn by hand.

Andrea


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-26 22:46 ` Andrea Arcangeli
@ 2002-04-29 17:50   ` Martin J. Bligh
  2002-04-29 22:00   ` Roman Zippel
  1 sibling, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-04-29 17:50 UTC (permalink / raw)
  To: Andrea Arcangeli, Russell King; +Cc: linux-kernel, Daniel Phillips

>>                 page = virt_to_page(__va(phys_addr));
>>
>> ...
>>
>> __va() is defined, on Alpha, to be:
>> 
>> 	# define __va(x)  ((void *)((unsigned long) (x) + PAGE_OFFSET))
>> 
>> ...
>> 
>> Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has
>> opened /dev/mem and mapped some random bit of physical memory space)?

But we generated kaddr by using __va, as above? If the user mapped /dev/mem
and created a second possible answer for a P->V mapping, that seems
irrelevant, as long as __va always returns the "primary" mapping into kernel
virtual address space.

I'd agree we're lacking some error checking here (maybe virt_to_page should
be an inline that checks that kaddr really is a kernel virtual address), but I 
can't see a real practical problem in the scenario you describe.  As
other people seem able to, maybe I'm missing something ;-)

I'm not sure if your arch is a 32-bit or 64-bit arch, but I see more of
a problem in this code if we do "page = virt_to_page(__va(phys_addr));"
on a physaddr that's in HIGHMEM on a 32-bit arch, in which case we get
garbage from the wrapping, and Daniel's "page = phys_to_page(phys_addr);"
makes infinitely more sense.

Martin.



* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-26 22:46 ` Andrea Arcangeli
  2002-04-29 17:50   ` Martin J. Bligh
@ 2002-04-29 22:00   ` Roman Zippel
  2002-04-30  0:43     ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-04-29 22:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, linux-kernel

Hi,

On Sat, 27 Apr 2002, Andrea Arcangeli wrote:

> correct. This should fix it:
> 
> --- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~	Fri Apr 26 10:28:28 2002
> +++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h	Sat Apr 27 00:30:02 2002
> @@ -106,8 +106,8 @@
>  #define kern_addr_valid(kaddr)	test_bit(LOCAL_MAP_NR(kaddr), \
>  					 NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap)
>  
> -#define virt_to_page(kaddr)	(ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
> -#define VALID_PAGE(page)	(((page) - mem_map) < max_mapnr)
> +#define virt_to_page(kaddr)	(KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0)
> +#define VALID_PAGE(page)	((page) != NULL)
>  
>  #ifdef CONFIG_NUMA
>  #ifdef CONFIG_NUMA_SCHED

I'd prefer it if VALID_PAGE went away completely; that test was almost
always too late.  What about the patch below?  It even reduces the code
size by 1072 bytes (but is otherwise untested).
It introduces virt_to_valid_page and pte_valid_page, which include a
check whether the input is valid.

bye, Roman

Index: arch/arm/mach-arc/small_page.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/arm/mach-arc/small_page.c,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 small_page.c
--- arch/arm/mach-arc/small_page.c	15 Jan 2002 18:12:17 -0000	1.1.1.1
+++ arch/arm/mach-arc/small_page.c	29 Apr 2002 20:38:49 -0000
@@ -150,8 +150,8 @@ static void __free_small_page(unsigned l
 	unsigned long flags;
 	struct page *page;
 
-	page = virt_to_page(spage);
-	if (VALID_PAGE(page)) {
+	page = virt_to_valid_page(spage);
+	if (page) {
 
 		/*
 		 * The container-page must be marked Reserved
Index: arch/arm/mm/fault-armv.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/arm/mm/fault-armv.c,v
retrieving revision 1.1.1.5
diff -u -p -r1.1.1.5 fault-armv.c
--- arch/arm/mm/fault-armv.c	14 Apr 2002 20:06:12 -0000	1.1.1.5
+++ arch/arm/mm/fault-armv.c	29 Apr 2002 19:18:37 -0000
@@ -240,9 +240,9 @@ make_coherent(struct vm_area_struct *vma
  */
 void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
-	struct page *page = pte_page(pte);
+	struct page *page = pte_valid_page(pte);
 
-	if (VALID_PAGE(page) && page->mapping) {
+	if (page && page->mapping) {
 		if (test_and_clear_bit(PG_dcache_dirty, &page->flags))
 			__flush_dcache_page(page);
 
Index: arch/ia64/mm/init.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/ia64/mm/init.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 init.c
--- arch/ia64/mm/init.c	24 Apr 2002 19:35:43 -0000	1.1.1.3
+++ arch/ia64/mm/init.c	29 Apr 2002 20:39:05 -0000
@@ -147,7 +147,7 @@ free_initrd_mem (unsigned long start, un
 		printk(KERN_INFO "Freeing initrd memory: %ldkB freed\n", (end - start) >> 10);
 
 	for (; start < end; start += PAGE_SIZE) {
-		if (!VALID_PAGE(virt_to_page(start)))
+		if (!virt_to_valid_page(start))
 			continue;
 		clear_bit(PG_reserved, &virt_to_page(start)->flags);
 		set_page_count(virt_to_page(start), 1);
Index: arch/mips/mm/umap.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/mips/mm/umap.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 umap.c
--- arch/mips/mm/umap.c	31 Jan 2002 22:19:02 -0000	1.1.1.2
+++ arch/mips/mm/umap.c	29 Apr 2002 19:17:45 -0000
@@ -116,8 +116,8 @@ void *vmalloc_uncached (unsigned long si
 static inline void free_pte(pte_t page)
 {
 	if (pte_present(page)) {
-		struct page *ptpage = pte_page(page);
-		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+		struct page *ptpage = pte_valid_page(page);
+		if (!ptpage || PageReserved(ptpage))
 			return;
 		__free_page(ptpage);
 		if (current->mm->rss <= 0)
Index: arch/mips64/mm/umap.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/mips64/mm/umap.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 umap.c
--- arch/mips64/mm/umap.c	31 Jan 2002 22:19:51 -0000	1.1.1.2
+++ arch/mips64/mm/umap.c	29 Apr 2002 19:17:29 -0000
@@ -115,8 +115,8 @@ void *vmalloc_uncached (unsigned long si
 static inline void free_pte(pte_t page)
 {
 	if (pte_present(page)) {
-		struct page *ptpage = pte_page(page);
-		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+		struct page *ptpage = pte_valid_page(page);
+		if (!ptpage || PageReserved(ptpage))
 			return;
 		__free_page(ptpage);
 		if (current->mm->rss <= 0)
Index: arch/sh/mm/fault.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sh/mm/fault.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 fault.c
--- arch/sh/mm/fault.c	31 Jan 2002 22:19:42 -0000	1.1.1.2
+++ arch/sh/mm/fault.c	29 Apr 2002 19:17:11 -0000
@@ -298,8 +298,8 @@ void update_mmu_cache(struct vm_area_str
 		return;
 
 #if defined(__SH4__)
-	page = pte_page(pte);
-	if (VALID_PAGE(page) && !test_bit(PG_mapped, &page->flags)) {
+	page = pte_valid_page(pte);
+	if (page && !test_bit(PG_mapped, &page->flags)) {
 		unsigned long phys = pte_val(pte) & PTE_PHYS_MASK;
 		__flush_wback_region((void *)P1SEGADDR(phys), PAGE_SIZE);
 		__set_bit(PG_mapped, &page->flags);
Index: arch/sparc/mm/generic.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc/mm/generic.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 generic.c
--- arch/sparc/mm/generic.c	31 Jan 2002 22:19:00 -0000	1.1.1.2
+++ arch/sparc/mm/generic.c	29 Apr 2002 19:16:58 -0000
@@ -19,8 +19,8 @@ static inline void forget_pte(pte_t page
 	if (pte_none(page))
 		return;
 	if (pte_present(page)) {
-		struct page *ptpage = pte_page(page);
-		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+		struct page *ptpage = pte_valid_page(page);
+		if (!ptpage || PageReserved(ptpage))
 			return;
 		page_cache_release(ptpage);
 		return;
Index: arch/sparc/mm/sun4c.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc/mm/sun4c.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 sun4c.c
--- arch/sparc/mm/sun4c.c	14 Apr 2002 20:05:32 -0000	1.1.1.3
+++ arch/sparc/mm/sun4c.c	29 Apr 2002 20:39:35 -0000
@@ -1327,7 +1327,7 @@ static __u32 sun4c_get_scsi_one(char *bu
 	unsigned long page;
 
 	page = ((unsigned long)bufptr) & PAGE_MASK;
-	if (!VALID_PAGE(virt_to_page(page))) {
+	if (!virt_to_valid_page(page)) {
 		sun4c_flush_page(page);
 		return (__u32)bufptr; /* already locked */
 	}
@@ -2106,7 +2106,7 @@ static void sun4c_pte_clear(pte_t *ptep)
 static int sun4c_pmd_bad(pmd_t pmd)
 {
 	return (((pmd_val(pmd) & ~PAGE_MASK) != PGD_TABLE) ||
-		(!VALID_PAGE(virt_to_page(pmd_val(pmd)))));
+		(!virt_to_valid_page(pmd_val(pmd))));
 }
 
 static int sun4c_pmd_present(pmd_t pmd)
Index: arch/sparc64/kernel/traps.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/kernel/traps.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 traps.c
--- arch/sparc64/kernel/traps.c	11 Feb 2002 18:49:01 -0000	1.1.1.3
+++ arch/sparc64/kernel/traps.c	29 Apr 2002 20:39:53 -0000
@@ -1284,9 +1284,9 @@ void cheetah_deferred_handler(struct pt_
 			}
 
 			if (recoverable) {
-				struct page *page = virt_to_page(__va(afar));
+				struct page *page = virt_to_valid_page(__va(afar));
 
-				if (VALID_PAGE(page))
+				if (page)
 					get_page(page);
 				else
 					recoverable = 0;
Index: arch/sparc64/mm/generic.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/mm/generic.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 generic.c
--- arch/sparc64/mm/generic.c	19 Mar 2002 01:27:51 -0000	1.1.1.3
+++ arch/sparc64/mm/generic.c	29 Apr 2002 19:14:35 -0000
@@ -19,8 +19,8 @@ static inline void forget_pte(pte_t page
 	if (pte_none(page))
 		return;
 	if (pte_present(page)) {
-		struct page *ptpage = pte_page(page);
-		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+		struct page *ptpage = pte_valid_page(page);
+		if (!ptpage || PageReserved(ptpage))
 			return;
 		page_cache_release(ptpage);
 		return;
Index: arch/sparc64/mm/init.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/mm/init.c,v
retrieving revision 1.1.1.6
diff -u -p -r1.1.1.6 init.c
--- arch/sparc64/mm/init.c	14 Apr 2002 20:06:08 -0000	1.1.1.6
+++ arch/sparc64/mm/init.c	29 Apr 2002 19:14:15 -0000
@@ -187,11 +187,10 @@ extern void __update_mmu_cache(unsigned 
 
 void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t pte)
 {
-	struct page *page = pte_page(pte);
+	struct page *page = pte_valid_page(pte);
 	unsigned long pg_flags;
 
-	if (VALID_PAGE(page) &&
-	    page->mapping &&
+	if (page && page->mapping &&
 	    ((pg_flags = page->flags) & (1UL << PG_dcache_dirty))) {
 		int cpu = ((pg_flags >> 24) & (NR_CPUS - 1UL));
 
@@ -260,10 +259,10 @@ static inline void flush_cache_pte_range
 			continue;
 
 		if (pte_present(pte) && pte_dirty(pte)) {
-			struct page *page = pte_page(pte);
+			struct page *page = pte_valid_page(pte);
 			unsigned long pgaddr, uaddr;
 
-			if (!VALID_PAGE(page) || PageReserved(page) || !page->mapping)
+			if (!page || PageReserved(page) || !page->mapping)
 				continue;
 			pgaddr = (unsigned long) page_address(page);
 			uaddr = address + offset;
Index: fs/proc/array.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/fs/proc/array.c,v
retrieving revision 1.1.1.7
diff -u -p -r1.1.1.7 array.c
--- fs/proc/array.c	14 Apr 2002 20:01:10 -0000	1.1.1.7
+++ fs/proc/array.c	29 Apr 2002 19:12:38 -0000
@@ -424,8 +424,8 @@ static inline void statm_pte_range(pmd_t
 		++*total;
 		if (!pte_present(page))
 			continue;
-		ptpage = pte_page(page);
-		if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+		ptpage = pte_valid_page(page);
+		if (!ptpage || PageReserved(ptpage))
 			continue;
 		++*pages;
 		if (pte_dirty(page))
Index: include/asm-cris/processor.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-cris/processor.h,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 processor.h
--- include/asm-cris/processor.h	31 Jan 2002 22:16:02 -0000	1.1.1.2
+++ include/asm-cris/processor.h	29 Apr 2002 20:40:17 -0000
@@ -101,8 +101,7 @@ unsigned long get_wchan(struct task_stru
     ({                  \
         unsigned long eip = 0;   \
         unsigned long regs = (unsigned long)user_regs(tsk); \
-        if (regs > PAGE_SIZE && \
-            VALID_PAGE(virt_to_page(regs))) \
+        if (regs > PAGE_SIZE && virt_to_valid_page(regs)) \
               eip = ((struct pt_regs *)regs)->irp; \
         eip; })
 
Index: include/asm-i386/page.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/page.h,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 page.h
--- include/asm-i386/page.h	24 Feb 2002 23:11:41 -0000	1.1.1.3
+++ include/asm-i386/page.h	29 Apr 2002 21:09:09 -0000
@@ -132,7 +132,10 @@ static __inline__ int get_order(unsigned
 #define __pa(x)			((unsigned long)(x)-PAGE_OFFSET)
 #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
 #define virt_to_page(kaddr)	(mem_map + (__pa(kaddr) >> PAGE_SHIFT))
-#define VALID_PAGE(page)	((page - mem_map) < max_mapnr)
+#define virt_to_valid_page(kaddr) ({ \
+	unsigned long __paddr = __pa(kaddr); \
+	__paddr < max_mapnr ? mem_map + (__paddr >> PAGE_SHIFT) : NULL; \
+})
 
 #define VM_DATA_DEFAULT_FLAGS	(VM_READ | VM_WRITE | VM_EXEC | \
 				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
Index: include/asm-i386/pgtable-2level.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-2level.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 pgtable-2level.h
--- include/asm-i386/pgtable-2level.h	26 Nov 2001 19:29:55 -0000	1.1.1.1
+++ include/asm-i386/pgtable-2level.h	29 Apr 2002 21:13:29 -0000
@@ -57,6 +57,7 @@ static inline pmd_t * pmd_offset(pgd_t *
 #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
 #define pte_same(a, b)		((a).pte_low == (b).pte_low)
 #define pte_page(x)		(mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
+#define pte_valid_page(x)	(pte_val(x) < max_mapnr ? pte_page(x) : NULL)
 #define pte_none(x)		(!(x).pte_low)
 #define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))
 
Index: include/asm-i386/pgtable-3level.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-3level.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 pgtable-3level.h
--- include/asm-i386/pgtable-3level.h	26 Nov 2001 19:29:55 -0000	1.1.1.1
+++ include/asm-i386/pgtable-3level.h	29 Apr 2002 21:13:08 -0000
@@ -87,6 +87,7 @@ static inline int pte_same(pte_t a, pte_
 }
 
 #define pte_page(x)	(mem_map+(((x).pte_low >> PAGE_SHIFT) | ((x).pte_high << (32 - PAGE_SHIFT))))
+#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL)
 #define pte_none(x)	(!(x).pte_low && !(x).pte_high)
 
 static inline pte_t __mk_pte(unsigned long page_nr, pgprot_t pgprot)
Index: include/asm-m68k/processor.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-m68k/processor.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 processor.h
--- include/asm-m68k/processor.h	26 Nov 2001 19:29:57 -0000	1.1.1.1
+++ include/asm-m68k/processor.h	29 Apr 2002 20:40:37 -0000
@@ -139,7 +139,7 @@ unsigned long get_wchan(struct task_stru
     ({			\
 	unsigned long eip = 0;	 \
 	if ((tsk)->thread.esp0 > PAGE_SIZE && \
-	    (VALID_PAGE(virt_to_page((tsk)->thread.esp0)))) \
+	    (virt_to_valid_page((tsk)->thread.esp0))) \
 	      eip = ((struct pt_regs *) (tsk)->thread.esp0)->pc; \
 	eip; })
 #define	KSTK_ESP(tsk)	((tsk) == current ? rdusp() : (tsk)->thread.usp)
Index: include/asm-sh/pgalloc.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-sh/pgalloc.h,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 pgalloc.h
--- include/asm-sh/pgalloc.h	31 Jan 2002 22:15:51 -0000	1.1.1.2
+++ include/asm-sh/pgalloc.h	29 Apr 2002 19:11:43 -0000
@@ -105,9 +105,8 @@ static inline pte_t ptep_get_and_clear(p
 
 	pte_clear(ptep);
 	if (!pte_not_present(pte)) {
-		struct page *page = pte_page(pte);
-		if (VALID_PAGE(page)&&
-		    (!page->mapping || !(page->mapping->i_mmap_shared)))
+		struct page *page = pte_valid_page(pte);
+		if (page && (!page->mapping || !(page->mapping->i_mmap_shared)))
 			__clear_bit(PG_mapped, &page->flags);
 	}
 	return pte;
Index: mm/memory.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/memory.c,v
retrieving revision 1.1.1.9
diff -u -p -r1.1.1.9 memory.c
--- mm/memory.c	29 Apr 2002 17:30:38 -0000	1.1.1.9
+++ mm/memory.c	29 Apr 2002 20:38:17 -0000
@@ -76,8 +76,8 @@ mem_map_t * mem_map;
  */
 void __free_pte(pte_t pte)
 {
-	struct page *page = pte_page(pte);
-	if ((!VALID_PAGE(page)) || PageReserved(page))
+	struct page *page = pte_valid_page(pte);
+	if (!page || PageReserved(page))
 		return;
 	if (pte_dirty(pte))
 		set_page_dirty(page);		
@@ -278,9 +278,8 @@ skip_copy_pte_range:		address = (address
 					swap_duplicate(pte_to_swp_entry(pte));
 					goto cont_copy_pte_range;
 				}
-				ptepage = pte_page(pte);
-				if ((!VALID_PAGE(ptepage)) || 
-				    PageReserved(ptepage))
+				ptepage = pte_valid_page(pte);
+				if (!ptepage || PageReserved(ptepage))
 					goto cont_copy_pte_range;
 
 				/* If it's a COW mapping, write protect it both in the parent and the child */
@@ -356,8 +355,8 @@ static inline int zap_pte_range(mmu_gath
 		if (pte_none(pte))
 			continue;
 		if (pte_present(pte)) {
-			struct page *page = pte_page(pte);
-			if (VALID_PAGE(page) && !PageReserved(page))
+			struct page *page = pte_valid_page(pte);
+			if (page && !PageReserved(page))
 				freed ++;
 			/* This will eventually call __free_pte on the pte. */
 			tlb_remove_page(tlb, ptep, address + offset);
@@ -473,7 +472,7 @@ static struct page * follow_page(struct 
 	if (pte_present(pte)) {
 		if (!write ||
 		    (pte_write(pte) && pte_dirty(pte)))
-			return pte_page(pte);
+			return pte_valid_page(pte);
 	}
 
 out:
@@ -488,8 +487,6 @@ out:
 
 static inline struct page * get_page_map(struct page *page)
 {
-	if (!VALID_PAGE(page))
-		return 0;
 	return page;
 }
 
@@ -860,12 +857,12 @@ static inline void remap_pte_range(pte_t
 		end = PMD_SIZE;
 	do {
 		struct page *page;
-		pte_t oldpage;
+		pte_t oldpage, newpage;
 		oldpage = ptep_get_and_clear(pte);
-
-		page = virt_to_page(__va(phys_addr));
-		if ((!VALID_PAGE(page)) || PageReserved(page))
- 			set_pte(pte, mk_pte_phys(phys_addr, prot));
+		newpage = mk_pte_phys(phys_addr, prot);
+		page = pte_valid_page(newpage);
+		if (!page || PageReserved(page))
+ 			set_pte(pte, newpage);
 		forget_pte(oldpage);
 		address += PAGE_SIZE;
 		phys_addr += PAGE_SIZE;
@@ -978,8 +975,8 @@ static int do_wp_page(struct mm_struct *
 {
 	struct page *old_page, *new_page;
 
-	old_page = pte_page(pte);
-	if (!VALID_PAGE(old_page))
+	old_page = pte_valid_page(pte);
+	if (!old_page)
 		goto bad_wp_page;
 
 	if (!TryLockPage(old_page)) {
Index: mm/msync.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/msync.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 msync.c
--- mm/msync.c	14 Apr 2002 20:01:38 -0000	1.1.1.2
+++ mm/msync.c	29 Apr 2002 19:04:34 -0000
@@ -26,8 +26,8 @@ static int filemap_sync_pte(pte_t *ptep,
 	pte_t pte = *ptep;
 
 	if (pte_present(pte) && pte_dirty(pte)) {
-		struct page *page = pte_page(pte);
-		if (VALID_PAGE(page) && !PageReserved(page) && ptep_test_and_clear_dirty(ptep)) {
+		struct page *page = pte_valid_page(pte);
+		if (page && !PageReserved(page) && ptep_test_and_clear_dirty(ptep)) {
 			flush_tlb_page(vma, address);
 			set_page_dirty(page);
 		}
Index: mm/page_alloc.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/page_alloc.c,v
retrieving revision 1.1.1.8
diff -u -p -r1.1.1.8 page_alloc.c
--- mm/page_alloc.c	24 Apr 2002 19:31:04 -0000	1.1.1.8
+++ mm/page_alloc.c	29 Apr 2002 20:42:30 -0000
@@ -101,8 +101,6 @@ static void __free_pages_ok (struct page
 		BUG();
 	if (page->mapping)
 		BUG();
-	if (!VALID_PAGE(page))
-		BUG();
 	if (PageSwapCache(page))
 		BUG();
 	if (PageLocked(page))
@@ -294,8 +292,6 @@ static struct page * balance_classzone(z
 						BUG();
 					if (page->mapping)
 						BUG();
-					if (!VALID_PAGE(page))
-						BUG();
 					if (PageSwapCache(page))
 						BUG();
 					if (PageLocked(page))
@@ -474,7 +470,7 @@ void __free_pages(struct page *page, uns
 void free_pages(unsigned long addr, unsigned int order)
 {
 	if (addr != 0)
-		__free_pages(virt_to_page(addr), order);
+		__free_pages(virt_to_valid_page(addr), order);
 }
 
 /*
Index: mm/slab.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/slab.c,v
retrieving revision 1.1.1.5
diff -u -p -r1.1.1.5 slab.c
--- mm/slab.c	13 Mar 2002 21:16:16 -0000	1.1.1.5
+++ mm/slab.c	29 Apr 2002 20:44:21 -0000
@@ -1415,7 +1415,7 @@ alloc_new_slab_nolock:
 #if DEBUG
 # define CHECK_NR(pg)						\
 	do {							\
-		if (!VALID_PAGE(pg)) {				\
+		if (!pg) {					\
 			printk(KERN_ERR "kfree: out of range ptr %lxh.\n", \
 				(unsigned long)objp);		\
 			BUG();					\
@@ -1439,7 +1439,7 @@ static inline void kmem_cache_free_one(k
 {
 	slab_t* slabp;
 
-	CHECK_PAGE(virt_to_page(objp));
+	CHECK_PAGE(virt_to_valid_page(objp));
 	/* reduces memory footprint
 	 *
 	if (OPTIMIZE(cachep))
@@ -1519,7 +1519,7 @@ static inline void __kmem_cache_free (km
 #ifdef CONFIG_SMP
 	cpucache_t *cc = cc_data(cachep);
 
-	CHECK_PAGE(virt_to_page(objp));
+	CHECK_PAGE(virt_to_valid_page(objp));
 	if (cc) {
 		int batchcount;
 		if (cc->avail < cc->limit) {
@@ -1601,7 +1601,7 @@ void kmem_cache_free (kmem_cache_t *cach
 {
 	unsigned long flags;
 #if DEBUG
-	CHECK_PAGE(virt_to_page(objp));
+	CHECK_PAGE(virt_to_valid_page(objp));
 	if (cachep != GET_PAGE_CACHE(virt_to_page(objp)))
 		BUG();
 #endif
@@ -1626,7 +1626,7 @@ void kfree (const void *objp)
 	if (!objp)
 		return;
 	local_irq_save(flags);
-	CHECK_PAGE(virt_to_page(objp));
+	CHECK_PAGE(virt_to_valid_page(objp));
 	c = GET_PAGE_CACHE(virt_to_page(objp));
 	__kmem_cache_free(c, (void*)objp);
 	local_irq_restore(flags);
Index: mm/vmalloc.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/vmalloc.c,v
retrieving revision 1.1.1.5
diff -u -p -r1.1.1.5 vmalloc.c
--- mm/vmalloc.c	24 Apr 2002 19:31:04 -0000	1.1.1.5
+++ mm/vmalloc.c	29 Apr 2002 18:59:39 -0000
@@ -45,8 +45,8 @@ static inline void free_area_pte(pmd_t *
 		if (pte_none(page))
 			continue;
 		if (pte_present(page)) {
-			struct page *ptpage = pte_page(page);
-			if (VALID_PAGE(ptpage) && (!PageReserved(ptpage)))
+			struct page *ptpage = pte_valid_page(page);
+			if (ptpage && (!PageReserved(ptpage)))
 				__free_page(ptpage);
 			continue;
 		}
Index: mm/vmscan.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/vmscan.c,v
retrieving revision 1.1.1.7
diff -u -p -r1.1.1.7 vmscan.c
--- mm/vmscan.c	24 Apr 2002 19:31:04 -0000	1.1.1.7
+++ mm/vmscan.c	29 Apr 2002 18:57:37 -0000
@@ -206,9 +206,9 @@ static inline int swap_out_pmd(struct mm
 
 	do {
 		if (pte_present(*pte)) {
-			struct page *page = pte_page(*pte);
+			struct page *page = pte_valid_page(*pte);
 
-			if (VALID_PAGE(page) && !PageReserved(page)) {
+			if (page && !PageReserved(page)) {
 				count -= try_to_swap_out(mm, vma, address, pte, page, classzone);
 				if (!count) {
 					address += PAGE_SIZE;


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-29 13:35   ` Andrea Arcangeli
@ 2002-04-29 23:02     ` Daniel Phillips
  2002-05-01  2:23       ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-04-29 23:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, linux-kernel

On Monday 29 April 2002 15:35, Andrea Arcangeli wrote:
> On Sun, Apr 28, 2002 at 12:10:20AM +0200, Daniel Phillips wrote:
> > On Friday 26 April 2002 20:27, Russell King wrote:
> > > Hi,
> > > 
> > > I've been looking at some of the ARM discontigmem implementations, and
> > > have come across a nasty bug.  To illustrate this, I'm going to take
> > > part of the generic kernel, and use the Alpha implementation to
> > > illustrate the problem we're facing on ARM.
> > > 
> > > I'm going to argue here that virt_to_page() can, in the discontigmem
> > > case, produce rather a nasty bug when used with non-direct mapped
> > > kernel memory arguments.
> > 
> > It's tough to follow, even when you know the code.  While cooking my
> > config_nonlinear patch I noticed the line you're concerned about and
> > regarded it with deep suspicion.  My patch does this:
> > 
> > -               page = virt_to_page(__va(phys_addr));
> > +               page = phys_to_page(phys_addr);
> > 
> > And of course took care that phys_to_page does the right thing in all
> > cases.
> 
> The problem remains the same also going from phys to page, the problem
> is that it's not a contiguous mem_map and it choked when the phys addr
> was above the max ram physaddr. The patch I posted a few days ago will
> fix it (modulo unused ram space, but attempting to map into the
> address space unused ram space is a bug in the first place).

My config_nonlinear patch does not suffer from the above problem.  Here's the
code:

unsigned long vsection[MAX_SECTIONS];

static inline unsigned long phys_to_ordinal(phys_t p)
{
	return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT);
}

static inline struct page *phys_to_page(unsigned long p)
{
	return mem_map + phys_to_ordinal(p);
}

Nothing can go out of range.  Sensible, no?

> > <plug>
> > The new config_nonlinear was designed as a cleaner, more powerful
> > replacement for all non-numa uses of config_discontigmem.
> > </plug>
> 
> I maybe wrong because I only had a short look at it so far, but the
> "non-numa" is what I noticed too and that's what renders it not a very
> interesting option IMHO. Most discontigmem needs numa too.

I am, first and foremost, presenting config_nonlinear as a replacement for
config_discontig for *non-numa* uses of config_discontig.  (Sorry if I'm
repeating myself here.)

There are also applications in numa.  Please see the lse-tech archives for
details.  I expect that, by taking a fresh look at the numa code in the
light of new work, it can be cleaned up and simplified
considerably.  But that's "further work".  Config_nonlinear stands on its
own quite nicely.

> If it cannot
> handle numa it doesn't worth to add the complexity there,

It does not add complexity, it removes complexity.  Please read the patch
more closely.  It's very simple.  It's also more powerful than
config_discontig.

> with numa we must view those chunks differently, not linearly.

Correct.  Now, if you want to extend my patch to handle multiple mem_map
vectors, you do it by defining an ordinal_to_page and page_to_ordinal pair
of mappings.[1]  Don't you think this is a nicer way to organize things?

> Also there's nothing
> magic that says mem_map must have a magical meaning, doesn't worth to
> preserve the mem_map thing, virt_to_page is a much cleaner abstraction
> than doing mem_map + pfn by hand.

True.  The upcoming iteration of config_nonlinear moves all uses of
mem_map inside the per-arch page.h headers, so that mem_map need not
exist at all in configurations where there is no single mem_map.

[1] Since allocation needs to be aware of the separate zones,
_alloc_pages stays much as it is, but if we change all non-numa users
of config_discontig over to config_nonlinear then we can get rid of the
#ifdef CONFIG_NUMA's, by way of cleanup.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-29 22:00   ` Roman Zippel
@ 2002-04-30  0:43     ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-04-30  0:43 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Russell King, linux-kernel

On Tue, Apr 30, 2002 at 12:00:50AM +0200, Roman Zippel wrote:
> Hi,
> 
> On Sat, 27 Apr 2002, Andrea Arcangeli wrote:
> 
> > correct. This should fix it:
> > 
> > --- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~	Fri Apr 26 10:28:28 2002
> > +++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h	Sat Apr 27 00:30:02 2002
> > @@ -106,8 +106,8 @@
> >  #define kern_addr_valid(kaddr)	test_bit(LOCAL_MAP_NR(kaddr), \
> >  					 NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap)
> >  
> > -#define virt_to_page(kaddr)	(ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
> > -#define VALID_PAGE(page)	(((page) - mem_map) < max_mapnr)
> > +#define virt_to_page(kaddr)	(KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0)
> > +#define VALID_PAGE(page)	((page) != NULL)
> >  
> >  #ifdef CONFIG_NUMA
> >  #ifdef CONFIG_NUMA_SCHED
> 
> I'd prefer if VALID_PAGE would go away completely, that test was almost
> always too late. What about the patch below, it even reduces the code size

it is _always_ too late indeed, and I definitely agree with your
proposal to change the common code API; yours is a much saner API. But
that's a common code change call, my objective was to fix the arch part
without changing the common code, and after all my patch will work
exactly the same as yours; it's just that you make the page != NULL
check explicit while I still use VALID_PAGE instead. Yours can skip the
overflow check when we know the vaddr or the pte matches a valid ram
page, so it's a bit faster than my fix with discontigmem enabled. I'm
not sure whether it's worth changing that in 2.4, given that my
two-liner arch-contained patch will also work flawlessly. I have quite
a lot of stuff pending in 2.4 that makes a huge difference to users, so
I tend to prefer leaving the stuff that doesn't make a difference to
users for 2.5 only (it's a cleanup plus a minor discontigmem
optimization after all). So I recommend you push it to Linus after
fixing the bugs below.

> --- include/asm-i386/page.h	24 Feb 2002 23:11:41 -0000	1.1.1.3
> +++ include/asm-i386/page.h	29 Apr 2002 21:09:09 -0000
> @@ -132,7 +132,10 @@ static __inline__ int get_order(unsigned
>  #define __pa(x)			((unsigned long)(x)-PAGE_OFFSET)
>  #define __va(x)			((void *)((unsigned long)(x)+PAGE_OFFSET))
>  #define virt_to_page(kaddr)	(mem_map + (__pa(kaddr) >> PAGE_SHIFT))
> -#define VALID_PAGE(page)	((page - mem_map) < max_mapnr)
> +#define virt_to_valid_page(kaddr) ({ \
> +	unsigned long __paddr = __pa(kaddr); \
> +	__paddr < max_mapnr ? mem_map + (__paddr >> PAGE_SHIFT) : NULL; \
> +})
>  
>  #define VM_DATA_DEFAULT_FLAGS	(VM_READ | VM_WRITE | VM_EXEC | \
>  				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
> Index: include/asm-i386/pgtable-2level.h
> ===================================================================
> RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-2level.h,v
> retrieving revision 1.1.1.1
> diff -u -p -r1.1.1.1 pgtable-2level.h
> --- include/asm-i386/pgtable-2level.h	26 Nov 2001 19:29:55 -0000	1.1.1.1
> +++ include/asm-i386/pgtable-2level.h	29 Apr 2002 21:13:29 -0000
> @@ -57,6 +57,7 @@ static inline pmd_t * pmd_offset(pgd_t *
>  #define ptep_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
>  #define pte_same(a, b)		((a).pte_low == (b).pte_low)
>  #define pte_page(x)		(mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
> +#define pte_valid_page(x)	(pte_val(x) < max_mapnr ? pte_page(x) : NULL)
>  #define pte_none(x)		(!(x).pte_low)
>  #define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))
>  
> Index: include/asm-i386/pgtable-3level.h
> ===================================================================
> RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-3level.h,v
> retrieving revision 1.1.1.1
> diff -u -p -r1.1.1.1 pgtable-3level.h
> --- include/asm-i386/pgtable-3level.h	26 Nov 2001 19:29:55 -0000	1.1.1.1
> +++ include/asm-i386/pgtable-3level.h	29 Apr 2002 21:13:08 -0000
> @@ -87,6 +87,7 @@ static inline int pte_same(pte_t a, pte_
>  }
>  
>  #define pte_page(x)	(mem_map+(((x).pte_low >> PAGE_SHIFT) | ((x).pte_high << (32 - PAGE_SHIFT))))
> +#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL)
>  #define pte_none(x)	(!(x).pte_low && !(x).pte_high)
>  

max_mapnr is a pfn, not a physaddr; you're off by a factor of
2^PAGE_SHIFT. The fix is trivial, of course.

Andrea


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01  2:23       ` Andrea Arcangeli
@ 2002-04-30 23:12         ` Daniel Phillips
  2002-05-01  1:05           ` Daniel Phillips
  2002-05-02  0:47           ` Andrea Arcangeli
  2002-05-01 18:05         ` Jesse Barnes
  1 sibling, 2 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-04-30 23:12 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, linux-kernel

On Wednesday 01 May 2002 04:23, Andrea Arcangeli wrote:
> On Tue, Apr 30, 2002 at 01:02:05AM +0200, Daniel Phillips wrote:
> > My config_nonlinear patch does not suffer from the above problem.  Here's the
> > code:
> >
> > unsigned long vsection[MAX_SECTIONS];
> > 
> > static inline unsigned long phys_to_ordinal(phys_t p)
> > {
> > 	return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT);
> > }
> > 
> > static inline struct page *phys_to_page(unsigned long p)
> > {
> > 	return mem_map + phys_to_ordinal(p);
> > }
> > 
> > Nothing can go out of range.  Sensible, no?
> 
> Really the above vsection[p >> SECTION_SHIFT] will overflow in the very
> same case I fixed a few days ago for numa-alpha.

No it will not.  Note however that you do want to choose SECTION_SHIFT so
that the vsection table is not too large.

> The whole point is that
> p isn't a ram page and you assumed that (the alpha code was also assuming
> that and that's why it overflowed the same way as yours).  Either that
> or you're wasting some huge tons of ram with vsection on a 64bit arch.

No and no.  In fact, the vsection table for my current project is only 32
elements long.

> After the above out of range bug is fixed in practice it is the same as
> the current discontigmem, except that with the current way you can take
> the page structures in the right node with numa. And again I cannot see
> any advantage in having a contiguous mem_map even for archs with only
> discontigmem and non-numa

> (I think only ARM falls in such category, btw).

You would be wrong about that.

It's clear that you have not looked at the config_nonlinear patch closely,
and are not familiar with it.  I'll try to provide some help by enumerating
some similarities and differences below.  I'll apologize in advance for not
replying to your email point by point.  Sorry, there were just too many
points ;-)

Config_discontigmem
-------------------

Has exactly one purpose: to eliminate memory wastage due to struct pages
that refer to unpopulated regions of memory.  It does this by dividing
memory regions up into 'nodes', and each node is handled separately by
the memory manager, which attempts to balance allocations between them.

Config_discontig replicates however many zones there are across however
many discontiguous regions there are, so for purposes of allocation, we
end up with a two-dimensional array of zones (MAX_NR_ZONES * MAX_NR_NODES).

Config_discontigmem uses a table mapping in one direction: given a
virtual address, find a struct page in one of several ->mem_map arrays
indexed by the address, or compute a physical memory address by finding
a base physical address in an array indexed by the virtual address.
Conversion from physical address to struct page also requires a table
lookup, to locate the desired ->mem_map array.

Config_nonlinear
----------------

Config_nonlinear introduces a new, logical address space, and uses a pair
of tables, indexed by a few high bits of the address, one to map sections
of logical address space to sections of physical address space, and the
other to perform the reverse mapping.  This pair of tables is used to
define the usual set of address translation functions used to maintain
the process page tables, including the kernel virtual page tables.  The
real work of doing this translation is, of course, performed by the
address translation hardware - otherwise the bookkeeping cost of
config_nonlinear is comparable to or slightly better than
config_discontigmem.

Such things as bootmem allocations and VALID_PAGE checks are carried out
in the logical address space, which constitutes a considerable
simplification vs config_discontigmem.

Config_nonlinear was not designed as a replacement for numa address
management, however, it is compatible with it and there are numa
applications where config_nonlinear can create efficiencies that
config_discontigmem cannot.  That said, the rest of this discussion is
concerned with non-numa applications of config_nonlinear.

In the non-numa case, config_nonlinear does what config_discontigmem
does, that is, eliminates struct page memory wastage due to unpopulated
regions of memory, and in addition:

  - Can map a large range of physical memory into a small range of
    kernel virtual memory.  This becomes important when physical memory
    is installed at widely separated intervals.

  - Does not artificially divide memory into nodes, instead, joins it
    together in one homogeneous pool, which the memory manager divides
    into zones as *necessary* (for example, for highmem).

  - Sharply reduces the number of zones needing balancing vs
    config_discontigmem.  Please take a look at the non-numa code in
    _alloc_pages that attempts to balance between the 'artificial' nodes
    created by config_discontigmem.  It just cycles round robin between
    the nodes on each allocation, ignoring the relative
    availability of memory in the nodes.  This obvious deficiency could be
    fixed by adding more (finicky) code, or the problem can be eliminated
    completely, using config_nonlinear.

  - Has better locality of reference in the mapping tables, because the
    tables are compact (and could easily be made yet more compact than in
    the posted patch).  That is, each table entry in a config_discontig
    node array is 796 bytes, as opposed to 4 (or possibly 2 or 1) with
    config_nonlinear.

  - Eliminates two levels of procedure calls from the alloc_pages call
    chain.

  - Provides a simple model that is easily implemented across *all*
    architectures.  Please look at the config_discontigmem option and see
    how many architectures support it.  Hint: it is not enough just to
    add the option to config.in.

  - Leads forward to interesting possibilities such as hot plug memory.
    (Because pinned kernel memory can be remapped to an alternative
    region of physical memory if desired)

  - Cleans things up considerably.  It eliminates the unholy marriage
    between config_discontig for the purpose of working around physical
    memory holes and config_discontig for the purpose of numa allocation.
    For example, it eliminates a number of #ifdefs from the numa code,
    allowing the numa code to be developed in the way that is best for
    numa, instead of being hobbled by the need to serve a dual purpose.

It's easy to wave your hands at the idea that code should ever be cleaned up.
As an example of just how much the config_nonlinear patch cleans things up,
let's look at the ARM definition of VALID_PAGE, with config_discontigmem:

    #define KVADDR_TO_NID(addr) \
                    (((unsigned long)(addr) - 0xc0000000) >> 27)

    #define NODE_DATA(nid)          (&discontig_node_data[nid])

    #define NODE_MEM_MAP(nid)       (NODE_DATA(nid)->node_mem_map)

    #define VALID_PAGE(page) \
    ({ unsigned int node = KVADDR_TO_NID(page); \
       ( (node < NR_NODES) && \
         ((unsigned)((page) - NODE_MEM_MAP(node)) < NODE_DATA(node)->node_size) ); \
    })

With config_nonlinear (which does the same job as config_discontigmem in this
case) we get:

    static inline int VALID_PAGE(struct page *page)
    {
            return page - mem_map < max_mapnr;
    }

Isn't that nice?

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-30 23:12         ` Daniel Phillips
@ 2002-05-01  1:05           ` Daniel Phillips
  2002-05-02  0:47           ` Andrea Arcangeli
  1 sibling, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01  1:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, linux-kernel

On Wednesday 01 May 2002 01:12, I wrote:
> Config_discontigmem
> -------------------
> 
> Has exactly one purpose: to eliminate memory wastage due to struct pages
> that refer to unpopulated regions of memory...

That is, when not used together with config_numa.  When used with config_numa,
it has a second purpose: to allow ->mem_map arrays to exist on the same numa
node as the referenced pages.  The config_nonlinear patch could be extended
to handle this as well (by elaborating the definitions of virt_to_page and
phys_to_page) but I don't have plans to do that at this time.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  0:47           ` Andrea Arcangeli
@ 2002-05-01  1:26             ` Daniel Phillips
  2002-05-02  1:43               ` Andrea Arcangeli
  2002-05-02  2:37             ` William Lee Irwin III
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01  1:26 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, linux-kernel

On Thursday 02 May 2002 02:47, Andrea Arcangeli wrote:
> >   - Leads forward to interesting possibilities such as hot plug memory.
> >     (Because pinned kernel memory can be remapped to an alternative
> >     region of physical memory if desired)
> 
> You cannot handle hot plug with nonlinear, you cannot keep the mem_map
> contiguous when somebody plugs in new memory, you have to allocate the
> mem_map in the new node, discontigmem allows that, nonlinear doesn't.

You have not read and understood the patch, which this comment demonstrates.

For your information, the mem_map lives in *virtual* memory, it does not
need to change location, only the kernel page tables need to be updated,
to allow a section of kernel memory to be moved to a different physical
location.  For user memory, this was always possible, now it is possible
for kernel memory as well.  Naturally, it's not all you have to do to get
hotplug memory, but it's a big step in that direction.

> At the very least you should waste some tons of memory of unused mem_map
> for all the potential memory that you're going to plugin, if you want to
> handle hot-plug with nonlinear.

Eh.  No.

It's not useful for me to keep correcting you on your misunderstanding of
what config_nonlinear actually does.  Please read Jonathan Corbet's
excellent writeup on LWN; it's written in a very understandable fashion.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  0:20             ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
@ 2002-05-01  1:35               ` Daniel Phillips
  2002-05-02  1:45                 ` William Lee Irwin III
  2002-05-02  1:46                 ` Andrea Arcangeli
  2002-05-02  1:01               ` Andrea Arcangeli
                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01  1:35 UTC (permalink / raw)
  To: Anton Blanchard, Andrea Arcangeli
  Cc: Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
> 
> From arch/ppc64/kernel/iSeries_setup.c:
> 
>  * The iSeries may have very large memories ( > 128 GB ) and a partition
>  * may get memory in "chunks" that may be anywhere in the 2**52 real
>  * address space.  The chunks are 256K in size.
> 
> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> solution to this problem.

Using the config_nonlinear model, you'd change the four mapping functions:

	logical_to_phys
	phys_to_logical
	pagenum_to_phys
	phys_to_pagenum

to use a hash table instead of a table lookup.  Bill Irwin suggested a btree
would work here as well.

(Note I'm trying out the term 'pagenum' instead of 'ordinal' here, following
comments on lse-tech.)

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  1:46                 ` Andrea Arcangeli
@ 2002-05-01  1:56                   ` Daniel Phillips
  0 siblings, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01  1:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Anton Blanchard, Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 03:46, Andrea Arcangeli wrote:
> On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> > On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > > > so ia64 is one of those archs with a ram layout with huge holes in the
> > > > middle of the ram of the nodes? I'd be curious to know what's the
> > > > hardware advantage of designing the ram layout in such a way, compared
> > > > to all other numa archs that I deal with. Also if you know other archs
> > > > with huge holes in the middle of the ram of the nodes I'd be curious to
> > > > know about them too. thanks for the interesting info!
> > > 
> > > From arch/ppc64/kernel/iSeries_setup.c:
> > > 
> > >  * The iSeries may have very large memories ( > 128 GB ) and a partition
> > >  * may get memory in "chunks" that may be anywhere in the 2**52 real
> > >  * address space.  The chunks are 256K in size.
> > > 
> > > Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> > > solution to this problem.
> > 
> > Using the config_nonlinear model, you'd change the four mapping functions:
> > 
> > 	logical_to_phys
> > 	phys_to_logical
> > 	pagenum_to_phys
> > 	phys_to_pagenum
> > 
> > to use a hash table instead of a table lookup.  Bill Irwin suggested a btree
> > would work here as well.
> 
> btree? btree is not an interesting in core data structure.

Well, I didn't really like the btree for this application either, but I see
his point.

> Anyways you
> can use a btree with discontigmem too for the lookup. nonlinear will pay
> off if you've something of the order of 256 discontigmem chunks with
> significant holes in between like origin 2k, and I think it should be
> resolved internally to the arch without exposing it to the common code.

Those mapping functions are all defined per-arch, in page.h.  The only part
of this patch that affects the common code is the new distinction between
logical and physical address spaces (which are the same when the option
isn't enabled).

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  1:45                 ` William Lee Irwin III
@ 2002-05-01  2:02                   ` Daniel Phillips
  2002-05-02  2:33                     ` William Lee Irwin III
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01  2:02 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel,
	Jesse Barnes

On Thursday 02 May 2002 03:45, William Lee Irwin III wrote:
> On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> > to use a hash table instead of a table lookup.  Bill Irwin suggested a btree
> > would work here as well.
> 
> I remember suggesting a sorted array of extents on which binary
> search could be performed. A B-tree seems unlikely but perhaps if
> it were contiguously allocated and some other tricks done it might
> do, maybe I don't remember the special sauce used for the occasion.

Thanks for the correction.  When you said 'extents' I automatically thought
'btree of extents'.  I'd tend to go for the hash table anyway - your binary
search is going to take quite a few more steps to terminate than the bucket
search, given some reasonable choice of hash table size.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-29 23:02     ` Daniel Phillips
@ 2002-05-01  2:23       ` Andrea Arcangeli
  2002-04-30 23:12         ` Daniel Phillips
  2002-05-01 18:05         ` Jesse Barnes
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-01  2:23 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, linux-kernel

On Tue, Apr 30, 2002 at 01:02:05AM +0200, Daniel Phillips wrote:
> My config_nonlinear patch does not suffer from the above problem.  Here's the
> code:
>
> unsigned long vsection[MAX_SECTIONS];
> 
> static inline unsigned long phys_to_ordinal(phys_t p)
> {
> 	return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT);
> }
> 
> static inline struct page *phys_to_page(unsigned long p)
> {
> 	return mem_map + phys_to_ordinal(p);
> }
> 
> Nothing can go out of range.  Sensible, no?

Really, the above vsection[p >> SECTION_SHIFT] will overflow in the very
same case I fixed a few days ago for numa-alpha. The whole point is that
p isn't a ram page, and you assumed it is (the alpha code was assuming
that too, and that's why it overflowed the same way yours does). Either
that, or you're wasting huge amounts of ram on vsection on a 64-bit arch.

After the above out-of-range bug is fixed, in practice it is the same as
the current discontigmem, except that with the current way you can keep
the page structures in the right node with numa. And again I cannot see
any advantage in having a contiguous mem_map even for archs with only
discontigmem and non-numa (I think only ARM falls into that category, btw).

> > > <plug>
> > > The new config_nonlinear was designed as a cleaner, more powerful
> > > replacement for all non-numa uses of config_discontigmem.
> > > </plug>
> > 
> > I may be wrong because I only had a short look at it so far, but the
> > "non-numa" is what I noticed too, and that's what renders it not a very
> > interesting option IMHO. Most discontigmem needs numa too.
> 
> I am, first and foremost, presenting config_nonlinear as a replacement for
> config_discontig for *non-numa* uses of config_discontig.  (Sorry if I'm
> repeating myself here.)
> 
> There are also applications in numa.  Please see the lse-tech archives for
> details.  I expect that, by taking a fresh look at the numa code in the
> light of new work, it can be cleaned up and simplified
> considerably.  But that's "further work".  Config_nonlinear stands on its
> own quite nicely.

Tell me how an ARM machine will run faster with nonlinear: it is doing
nearly the same thing, except it's a lesser abstraction that forces a
contiguous mem_map. The current code is much more powerful and carries
more information (the pgdat describes the whole memory topology to the
common code), and it's not going to be slower, so I don't see why we
should complicate the code with nonlinear. Personally I hate having more
than one way of doing the same thing when there's no need for it; the
fewer the ways, the less you have to keep in mind, the simpler it is to
understand, the better (partly offtopic, but for the very same reason,
when I work in userspace I much prefer coding in Python to Perl).

> > If it cannot
> > handle numa it isn't worth adding the complexity there,
> 
> It does not add complexity, it removes complexity.  Please read the patch
> more closely.  It's very simple.  It's also more powerful than
> config_discontig.

How? I may be overlooking something, but I would say it's anything but
more powerful. I don't see any "power" point in trying to keep the mem_map
contiguous. Please don't tell me it's more powerful, just tell me why.

> > with numa we must view those chunks differently, not linearly.
> 
> Correct.  Now, if you want to extend my patch to handle multiple mem_map
> vectors, you do it by defining an ordinal_to_page and page_to_ordinal pair
> of mappings.[1]  Don't you think this is a nicer way to organize things?

What's the advantage? And once you can have more than one mem_map,
once you've added this "vector", each mem_map will match a
discontigmem pgdat. Tell me a numa machine where there's a hole in the
middle of a node. The holes are always inter-node, never within the
nodes themselves. So nonlinear-numa should fall back to the straight
mem_map array pointed to by the pgdat all the time, just as it is right now.

The only advantage of nonlinear I can see would be a machine with a
huge hole in a node: then with nonlinear you could avoid wasting mem_map
for this hole without having to add another pgdat that would
otherwise break numa assumptions on the pgdat. But I'm not aware of any
machine with holes on the order of gigabytes in the middle of a
node; at the very least, if that happens it means the hardware of the
machine is misconfigured.

The very same problem would happen right now on x86 if there were a
huge hole in the physical ram: say you have 128M of ram, then a hole
of 63G, and then the other physical 900M at offset 63G+128M. It will
never happen; that's broken hardware if you see anything like that.

At the very least I would wait for somebody with hardware so weird that
it intentionally does the above to ask, instead of overdesigning common
code abstractions, and there would also be other ways to deal with such
a situation without requiring a contiguous mem_map.

> > Also there's nothing
> > magic that says mem_map must have a magical meaning; it isn't worth
> > preserving the mem_map thing, virt_to_page is a much cleaner abstraction
> > than doing mem_map + pfn by hand.
> 
> True.  The upcoming iteration of config_nonlinear moves all uses of
> mem_map inside the per-arch page.h headers, so that mem_map need not
> exist at all in configurations where there is no single mem_map.

That's fine, and all correct kernel code already does that: nobody
is allowed to use mem_map in any common code anywhere (besides
the mm proper internals when discontigmem is disabled).

Andrea


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  1:43               ` Andrea Arcangeli
@ 2002-05-01  2:41                 ` Daniel Phillips
  2002-05-02 13:34                   ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01  2:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, linux-kernel

On Thursday 02 May 2002 03:43, Andrea Arcangeli wrote:
> On Wed, May 01, 2002 at 03:26:22AM +0200, Daniel Phillips wrote:
> > For your information, the mem_map lives in *virtual* memory, it does not
> > need to change location, only the kernel page tables need to be updated,
> > to allow a section of kernel memory to be moved to a different physical
> > location.  For user memory, this was always possible, now it is possible
> > for kernel memory as well.  Naturally, it's not all you have to do to get
> > hotplug memory, but it's a big step in that direction.
> 
> what kernel pagetables?

The normal page tables that are used to map kernel memory.

> pagetables for space that you left free for what?

These page tables have not been left free for anything.  The nice thing about
page tables is that you can change the page table entries to point wherever
you want.  (I know you know this.)  This is what config_nonlinear supports,
and that is why it's called config_nonlinear.  When we want to remap part of
kernel memory to a different piece of physical memory, we just need to
fill in some PTEs.  The tricky part is knowing how to fill in the PTEs, and
config_nonlinear takes care of that.

> You waste virtual space for that, at the very least on x86 where it is
> already very tight; at this point kernel virtual space is more costly than
> physical space these days. And nevertheless most archs don't have
> pagetables at all to read and write the page structures. Yes, it's
> virtual memory, but it's a direct mapping.

Most architectures?  That's quite possibly an exaggeration.  Some
architectures - MIPS32 for example - make this difficult or impossible,
and so what?  Those can't do software hotplug memory, sorry.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  2:33                     ` William Lee Irwin III
@ 2002-05-01  2:44                       ` Daniel Phillips
  0 siblings, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01  2:44 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel,
	Jesse Barnes

On Thursday 02 May 2002 04:33, William Lee Irwin III wrote:
> Actually, now that I think about it, a contiguously-allocated B-tree of
> extents doesn't sound bad at all, even without additional dressing. Do
> you think it's worth a try?

If it solves a problem on a real machine, certainly.

-- 
Daniel


* Re: discontiguous memory platforms
  2002-05-02  8:50                   ` Roman Zippel
@ 2002-05-01 13:21                     ` Daniel Phillips
  2002-05-02 14:00                       ` Roman Zippel
  2002-05-02 18:35                     ` Geert Uytterhoeven
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01 13:21 UTC (permalink / raw)
  To: Roman Zippel, Andrea Arcangeli; +Cc: Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 10:50, Roman Zippel wrote:
> Hi,
> 
> On Thu, 2 May 2002, Andrea Arcangeli wrote:
> 
> > What I
> > care about is not to clobber the common code with additional overlapping
> > common code abstractions.
> 
> Just to throw in an alternative: On m68k we currently map everything
> together into a single virtual area. This means the virtual<->physical
> conversion is a bit more expensive and mem_map is simply indexed by
> the virtual address.

Are you talking about mm_ptov and friends here?  What are the loops for?
Could you please describe the most extreme case of physical discontiguity
you're handling?

> It works nicely, it just needs two small patches in the initializition
> code, which aren't integrated yet. I think it's very close to what Daniel
> wants, only that the logical and virtual address are identical.

Yes: since logical and virtual kernel addresses in config_nonlinear differ
only by a constant (PAGE_OFFSET), setting the constant to zero gives
me your case.

-- 
Daniel


* Re: discontiguous memory platforms
  2002-05-02 14:00                       ` Roman Zippel
@ 2002-05-01 14:08                         ` Daniel Phillips
  2002-05-02 17:56                           ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01 14:08 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 16:00, Roman Zippel wrote:
> Hi,
> 
> On Wed, 1 May 2002, Daniel Phillips wrote:
> 
> > > Just to throw in an alternative: On m68k we map currently everything
> > > together into a single virtual area. This means the virtual<->physical
> > > conversion is a bit more expensive and mem_map is simply indexed by the
> > > the virtual address.
> > 
> > Are you talking about mm_ptov and friends here?  What are the loops for?
> 
> It simply searches through all memory nodes, it's not really efficient.
> 
> > Could you please describe the most extreme case of physical discontiguity
> > you're handling?
> 
> I can't assume anything. I'm thinking about calculating the table
> dynamically and patching the kernel at bootup; we are already doing
> something similar in the Amiga/ppc kernel.

Maybe this is a good place to try out a hash table variant of
config_nonlinear.  It's got to be more efficient than searching all the
nodes, don't you think?

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:35                       ` Andrea Arcangeli
@ 2002-05-01 15:42                         ` Daniel Phillips
  2002-05-02 16:06                           ` Andrea Arcangeli
  2002-05-02 16:07                         ` Martin J. Bligh
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01 15:42 UTC (permalink / raw)
  To: Andrea Arcangeli, Martin J. Bligh; +Cc: Russell King, linux-kernel

On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote:
> On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote:
> > At the moment I use the contig memory model (so we only use discontig for
> > NUMA support) but I may need to change that in the future.
> 
> I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into
> the current discontigmem-numa model too as far I can see.

No it doesn't.  The config_discontigmem model forces all zone_normal memory
to be on node zero, so all the remaining nodes can only have highmem locally.
Even with good cache hardware, this has to hurt.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:28                 ` Anton Blanchard
@ 2002-05-01 16:10                   ` Daniel Phillips
  2002-05-02 15:59                   ` Dave Engebretsen
  2002-05-02 16:31                   ` William Lee Irwin III
  2 siblings, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01 16:10 UTC (permalink / raw)
  To: Anton Blanchard, Andrea Arcangeli
  Cc: Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 17:28, Anton Blanchard wrote:
> > is this machine a numa machine? If not then discontigmem will work just
> > fine. Also it's a matter of administration: even if it's a numa machine
> > you can use it just optimally with discontigmem+numa. Regardless of what
> > we do, if the partitioning is bad the kernel will do badly. If you create
> > a zillion discontiguous nodes of 256K each, you'd need to waste memory to
> > handle them regardless of nonlinear or discontigmem (with discontigmem
> > you will waste more memory than nonlinear, yes, exactly because it's more
> > powerful, but I think a machine with a huge lot of non-contiguous 256K
> > chunks is misconfigured; it's like pretending to install linux on a
> > machine after you partitioned the HD into thousands of logical volumes
> > 256K large each [for the sake of this example let's assume there are
> > more than 256 LVs available in LVM]. A sane partitioning requires you to
> > have at least a 1 giga partition for /usr; it depends what you're
> > doing of course, but requiring sane partitioning is an admin problem,
> > not a kernel problem IMHO).
> 
> It's not a NUMA machine; it's one that allows shared-processor logical
> partitions. While I would prefer the hypervisor to give each partition
> a nice memory map (and internally maintain a mapping to real memory),
> it does not. I can imagine that if the machine has been up for many months
> memory could become very fragmented.
> 
> Also, when we do hotplug memory support, will discontigmem be able to
> efficiently handle memory turning up all over the place in the memory
> map?

My proposal for support of extremely fragmented physical memory maps is
to use a hash table instead of a direct table lookup in the following
four functions:

    logical_to_phys
    phys_to_logical
    ordinal_to_phys
    phys_to_ordinal
   
With the page.h organization:

#ifdef CONFIG_NONLINEAR
  #ifdef CONFIG_NONLINEAR_HASH
     <the hash table versions of above>
  #else
     <the direct table mappings>
  #endif
#else
  /* Stub definitions */
  #define logical_to_phys(p) (p)
  #define phys_to_logical(p) (p)
  #define ordinal_to_phys(n) ((n) << PAGE_SHIFT)
  #define phys_to_ordinal(p) ((p) >> PAGE_SHIFT)
#endif

The hash tables need only be updated when the memory configuration changes.

In fact, we may well need the hash table in only one direction, in the
case that the virtual memory map is less fragmented than physical memory.
Then we can use a direct table to go in the other direction, and we might want:

#ifdef CONFIG_NONLINEAR
  #ifdef CONFIG_NONLINEAR_PHASH
     <the hash table versions of above>
  #else
     <the direct table mappings>
  #endif
#else
  #define phys_to_logical(p) (p)
  #define phys_to_ordinal(p) ((p) >> PAGE_SHIFT)
#endif

#ifdef CONFIG_NONLINEAR
  #ifdef CONFIG_NONLINEAR_VHASH
     <the hash table versions of above>
  #else
     <the direct table mappings>
  #endif
#else
  #define logical_to_phys(p) (p)
  #define ordinal_to_phys(n) ((n) << PAGE_SHIFT)
#endif

These are all per-arch, though one of my goals is to reduce the
difference between arches, where it doesn't involve any compromise.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:59                   ` Dave Engebretsen
@ 2002-05-01 17:24                     ` Daniel Phillips
  2002-05-02 16:44                       ` Dave Engebretsen
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01 17:24 UTC (permalink / raw)
  To: Dave Engebretsen, linux-kernel

On Thursday 02 May 2002 17:59, Dave Engebretsen wrote:
> Anton Blanchard wrote:
> > Also when we do hotplug memory support will discontigmem be able to
> > efficiently handle memory turning up all over the place in the memory
> > map?
>
> On this type of partitioned system where ppc64 runs, there is not much
> administration that could be done to help the problem.  As Anton
> mentioned, when the system has been up for a long time, and memory has
> been moving between partitions which support dynamic memory movement, it
> is assured that memory will become very fragmented.  As more partitions
> on these systems become available, and resources migrate more freely,
> the problem will get worse.
> 
> Whether this management from kernel to hardware addresses is done in the
> hypervisor layer or the OS, the same overhead exists, given today's
> hardware structure for PowerPC servers anyway.  In today's ppc64
> implementation, we just use an array to map from what the kernel sees as
> its address space to what is put in the hardware page table and I/O
> translation tables, thus not requiring any changes in independent code. 
> This does consume some storage, but the highly fragmented nature of our
> platform memory drives this decision.  I would like to see that data
> structure decision left to the archs, as different platform design points
> may lead to different mapping decisions.

And it is left up to the arch in my patch.  I've simply imposed a little more
order on what, up till now, has been a pretty chaotic corner of the kernel,
and provided a template that satisfies a wider variety of needs than the old
one.

It sounds like the table translation you're doing in the hypervisor is
exactly what I've implemented in the kernel.  One advantage of going with
the kernel's implementation is that you get the benefit of improvements
made to it, for example, the proposed hashing scheme to handle extremely
fragmented physical memory maps.

-- 
Daniel


* Re: discontiguous memory platforms
  2002-05-02 17:56                           ` Roman Zippel
@ 2002-05-01 17:59                             ` Daniel Phillips
  2002-05-02 18:26                               ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-01 17:59 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 19:56, Roman Zippel wrote:
> Daniel Phillips wrote:
> 
> > Maybe this is a good place to try out a hash table variant of
> > config_nonlinear.  It's got to be more efficient than searching all the
> > nodes, don't you think?
> 
> Most of the time there are only a few nodes, I just don't know where and
> how big they are, so I don't think a hash based approach will be a lot
> faster. When I'm going to change this, I'd rather try the dynamic table
> approach.

Which dynamic table approach is that?

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01  2:23       ` Andrea Arcangeli
  2002-04-30 23:12         ` Daniel Phillips
@ 2002-05-01 18:05         ` Jesse Barnes
  2002-05-01 23:17           ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: Jesse Barnes @ 2002-05-01 18:05 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Wed, May 01, 2002 at 04:23:41AM +0200, Andrea Arcangeli wrote:
> > What's the advantage? And once you can have more than one mem_map,
> > once you've added this "vector", each mem_map will match a
> > discontigmem pgdat. Tell me a numa machine where there's a hole in the
> > middle of a node. The holes are always inter-node, never within the
> > nodes themselves. So nonlinear-numa should fall back to the straight

Just FYI, there _are_ many NUMA machines with memory holes in the
middle of a node.  Check out the discontig patch at
http://sf.net/projects/discontig for more info.

Jesse


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01 18:05         ` Jesse Barnes
@ 2002-05-01 23:17           ` Andrea Arcangeli
  2002-05-01 23:23             ` discontiguous memory platforms Jesse Barnes
  2002-05-02  0:20             ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-01 23:17 UTC (permalink / raw)
  To: Daniel Phillips, Russell King, linux-kernel; +Cc: Jesse Barnes

On Wed, May 01, 2002 at 11:05:47AM -0700, Jesse Barnes wrote:
> On Wed, May 01, 2002 at 04:23:41AM +0200, Andrea Arcangeli wrote:
> > What's the advantage? And once you can have more than one mem_map,
> > once you've added this "vector", each mem_map will match a
> > discontigmem pgdat. Tell me a numa machine where there's a hole in the
> > middle of a node. The holes are always inter-node, never within the
> > nodes themselves. So nonlinear-numa should fall back to the straight
> 
> Just FYI, there _are_ many NUMA machines with memory holes in the
> middle of a node.  Check out the discontig patch at
> http://sf.net/projects/discontig for more info.

so ia64 is one of those archs with a ram layout with huge holes in the
middle of the ram of the nodes? I'd be curious to know what's the
hardware advantage of designing the ram layout in such a way, compared
to all other numa archs that I deal with. Also if you know other archs
with huge holes in the middle of the ram of the nodes I'd be curious to
know about them too. thanks for the interesting info!

Andrea


* Re: discontiguous memory platforms
  2002-05-01 23:17           ` Andrea Arcangeli
@ 2002-05-01 23:23             ` Jesse Barnes
  2002-05-02  0:51               ` Ralf Baechle
  2002-05-02  0:20             ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
  1 sibling, 1 reply; 165+ messages in thread
From: Jesse Barnes @ 2002-05-01 23:23 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 01:17:50AM +0200, Andrea Arcangeli wrote:
> so ia64 is one of those archs with a ram layout with huge holes in the
> middle of the ram of the nodes? I'd be curious to know what's the

Well, our ia64 platform is at least, but I think there are others.

> hardware advantage of designing the ram layout in such a way, compared
> to all other numa archs that I deal with. Also if you know other archs
> with huge holes in the middle of the ram of the nodes I'd be curious to
> know about them too. thanks for the interesting info!

AFAIK, some MIPS platforms (both NUMA and non-NUMA) have memory
layouts like this too.  I've never done hardware design before, so I'm
not sure if there's a good reason for such layouts.  Ralf or Daniel
might be able to shed some more light on that...

Jesse


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01 23:17           ` Andrea Arcangeli
  2002-05-01 23:23             ` discontiguous memory platforms Jesse Barnes
@ 2002-05-02  0:20             ` Anton Blanchard
  2002-05-01  1:35               ` Daniel Phillips
                                 ` (3 more replies)
  1 sibling, 4 replies; 165+ messages in thread
From: Anton Blanchard @ 2002-05-02  0:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

 
> so ia64 is one of those archs with a ram layout with huge holes in the
> middle of the ram of the nodes? I'd be curious to know what's the
> hardware advantage of designing the ram layout in such a way, compared
> to all other numa archs that I deal with. Also if you know other archs
> with huge holes in the middle of the ram of the nodes I'd be curious to
> know about them too. thanks for the interesting info!

From arch/ppc64/kernel/iSeries_setup.c:

 * The iSeries may have very large memories ( > 128 GB ) and a partition
 * may get memory in "chunks" that may be anywhere in the 2**52 real
 * address space.  The chunks are 256K in size.

Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
solution to this problem.

Anton


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-04-30 23:12         ` Daniel Phillips
  2002-05-01  1:05           ` Daniel Phillips
@ 2002-05-02  0:47           ` Andrea Arcangeli
  2002-05-01  1:26             ` Daniel Phillips
  2002-05-02  2:37             ` William Lee Irwin III
  1 sibling, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02  0:47 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, linux-kernel

On Wed, May 01, 2002 at 01:12:48AM +0200, Daniel Phillips wrote:
> On Wednesday 01 May 2002 04:23, Andrea Arcangeli wrote:
> > On Tue, Apr 30, 2002 at 01:02:05AM +0200, Daniel Phillips wrote:
> > > My config_nonlinear patch does not suffer from the above problem.  Here's the
> > > code:
> > >
> > > unsigned long vsection[MAX_SECTIONS];
> > > 
> > > static inline unsigned long phys_to_ordinal(phys_t p)
> > > {
> > > 	return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT);
> > > }
> > > 
> > > static inline struct page *phys_to_page(unsigned long p)
> > > {
> > > 	return mem_map + phys_to_ordinal(p);
> > > }
> > > 
> > > Nothing can go out of range.  Sensible, no?
> > 
> > Really the above vsection[p >> SECTION_SHIFT] will overflow in the very
> > same case I fixed a few days ago for numa-alpha.
> 
> No it will not.  Note however that you do want to choose SECTION_SHIFT so
> that the vsection table is not too large.

You cannot choose SECTION_SHIFT, the hardware will define it for you.

A 64-bit arch will get discontiguous, for example, in 64G chunks (a
real-world example, actually), so your SECTION_SHIFT will be equal to 36 and
you will overflow as I said in the previous email (just like discontigmem in
mainline did), unless you want to waste a huge amount of ram on the table
(with 2^64/64G entries, i.e. sizeof(long) * 2^(64-36) bytes).

> > The whole point is that
> > p isn't a ram page and you assumed that (the alpha code was also assuming
> > that and that's why it overflowed the same way as yours).  Either that
> > or you're wasting some huge tons of ram with vsection on a 64bit arch.
> 
> No and no.  In fact, the vsection table for my current project is only 32
> elements long.

See above.

>     created by config_discontigmem.  It just cycles round robin
>     between the nodes on each allocation, ignoring the relative

Forget mainline. Look at 2.4.19pre7aa3 _only_ when you look into numa;
there are a huge number of fixes in that area, also from Samuel Ortiz
and others. Before I even consider pushing those fixes into mainline
(btw, they are cleanly separated into orthogonal patches, not mixed with
the other stuff), I will need to see the other vm updates that everybody
deals with included (only a limited number of users is affected by numa
issues).

>   - Has better locality of reference in the mapping tables, because the
>     tables are compact (and could easily be made yet more compact than in
>     the posted patch).  That is, each table entry in a config_discontig
>     node array is 796 bytes, as opposed to 4 (or possibly 2 or 1) with
>     config_nonlinear.

Oh yeah, you save 1 microsecond every 10 years of uptime by taking
advantage of the potentially coalesced cacheline between the last page
in a node and the first page of the next node. Before you can care about
these optimizations you should remove from x86 the pgdat loops that are
not needed with discontigmem disabled, as on x86 (this has nothing to
do with discontigmem/nonlinear). That wouldn't be measurable either, but
at least it would be more worthwhile.

>   - Eliminates two levels of procedure calls from the alloc_pages call
>     chain.

Again, look -aa, not mainline.

>   - Provides a simple model that is easily implemented across *all*

I don't see much simplicity; it's only weaker, I think.

>     architectures.  Please look at the config_discontigmem option and see
>     how many architectures support it.  Hint: it is not enough just to
>     add the option to config.in.
> 
>   - Leads forward to interesting possibilities such as hot plug memory.
>     (Because pinned kernel memory can be remapped to an alternative
>     region of physical memory if desired)

You cannot handle hotplug with nonlinear: you cannot keep the mem_map
contiguous when somebody plugs in new memory. You have to allocate the
mem_map in the new node; discontigmem allows that, nonlinear doesn't.
At the very least you would have to waste tons of memory on unused mem_map
for all the potential memory that you're going to plug in, if you want to
handle hotplug with nonlinear.

Breaking up the limitation of the contiguous mem_map has been one of the
goals achieved with 2.4; there is no significant advantage (but only
the old limitations) in trying to coalesce it again.

>   - Cleans things up considerably.  It eliminates the unholy marriage of

It clobbers things considerably because it overlaps another, more powerful
functionality that is needed anyway for hotplug of ram and numa.

>     config_discontig-for-the-purpose of working around physical memory
>     holes and config_discontig-for-the-purpose of numa allocation.  For
>     example, eliminates a number of #ifdefs from the numa code, allowing 
>     the numa code to be developed in the way that is best for numa,
>     instead of being hobbled by the need to serve a dual purpose.
> 
> It's easy to wave your hands at the idea that code should ever be cleaned up.
> As an example of just how much the config_nonlinear patch cleans things up,
> let's look at the ARM definition of VALID_PAGE, with config_discontigmem:
> 
>     #define KVADDR_TO_NID(addr) \
>                     (((unsigned long)(addr) - 0xc0000000) >> 27)
> 
>     #define NODE_DATA(nid)          (&discontig_node_data[nid])
> 
>     #define NODE_MEM_MAP(nid)       (NODE_DATA(nid)->node_mem_map)
> 
>     #define VALID_PAGE(page) \
>     ({ unsigned int node = KVADDR_TO_NID(page); \
>        ( (node < NR_NODES) && \
>          ((unsigned)((page) - NODE_MEM_MAP(node)) < NODE_DATA(node)->node_size) ); \
>     })
> 
> With config_nonlinear (which does the same job as config_discontigmem in this
> case) we get:
> 
>     static inline int VALID_PAGE(struct page *page)
>     {
>             return page - mem_map < max_mapnr;
>     }
> 
> Isn't that nice?

It isn't nicer to my eyes. It cannot handle a non-contiguous mem_map,
a showstopper for hotplug ram and numa, and secondly VALID_PAGE must go
away in the first place. The rest of the NODE_MEM_MAP(node) is
completely equivalent to your phys_to_ordinal, just in a different place
and capable of handling a discontiguous mem_map too.

For my tree I'm not going to include it for now. From my current
understanding of the thing, the only ones that could ask for it are the
ia64 folks with huge holes in the middle of the ram of the nodes, who may
prefer to hide their intra-node discontigmem stuff behind the numa layer
to avoid confusing the numa heuristics (still, I don't know how big those
holes are, so it also depends on that whether they really need it), so I
will wait for them to ask for it before considering an inclusion. If we
need to complicate the MM code a lot, I need somebody to ask for it with
some valid reason; your "cleanup" argument doesn't apply here IMHO (this
is not a change like reshaping the IDE code while remaining completely
functionally equivalent). Until somebody asks for it with valid
arguments, this remains an overlapping, unneeded and complex
functionality to my eyes.

(btw, the 640k-1M hole is due to backwards-compatibility stuff, so at
least it had a good reason for it, and it's such a small hole after all
that we don't even care to skip it in the mem_map array on a 4M box)

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-01 23:23             ` discontiguous memory platforms Jesse Barnes
@ 2002-05-02  0:51               ` Ralf Baechle
  2002-05-02  1:27                 ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Ralf Baechle @ 2002-05-02  0:51 UTC (permalink / raw)
  To: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel

On Wed, May 01, 2002 at 04:23:43PM -0700, Jesse Barnes wrote:

> On Thu, May 02, 2002 at 01:17:50AM +0200, Andrea Arcangeli wrote:
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
> 
> Well, our ia64 platform is at least, but I think there are others.
> 
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
> 
> AFAIK, some MIPS platforms (both NUMA and non-NUMA) have memory
> layouts like this too.  I've never done hardware design before, so I'm
> not sure if there's a good reason for such layouts.  Ralf or Daniel
> might be able to shed some more light on that...

Just to give a few examples of memory layouts on MIPS systems, the SiByte
1250 is as follows:

 - 256MB at physical address 0
 - 512MB at physical address 0x80000000
 - 256MB at physical address 0xc0000000
 - The entire rest of the memory is mapped contiguously from physical
   address 0x1:00000000 up.
 All available memory is mapped from the lowest address up.

Origin 200/2000.  Each node has an address space of 2GB and 4 memory
  banks, that is, each bank takes 512MB of address space.  Even
  unpopulated or partially populated banks take the full 512MB of address
  space.  Memory in partially populated banks is mapped at the beginning
  of the bank's address space; each node must have at least one
  bank with memory in it, so something like

 - 32MB @ physical address 0x00:00000000
 - 32MB @ physical address 0x00:80000000
 - 32MB @ physical address 0x01:00000000
 ...
 - 32MB @ physical address 0x7f:00000000

  would be a valid configuration.  That's 8GB of RAM scattered in tiny
  chunks of just 32MB throughout roughly 512GB of address space.  In
  theory nodes might not even have to exist, so

 - 32MB @ physical address 0x00:00000000
 - 32MB @ physical address 0x7f:00000000

  would be a valid configuration as well.

There are more examples, but #1 is becoming a widespread chip and #2
is a rather extreme example just to show how far discontiguity may go.

  Ralf

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  0:20             ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
  2002-05-01  1:35               ` Daniel Phillips
@ 2002-05-02  1:01               ` Andrea Arcangeli
  2002-05-02 15:28                 ` Anton Blanchard
  2002-05-02 23:05               ` Daniel Phillips
  2002-05-03 23:52               ` David Mosberger
  3 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02  1:01 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

On Thu, May 02, 2002 at 10:20:11AM +1000, Anton Blanchard wrote:
>  
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
> 
> >From arch/ppc64/kernel/iSeries_setup.c:
> 
>  * The iSeries may have very large memories ( > 128 GB ) and a partition
>  * may get memory in "chunks" that may be anywhere in the 2**52 real
>  * address space.  The chunks are 256K in size.
> 
> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> solution to this problem.

is this machine a numa machine? If not then discontigmem will work just
fine. Also it's a matter of administration: even if it's a numa machine
you can use it just optimally with discontigmem+numa. Regardless of what
we do, if the partitioning is bad the kernel will do badly. If you create
zillions of discontiguous nodes of 256K each, you'd need to waste memory
to handle them regardless of nonlinear or discontigmem (with discontigmem
you will waste more memory than nonlinear, yes, exactly because it's more
powerful, but I think a machine with a huge lot of non-contiguous 256K
chunks is misconfigured; it's as if you tried to install linux on a
machine after partitioning the HD into thousands of logical volumes
of 256K each [for the sake of this example let's assume there are
more than 256 LVs available in LVM]. A sane partitioning requires you to
have at least a /usr partition of around 1 gig, depending on what you're
doing of course, but requiring sane partitioning is an admin problem,
not a kernel problem IMHO).

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02  0:51               ` Ralf Baechle
@ 2002-05-02  1:27                 ` Andrea Arcangeli
  2002-05-02  1:32                   ` Ralf Baechle
  2002-05-02  8:50                   ` Roman Zippel
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02  1:27 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Daniel Phillips, Russell King, linux-kernel

On Wed, May 01, 2002 at 05:51:33PM -0700, Ralf Baechle wrote:
> On Wed, May 01, 2002 at 04:23:43PM -0700, Jesse Barnes wrote:
> 
> > On Thu, May 02, 2002 at 01:17:50AM +0200, Andrea Arcangeli wrote:
> > > so ia64 is one of those archs with a ram layout with huge holes in the
> > > middle of the ram of the nodes? I'd be curious to know what's the
> > 
> > Well, our ia64 platform is at least, but I think there are others.
> > 
> > > hardware advantage of designing the ram layout in such a way, compared
> > > to all other numa archs that I deal with. Also if you know other archs
> > > with huge holes in the middle of the ram of the nodes I'd be curious to
> > > know about them too. thanks for the interesting info!
> > 
> > AFAIK, some MIPS platforms (both NUMA and non-NUMA) have memory
> > layouts like this too.  I've never done hardware design before, so I'm
> > not sure if there's a good reason for such layouts.  Ralf or Daniel
> > might be able to shed some more light on that...
> 
> Just to give a few examples of memory layouts on MIPS systems. Sibyte 1250
> is as follows:
> 
>  - 256MB at physical address 0
>  - 512MB at physical address 0x80000000
>  - 256MB at physical address 0xc0000000
>  - The entire rest of the memory is mapped contiguously from physical
>    address 0x1:00000000 up.
>  All available memory is mapped from the lowest address up.

Is this a NUMA machine? If not then you should be perfectly fine with
discontigmem on this chip.


> 
> Origin 200/2000.  Each node has an address space of 2GB, each node has 4
>   memory banks, that is each bank takes 512MB of address space.  Even
>   unpopulated or partially populated banks take the full 512MB address
>   space.  Memory in partially populated banks is mapped at the beginning
>   of the bank's address space; each node must have have at least one
>   bank with memory in it, that is something like
> 
>  - 32MB @ physical address 0x00:00000000
>  - 32MB @ physical address 0x00:80000000
>  - 32MB @ physical address 0x01:00000000
>  ...
>  - 32MB @ physical address 0x7f:00000000
> 
>   would be a valid configuration.  That's 8GB of RAM scattered in tiny
>   chunks of just 32mb throughout 256MB address space.  In theory nodes
>   might not even have to exist, so
> 
>  - 32MB @ physical address 0x00:00000000
>  - 32MB @ physical address 0x7f:00000000
> 
>   would be a valid configuration as well.

this means 256 nodes. For example, that many different discontigmem nodes
would give you a measurable slowdown in the nr_free_pages O(N) loops
over the pgdat list, so nonlinear on the above hardware is a win. I
wasn't aware that such a memory layout actually existed. So if you
want to support the above efficiently, we must make it possible for you to
do nonlinear transparently to the common-code kernel abstraction. What
are actually the common-code changes involved with nonlinear? What I
care about is not to clobber the common code with additional overlapping
abstractions. We should try to make it possible to do nonlinear under
mips completely transparently to the current common code; if we do that
then you can use nonlinear to handle the above extreme origin 200/2k
scenario without the common code noticing it. Then there's no point
arguing about nonlinear versus discontigmem: nonlinear becomes mips' way
of handling virt_to_page and that's all, no config_nonlinear at all,
just select ARCH=mips instead of ARCH=x86. Then I'll be very fine with
it of course; it would become an obviously right implementation of
virt_to_page/pte_page for a certain arch.

> 
> There are other examples more but #1 is becoming a widespread chip and #2
> is a rather extreme example just to show how far discontiguity may go.
> 
>   Ralf


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02  1:27                 ` Andrea Arcangeli
@ 2002-05-02  1:32                   ` Ralf Baechle
  2002-05-02  8:50                   ` Roman Zippel
  1 sibling, 0 replies; 165+ messages in thread
From: Ralf Baechle @ 2002-05-02  1:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 03:27:25AM +0200, Andrea Arcangeli wrote:

> >  - 256MB at physical address 0
> >  - 512MB at physical address 0x80000000
> >  - 256MB at physical address 0xc0000000
> >  - The entire rest of the memory is mapped contiguously from physical
> >    address 0x1:00000000 up.
> >  All available memory is mapped from the lowest address up.
> 
> Is this a numa? If not then you should be just perfectly fine with
> discontigmem with this chip.

This is a system on a chip with memory controllers on die.  In theory
multiple of them can be combined into some crude ccNUMA system, but I
don't know whether people are actually doing that.

 Ralf

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01  1:26             ` Daniel Phillips
@ 2002-05-02  1:43               ` Andrea Arcangeli
  2002-05-01  2:41                 ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02  1:43 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, linux-kernel

On Wed, May 01, 2002 at 03:26:22AM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 02:47, Andrea Arcangeli wrote:
> > >   - Leads forward to interesting possibilities such as hot plug memory.
> > >     (Because pinned kernel memory can be remapped to an alternative
> > >     region of physical memory if desired)
> > 
> > You cannot handle hot plug with nonlinear, you cannot take the mem_map
> > contigous when somebody plugins new memory, you've to allocate the
> > mem_map in the new node, discontigmem allows that, nonlinear doesn't.
> 
> You have not read and understood the patch, which this comment demonstrates.
> 
> For your information, the mem_map lives in *virtual* memory, it does not
> need to change location, only the kernel page tables need to be updated,
> to allow a section of kernel memory to be moved to a different physical
> location.  For user memory, this was always possible, now it is possible
> for kernel memory as well.  Naturally, it's not all you have to do to get
> hotplug memory, but it's a big step in that direction.

what kernel pagetables? pagetables for space that you left free for
what? You waste virtual space for that, at the very least on x86 where
virtual space is very tight; at this point kernel virtual space is more
costly than physical space these days. And nevertheless most archs don't
have pagetables at all for reading and writing the page structures: yes
it's virtual memory, but it's a direct mapping. DaveM even rewrote the
sparc TLB miss handlers to skip the pagetable walking for the kernel
direct mapping, and alpha and mips have a kseg that maps to physical
addresses directly. There are _no_ pagetables for the mem_map on most
archs; on alpha the PALcode resolves kseg directly without touching the
TLB. So if you move the mem_map into pagetables, like modules, to handle
hotplug, you automatically make the whole kernel slower due to the
additional pte walking and TLB thrashing.

> > At the very least you should waste some tons of memory of unused mem_map
> > for all the potential memory that you're going to plugin, if you want to
> > handle hot-plug with nonlinear.
> 
> Eh.  No.
> 
> It's not useful for me to keep correcting you on your misunderstanding of
> what config_nonlinear actually does.  Please read Jonathan Corbet's
> excellent writeup in lwn, it's written in a very understandable fashion.
> 
> -- 
> Daniel


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01  1:35               ` Daniel Phillips
@ 2002-05-02  1:45                 ` William Lee Irwin III
  2002-05-01  2:02                   ` Daniel Phillips
  2002-05-02  1:46                 ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02  1:45 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel,
	Jesse Barnes

On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> to use a hash table instead of a table lookup.  Bill Irwin suggested a btree
> would work here as well.

I remember suggesting a sorted array of extents on which binary
search could be performed. A B-tree seems unlikely, but perhaps if
it were contiguously allocated and a few other tricks applied it might
do; maybe I don't remember the special sauce used for the occasion.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01  1:35               ` Daniel Phillips
  2002-05-02  1:45                 ` William Lee Irwin III
@ 2002-05-02  1:46                 ` Andrea Arcangeli
  2002-05-01  1:56                   ` Daniel Phillips
  1 sibling, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02  1:46 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Anton Blanchard, Russell King, linux-kernel, Jesse Barnes

On Wed, May 01, 2002 at 03:35:20AM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > > so ia64 is one of those archs with a ram layout with huge holes in the
> > > middle of the ram of the nodes? I'd be curious to know what's the
> > > hardware advantage of designing the ram layout in such a way, compared
> > > to all other numa archs that I deal with. Also if you know other archs
> > > with huge holes in the middle of the ram of the nodes I'd be curious to
> > > know about them too. thanks for the interesting info!
> > 
> > From arch/ppc64/kernel/iSeries_setup.c:
> > 
> >  * The iSeries may have very large memories ( > 128 GB ) and a partition
> >  * may get memory in "chunks" that may be anywhere in the 2**52 real
> >  * address space.  The chunks are 256K in size.
> > 
> > Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> > solution to this problem.
> 
> Using the config_nonlinear model, you'd change the four mapping functions:
> 
> 	logical_to_phys
> 	phys_to_logical
> 	pagenum_to_phys
> 	phys_to_pagenum
> 
> to use a hash table instead of a table lookup.  Bill Irwin suggested a btree
> would work here as well.

btree? A btree is not an interesting in-core data structure. Anyway, you
can use a btree with discontigmem too for the lookup. nonlinear will pay
off if you've got something on the order of 256 discontigmem chunks with
significant holes in between, like origin 2k, and I think it should be
resolved internally to the arch without exposing it to the common code.

> 
> (Note I'm trying out the term 'pagenum' instead of 'ordinal' here, following
> comments on lse-tech.)
> 
> -- 
> Daniel


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01  2:02                   ` Daniel Phillips
@ 2002-05-02  2:33                     ` William Lee Irwin III
  2002-05-01  2:44                       ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02  2:33 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel,
	Jesse Barnes

On Thursday 02 May 2002 03:45, William Lee Irwin III wrote:
>> I remember suggesting a sorted array of extents on which binary
>> search could be performed. A B-tree seems unlikely but perhaps if
>> it were contiguously allocated and some other tricks done it might
>> do, maybe I don't remember the special sauce used for the occasion.

On Wed, May 01, 2002 at 04:02:33AM +0200, Daniel Phillips wrote:
> Thanks for the correction.  When you said 'extents' I automatically thought
> 'btree of extents'.  I'd tend to go for the hash table anyway - your binary
> search is going to take quite a few more steps to terminate than the bucket
> search, given some reasonable choice of hash table size.

It's probably motivated more by sheer terror of another huge hash table
sized proportional to memory eating the kernel virtual address space
alive than anything else. I probably should have used reverse psychology
instead. I should note that the size of the array I suggested is not
proportional to memory, only to the number of fragments. It would
probably only have a distinct advantage in a situation where both the
fragment sizes and distributions are irregular; when the number of
fragments is in fact proportional to memory it gains little aside from
a small factor of compactness and/or in-core contiguity. The hashing
techniques that seem obvious to me effectively require some sort of
objects to back a direct mapping, which translates to per-page overhead,
which I'm very very picky about. I also like things to behave gracefully
about space and time when faced with irregular or "hostile" layouts.

Actually, now that I think about it, a contiguously-allocated B-tree of
extents doesn't sound bad at all, even without additional dressing. Do
you think it's worth a try?

Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  0:47           ` Andrea Arcangeli
  2002-05-01  1:26             ` Daniel Phillips
@ 2002-05-02  2:37             ` William Lee Irwin III
  2002-05-02 15:59               ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02  2:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 02:47:40AM +0200, Andrea Arcangeli wrote:
> Oh yeah, you save 1 microsecond every 10 years of uptime by taking
> advantage of the potentially coalesced cacheline between the last page
> in a node and the first page of the next node. Before you can care about
> this optimizations you should remove from x86 the pgdat loops that are
> not needed with discontigmem disabled like in x86 (this has nothing to
> do with discontigmem/nonlinear). That wouldn't be measurable too but at
> least it would be more worthwhile.

Which ones did you have in mind? I did poke around this area a bit, and
already have my eye on one...


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02  1:27                 ` Andrea Arcangeli
  2002-05-02  1:32                   ` Ralf Baechle
@ 2002-05-02  8:50                   ` Roman Zippel
  2002-05-01 13:21                     ` Daniel Phillips
  2002-05-02 18:35                     ` Geert Uytterhoeven
  1 sibling, 2 replies; 165+ messages in thread
From: Roman Zippel @ 2002-05-02  8:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ralf Baechle, Daniel Phillips, Russell King, linux-kernel

Hi,

On Thu, 2 May 2002, Andrea Arcangeli wrote:

> What I
> care about is not to clobber the common code with additional overlapping
> common code abstractions.

Just to throw in an alternative: on m68k we currently map everything
together into a single virtual area. This means the virtual<->physical
conversion is a bit more expensive, and mem_map is simply indexed by the
virtual address.
It works nicely; it just needs two small patches in the initialization
code, which aren't integrated yet. I think it's very close to what Daniel
wants, only that the logical and virtual addresses are identical.
bye, Roman


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01  2:41                 ` Daniel Phillips
@ 2002-05-02 13:34                   ` Andrea Arcangeli
  2002-05-02 15:18                     ` Martin J. Bligh
  2002-05-02 16:00                     ` William Lee Irwin III
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 13:34 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, linux-kernel

On Wed, May 01, 2002 at 04:41:17AM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 03:43, Andrea Arcangeli wrote:
> > On Wed, May 01, 2002 at 03:26:22AM +0200, Daniel Phillips wrote:
> > > For your information, the mem_map lives in *virtual* memory, it does not
> > > need to change location, only the kernel page tables need to be updated,
> > > to allow a section of kernel memory to be moved to a different physical
> > > location.  For user memory, this was always possible, now it is possible
> > > for kernel memory as well.  Naturally, it's not all you have to do to get
> > > hotplug memory, but it's a big step in that direction.
> > 
> > what kernel pagetables?
> 
> The normal page tables that are used to map kernel memory.
> 
> > pagetables for space that you left free for what?
> 
> These page tables have not been left free for anything.  The nice thing about
> page tables is that you can change the page table entries to point wherever
> you want.  (I know you know this.)  This is what config_nonlinear supports,
> and that is why it's called config_nonlinear.  When we want to remap part of
> the kernel memory to a different piece of physical memory, we just need to
> fill in some pte's.  The tricky part is knowing how to fill in the ptes, and
> config_nonlinear takes care of that.
> 
> > You waste virtual space for that at the very least on x86 that is
> > just very tigh, at this point kernel virtual space is more costly than
> > physical space these days. And nevertheless most archs doesn't have
> > pagetables at all to read and write the page structures. yes it's
> > virtual memory but it's a direct mapping.
> 
> Most architectures?  That's quite possibly an exaggeration.  Some
> architectures - MIPS32 for example - make this difficult or impossible,
> and so what?  Those can't do software hotplug memory, sorry.

alpha is the same as mips I think. sparc would be the same too, if
there's any discontigmem sparc. Dunno about arm. We're talking about
architectures needing discontigmem; 99% of users don't need
discontigmem in the first place. You never need discontigmem on x86, and
even in new-numa you don't need discontigmem: you want to pass through
discontigmem only to get the numa topology description that the current
discontigmem provides via the pgdat.
Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-01 13:21                     ` Daniel Phillips
@ 2002-05-02 14:00                       ` Roman Zippel
  2002-05-01 14:08                         ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-02 14:00 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Hi,

On Wed, 1 May 2002, Daniel Phillips wrote:

> > Just to throw in an alternative: On m68k we map currently everything
> > together into a single virtual area. This means the virtual<->physical
> > conversion is a bit more expensive and mem_map is simply indexed by the
> > the virtual address.
> 
> Are you talking about mm_ptov and friends here?  What are the loops for?

It simply searches through all memory nodes; it's not really efficient.

> Could you please describe the most extreme case of physical discontiguity
> you're handling?

I can't assume anything. I'm thinking about calculating the table
dynamically and patching the kernel at bootup; we are already doing
something similar in the Amiga/ppc kernel.

bye, Roman


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 13:34                   ` Andrea Arcangeli
@ 2002-05-02 15:18                     ` Martin J. Bligh
  2002-05-02 15:35                       ` Andrea Arcangeli
  2002-05-02 16:00                     ` William Lee Irwin III
  1 sibling, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-02 15:18 UTC (permalink / raw)
  To: Andrea Arcangeli, Daniel Phillips; +Cc: Russell King, linux-kernel

> alpha is the same as mips I think. sparc would be the same too if
> there's any discontigmem sparc. Dunno of arm. We're talking about
> architectures needing discontigmem, 99% percent of users  doesn't need
> discontigmem in the first place, you never need discontigmem in x86 and

That's not true. We use discontigmem on the NUMA-Q boxes for NUMA support.
In some memory models they're also really discontiguous-memory machines.
At the moment I use the contiguous memory model (so we only use discontig
for NUMA support), but I may need to change that in the future.

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  1:01               ` Andrea Arcangeli
@ 2002-05-02 15:28                 ` Anton Blanchard
  2002-05-01 16:10                   ` Daniel Phillips
                                     ` (2 more replies)
  0 siblings, 3 replies; 165+ messages in thread
From: Anton Blanchard @ 2002-05-02 15:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Daniel Phillips, Russell King, linux-kernel, Jesse Barnes

 
> is this machine a numa machine? If not then discontigmem will work just
> fine. also it's a matter of administration, even if it's a numa machine
> you can use it just optimally with discontigmem+numa. Regardless of what
> we do if the partitioning is bad the kernel will do bad. If you create
> zillon discontigous nodes of 256K each, you'd need waste memory to
> handle them regardless of nonlinear or discontigmem (with discontigmem
> you will waste more memory than nonlinear yes, exactly because it's more
> powerful, but I think a machine with an huge lot of non contigous 256K
> chunks is misconfigured, it's like if you pretend to install linux on a
> machine after you partitioned the HD with thousand of logical volumes
> large 256K each [for the sake of this example let's assume there are
> more than 256LV available in LVM], a sane partitioning requires you to
> have at least a partition for /usr large 1 giga, depends what you're
> doing of course, but requiring sane partitioning it's an admin problem
> not a kernel problem IMHO).

It's not a NUMA machine; it's one that allows shared-processor logical
partitions. While I would prefer the hypervisor to give each partition
a nice memory map (and internally maintain a mapping to real memory),
it does not. I can imagine that if the machine has been up for many
months, memory could become very fragmented.

Also when we do hotplug memory support will discontigmem be able to
efficiently handle memory turning up all over the place in the memory
map?

Anton

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:18                     ` Martin J. Bligh
@ 2002-05-02 15:35                       ` Andrea Arcangeli
  2002-05-01 15:42                         ` Daniel Phillips
  2002-05-02 16:07                         ` Martin J. Bligh
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 15:35 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote:
> > alpha is the same as mips I think. sparc would be the same too if
> > there's any discontigmem sparc. Dunno of arm. We're talking about
> > architectures needing discontigmem, 99% percent of users  doesn't need
> > discontigmem in the first place, you never need discontigmem in x86 and
> 
> That's not true. We use discontigmem on the NUMA-Q boxes for NUMA support.
> In some memory models, they're also really discontigous memory machines.

With numa-q there's a 512M hole in each node IIRC. That's a fine
configuration, similar to the wildfire btw.

> At the moment I use the contig memory model (so we only use discontig for
> NUMA support) but I may need to change that in the future.

I wasn't thinking of numa-q, but regardless numa-Q fits perfectly into
the current discontigmem-numa model too as far as I can see.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:28                 ` Anton Blanchard
  2002-05-01 16:10                   ` Daniel Phillips
@ 2002-05-02 15:59                   ` Dave Engebretsen
  2002-05-01 17:24                     ` Daniel Phillips
  2002-05-02 16:31                   ` William Lee Irwin III
  2 siblings, 1 reply; 165+ messages in thread
From: Dave Engebretsen @ 2002-05-02 15:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: Daniel Phillips

Anton Blanchard wrote:
> 
> > more than 256LV available in LVM], a sane partitioning requires you to
> > have at least a partition for /usr large 1 giga, depends what you're
> > doing of course, but requiring sane partitioning it's an admin problem
> > not a kernel problem IMHO).
> 
> Its not a NUMA machine, its one that allows shared processor logical
> partitions. While I would prefer the hypervisor to give each partition
> a nice memory map (and internally maintain a mapping to real memory)
> it does not. I can imagine if the machine has been up for many months
> memory could become very fragmented.
> 
> Also when we do hotplug memory support will discontigmem be able to
> efficiently handle memory turning up all over the place in the memory
> map?
> 
> Anton

On this type of partitioned system where ppc64 runs, there is not much
administration that could be done to help the problem.  As Anton
mentioned, when the system has been up for a long time, and memory has
been moving between partitions which support dynamic memory movement, it
is assured that memory will become very fragmented.  As more partitions
on these systems become available, and resources migrate more freely,
the problem will get worse.

Whether this management from kernel to hardware addresses is done in the
hypervisor layer or the OS, the same overhead exists, given today's
hardware structure for PowerPC servers anyway.  In today's ppc64
implementation, we just use an array to map from what the kernel sees as
its address space to what is put in the hardware page table and I/O
translation tables, thus not requiring any changes in architecture-independent code. 
This does consume some storage, but the highly fragmented nature of our
platform memory drives this decision.  I would like to see that data
structure decision left to the archs as different platform design points
may lead to different mapping decisions.
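A minimal sketch of such an array-based translation, assuming a fixed chunk size and an invented table (the real ppc64 code and its chunk granularity differ; everything here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SHIFT 24              /* 16MB chunks -- an assumption */
#define CHUNK_SIZE  (1UL << CHUNK_SHIFT)
#define NUM_CHUNKS  8

/* table mapping the kernel's flat "absolute" chunk index to the real
 * physical chunk handed out by the hypervisor; contents are made up */
static unsigned long chunk_map[NUM_CHUNKS] = { 3, 7, 1, 0, 6, 2, 5, 4 };

/* translate a kernel-visible address to a hardware physical address,
 * as would be done when inserting hashed page table / I/O mappings */
static uint64_t abs_to_phys(uint64_t abs_addr)
{
    uint64_t chunk  = abs_addr >> CHUNK_SHIFT;
    uint64_t offset = abs_addr & (CHUNK_SIZE - 1);
    return ((uint64_t)chunk_map[chunk] << CHUNK_SHIFT) | offset;
}
```

The lookup is one array index per translation, which is why it can stay hidden at the lowest level of the arch code.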

Dave.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  2:37             ` William Lee Irwin III
@ 2002-05-02 15:59               ` Andrea Arcangeli
  2002-05-02 16:06                 ` William Lee Irwin III
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 15:59 UTC (permalink / raw)
  To: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel

On Wed, May 01, 2002 at 07:37:11PM -0700, William Lee Irwin III wrote:
> On Thu, May 02, 2002 at 02:47:40AM +0200, Andrea Arcangeli wrote:
> > Oh yeah, you save 1 microsecond every 10 years of uptime by taking
> > advantage of the potentially coalesced cacheline between the last page
> > in a node and the first page of the next node. Before you can care about
> > this optimizations you should remove from x86 the pgdat loops that are
> > not needed with discontigmem disabled like in x86 (this has nothing to
> > do with discontigmem/nonlinear). That wouldn't be measurable too but at
> > least it would be more worthwhile.
> 
> Which ones did you have in mind? I did poke around this area a bit, and

All of them; if you implement a mechanism to skip one of the pgdat
loops, you could then skip all of them.
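The kind of pgdat loop under discussion can be sketched roughly like this (simplified stand-in structures, not the real 2.4 definitions):

```c
#include <assert.h>
#include <stddef.h>

/* cut-down stand-in for the kernel's pglist_data */
struct pglist_data {
    unsigned long node_free_pages;
    struct pglist_data *node_next;
};

/* walk every node's pgdat and sum its free pages; with discontigmem
 * disabled there is exactly one node, so the loop runs once and could
 * be collapsed away at compile time */
static unsigned long nr_free_pages(struct pglist_data *pgdat_list)
{
    unsigned long sum = 0;
    struct pglist_data *pgdat;

    for (pgdat = pgdat_list; pgdat; pgdat = pgdat->node_next)
        sum += pgdat->node_free_pages;
    return sum;
}

/* tiny two-node demonstration */
static unsigned long demo_sum(void)
{
    struct pglist_data n1 = { 100, NULL };
    struct pglist_data n0 = { 42, &n1 };
    return nr_free_pages(&n0);
}
```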

> already have my eye on one...
> 
> 
> Cheers,
> Bill


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 13:34                   ` Andrea Arcangeli
  2002-05-02 15:18                     ` Martin J. Bligh
@ 2002-05-02 16:00                     ` William Lee Irwin III
  1 sibling, 0 replies; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 16:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 03:34:02PM +0200, Andrea Arcangeli wrote:
> alpha is the same as mips I think. sparc would be the same too if
> there's any discontigmem sparc. Dunno about arm. We're talking about
> architectures needing discontigmem; 99% of users don't need
> discontigmem in the first place, you never need discontigmem in x86 and
> even in new-numa you don't need discontigmem, you want to pass through
> discontigmem only to get the numa topology description that the current
> discontigmem provides via the pgdat.

Any chance you could name a few of these mysterious new NUMA machines?


Thanks,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:59               ` Andrea Arcangeli
@ 2002-05-02 16:06                 ` William Lee Irwin III
  0 siblings, 0 replies; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 16:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

On Wed, May 01, 2002 at 07:37:11PM -0700, William Lee Irwin III wrote:
>> Which ones did you have in mind? I did poke around this area a bit, and

On Thu, May 02, 2002 at 05:59:46PM +0200, Andrea Arcangeli wrote:
> all of them, if you implement a mechanism to skip one of the pgdat
> loops, you could skip them of all then.

Not quite; I only had in mind a per-cpu free pages counter for nr_free_pages.
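The per-CPU counter idea could look roughly like this (NR_CPUS and the lock-free updates are simplifications for illustration, not the eventual kernel implementation):

```c
#include <assert.h>

#define NR_CPUS 4

/* each CPU updates only its own slot, avoiding cacheline bouncing */
static long free_pages_cpu[NR_CPUS];

static void mod_free_pages(int cpu, long delta)
{
    free_pages_cpu[cpu] += delta;
}

/* nr_free_pages() sums the per-CPU slots instead of walking every
 * pgdat's zones; the result may be slightly stale, which is fine for
 * a statistic */
static unsigned long nr_free_pages(void)
{
    unsigned long sum = 0;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        sum += free_pages_cpu[cpu];
    return sum;
}
```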


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01 15:42                         ` Daniel Phillips
@ 2002-05-02 16:06                           ` Andrea Arcangeli
  2002-05-02 16:10                             ` Martin J. Bligh
  2002-05-02 23:42                             ` Daniel Phillips
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 16:06 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, Russell King, linux-kernel

On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote:
> > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote:
> > > At the moment I use the contig memory model (so we only use discontig for
> > > NUMA support) but I may need to change that in the future.
> > 
> > I wasn't thinking of numa-q, but regardless numa-Q fits perfectly into
> > the current discontigmem-numa model too as far as I can see.
> 
> No it doesn't.  The config_discontigmem model forces all zone_normal memory
> to be on node zero, so all the remaining nodes can only have highmem locally.

You can trivially map the phys mem between 1G and 1G+256M to be in a
direct mapping between 3G+256M and 3G+512M, then you can put such 256M
at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.

The constraints you have on the normal memory are only two:

1) direct mapping
2) DMA

so as long as the ram is capable of 32bit DMA with pci32 and it's mapped
in the direct mapping you can put it into the normal zone. There is no
difference at all between discontigmem and nonlinear in this sense.
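The arithmetic in the example above can be sketched as follows, using the illustrative numbers from the mail (a real implementation would index a per-node table rather than hard-code one node):

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ULL * 1024)
#define GB (1024 * MB)

/* hypothetical node 1: 256M of physical RAM at 1G, direct-mapped at
 * virtual 3G+256M (right after node 0's direct mapping at 3G) */
#define NODE1_PHYS_BASE (1 * GB)
#define NODE1_VIRT_BASE (3 * GB + 256 * MB)

/* per-node __va(): subtract the node's physical base and add its
 * virtual base, instead of adding one global PAGE_OFFSET */
static uint64_t node1_va(uint64_t phys)
{
    return phys - NODE1_PHYS_BASE + NODE1_VIRT_BASE;
}

/* per-node __pa(): the inverse mapping */
static uint64_t node1_pa(uint64_t virt)
{
    return virt - NODE1_VIRT_BASE + NODE1_PHYS_BASE;
}
```

With per-node bases like these, the 256M at physical 1G stays in the direct mapping and can sit in node 1's ZONE_NORMAL.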

> Even with good cache hardware, this has to hurt.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:35                       ` Andrea Arcangeli
  2002-05-01 15:42                         ` Daniel Phillips
@ 2002-05-02 16:07                         ` Martin J. Bligh
  2002-05-02 16:58                           ` Gerrit Huizenga
  1 sibling, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-02 16:07 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

> With numa-q there's a 512M hole in each node IIRC. That's a fine
> configuration, similar to the wildfire btw.

There are 2 different memory models - the NT mode we use currently
is contiguous, the PTX mode is discontiguous. I don't think it's
as simple as a 512Mb fixed-size hole, though I'd have to look it
up to be sure.

>> At the moment I use the contig memory model (so we only use discontig for
>> NUMA support) but I may need to change that in the future.
> 
> I wasn't thinking of numa-q, but regardless numa-Q fits perfectly into
> the current discontigmem-numa model too as far as I can see.

As Dan already mentioned, we need CONFIG_NONLINEAR to spread
around ZONE_NORMAL.

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:06                           ` Andrea Arcangeli
@ 2002-05-02 16:10                             ` Martin J. Bligh
  2002-05-02 16:40                               ` Andrea Arcangeli
  2002-05-02 23:42                             ` Daniel Phillips
  1 sibling, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-02 16:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Daniel Phillips; +Cc: Russell King, linux-kernel

> You can trivially map the phys mem between 1G and 1G+256M to be in a
> direct mapping between 3G+256M and 3G+512M, then you can put such 256M
> at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.
> 
> The constraints you have on the normal memory are only two:
> 
> 1) direct mapping
> 2) DMA
> 
> so as long as the ram is capable of 32bit DMA with pci32 and it's mapped
> in the direct mapping you can put it into the normal zone. There is no
> difference at all between discontigmem and nonlinear in this sense.

Now imagine an 8 node system, with 4Gb of memory in each node.
First 4Gb is in node 0, second 4Gb is in node 1, etc.

Even with 64 bit DMA, the real problem is breaking the assumption
that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
That's 90% of the difficulty of what Dan's doing anyway, as I
see it.
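The 1:1 assumption being described is, in sketch form (PAGE_OFFSET here is the usual i386 3G split, purely for illustration):

```c
#include <assert.h>

#define PAGE_OFFSET 0xC0000000UL    /* classic i386 3G/1G split */

/* the classic identity: low physical memory is mapped 1:1 at
 * PAGE_OFFSET, so __pa()/__va() are a single subtraction/addition */
static unsigned long classic_pa(unsigned long vaddr)
{
    return vaddr - PAGE_OFFSET;
}

static unsigned long classic_va(unsigned long paddr)
{
    return paddr + PAGE_OFFSET;    /* only valid below ~896MB phys */
}
```

Breaking this single global offset, so kernel virtual space can cover pieces of several nodes, is the hard part being discussed.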

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:31                   ` William Lee Irwin III
@ 2002-05-02 16:21                     ` Dave Engebretsen
  2002-05-02 17:28                       ` William Lee Irwin III
  0 siblings, 1 reply; 165+ messages in thread
From: Dave Engebretsen @ 2002-05-02 16:21 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

William Lee Irwin III wrote:
> 
> On Fri, May 03, 2002 at 01:28:25AM +1000, Anton Blanchard wrote:
> > Also when we do hotplug memory support will discontigmem be able to
> > efficiently handle memory turning up all over the place in the memory
> > map?
> 
> Would the flip side of that coin perhaps be implementing a way to be a
> good logically partitioned citizen and cooperatively offline memory?
> 
> Cheers,
> Bill

Yes, both add and remove are needed to be a good citizen.  One could
spend all kinds of time coming up with good heuristics to do that
automatically :)

At a minimum, manual offlining of memory would be nice.

Dave.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 15:28                 ` Anton Blanchard
  2002-05-01 16:10                   ` Daniel Phillips
  2002-05-02 15:59                   ` Dave Engebretsen
@ 2002-05-02 16:31                   ` William Lee Irwin III
  2002-05-02 16:21                     ` Dave Engebretsen
  2 siblings, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 16:31 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel,
	Jesse Barnes

On Fri, May 03, 2002 at 01:28:25AM +1000, Anton Blanchard wrote:
> Also when we do hotplug memory support will discontigmem be able to
> efficiently handle memory turning up all over the place in the memory
> map?

Would the flip side of that coin perhaps be implementing a way to be a
good logically partitioned citizen and cooperatively offline memory?


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:10                             ` Martin J. Bligh
@ 2002-05-02 16:40                               ` Andrea Arcangeli
  2002-05-02 17:16                                 ` William Lee Irwin III
                                                   ` (2 more replies)
  0 siblings, 3 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 16:40 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
> > You can trivially map the phys mem between 1G and 1G+256M to be in a
> > direct mapping between 3G+256M and 3G+512M, then you can put such 256M
> > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.
> > 
> > The constraints you have on the normal memory are only two:
> > 
> > 1) direct mapping
> > 2) DMA
> > 
> > so as long as the ram is capable of 32bit DMA with pci32 and it's mapped
> > in the direct mapping you can put it into the normal zone. There is no
> > difference at all between discontigmem and nonlinear in this sense.
> 
> Now imagine an 8 node system, with 4Gb of memory in each node.
> First 4Gb is in node 0, second 4Gb is in node 1, etc.
> 
> Even with 64 bit DMA, the real problem is breaking the assumption
> that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
> That's 90% of the difficulty of what Dan's doing anyway, as I
> see it.

You don't need any additional common code abstraction to make virtual
address 3G+256G to point to physical address 1G as in my example above,
after that you're free to put the physical ram between 1G and 1G+256M
into the zone normal of node 1 and the stuff should keep working but
with zone-normal spread in more than one node.  You just have full
control on virt_to_page, pci_map_single, __va.  Actually it may as
well be cleaner to just let the arch define page_address() when
discontigmem is enabled (instead of hacking on top of __va); that's a
few-liner. (The only true limit you have is on the phys ram above 4G,
which definitely cannot go into zone-normal, regardless of whether it
belongs to a direct mapping, because of the pci32 API.)

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-01 17:24                     ` Daniel Phillips
@ 2002-05-02 16:44                       ` Dave Engebretsen
  0 siblings, 0 replies; 165+ messages in thread
From: Dave Engebretsen @ 2002-05-02 16:44 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel


> And it is left up to the arch in my patch, I've simply imposed a little more
> order on what, up till now, has been a pretty chaotic corner of the kernel,
> and provided a template that satisfies a wider variety of needs than the old
> one.

Yep, got that - just reenforcing the point.

> It sounds like the table translation you're doing in the hypervisor is
> exactly what I've implemented in the kernel.  One advantage of going with
> the kernel's implementation is that you get the benefit of improvements
> made to it, for example, the proposed hashing scheme to handle extremely
> fragmented physical memory maps.
> 

I should clarify a bit -- we run on two different hypervisor
interfaces.  The iSeries interface leaves this translation work to the
OS.  In that case Linux has an array translation lookup which is
analogous to your patch.  We just managed to hide everything in
arch/ppc64 by doing this lookup when inserting hashed page table and I/O
table mappings.  Other than at that low level, the remappings are
transparent to Linux -- it just sees a nice big flat physical address
space.

On pSeries, the hypervisor does the translation work under the covers,
but as you point out, Linux doesn't get the chance to play with
different mapping schemes.  Then again, that does simplify my life ...

Dave.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:07                         ` Martin J. Bligh
@ 2002-05-02 16:58                           ` Gerrit Huizenga
  2002-05-02 18:10                             ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Gerrit Huizenga @ 2002-05-02 16:58 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel

In message <3971861785.1020330424@[10.10.2.3]>, "Martin J. Bligh" writes:
> > With numa-q there's a 512M hole in each node IIRC. that's fine
> > configuration, similar to the wildfire btw.
> 
> There's 2 different memory models - the NT mode we use currently
> is contiguous, the PTX mode is discontiguous. I don't think it's
> as simple as a 512Mb fixed size hole, though I'd have to look it
> up to be sure.

No - it definitely isn't as simple as a 512 MB hole.  Depending on how much
memory is in each node, holes could be all kinds of sizes.  You could,
in theory, have had 128 MB in one node and 8 GB in another node.  I don't
think we had holes within the node from the software side - I think the
requirement was that all DIMMS were added in low to high memory slots.
Not sure what forced that requirement - could have been PTX, BIOS,
cache controllers, etc.

gerrit

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:40                               ` Andrea Arcangeli
@ 2002-05-02 17:16                                 ` William Lee Irwin III
  2002-05-02 18:41                                   ` Andrea Arcangeli
  2002-05-02 18:25                                 ` Daniel Phillips
  2002-05-02 19:31                                 ` Martin J. Bligh
  2 siblings, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 17:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
>> Even with 64 bit DMA, the real problem is breaking the assumption
>> that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
>> That's 90% of the difficulty of what Dan's doing anyway, as I
>> see it.

On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote:
> control on virt_to_page, pci_map_single, __va.  Actually it may as
> well be cleaner to just let the arch define page_address() when
> discontigmem is enabled (instead of hacking on top of __va); that's a
> few-liner. (The only true limit you have is on the phys ram above 4G,
> which definitely cannot go into zone-normal, regardless of whether it
> belongs to a direct mapping, because of the pci32 API.)
> Andrea

Being unable to have any ZONE_NORMAL above 4GB allows no change at all.
32-bit PCI is not used on NUMA-Q AFAIK.

So long as zones are physically contiguous and __va() does what its
name implies, page_address() should operate properly aside from the
sizeof(phys_addr) > sizeof(unsigned long) overflow issue (which I
believe was recently resolved; if not I will do so myself shortly).
With SGI's discontigmem, one would need an UNMAP_NR_DENSE(), as the
position in the mem_map array does not describe the offset into the region
of physical memory occupied by the zone. UNMAP_NR_DENSE() may be
expensive enough that architectures using MAP_NR_DENSE() may be better off
using ARCH_WANT_VIRTUAL, as page_address() is a common operation. If
space conservation is as important a consideration for stability as it
is on architectures with severely limited kernel virtual address spaces,
it may be preferable to implement such despite the computational expense.
iSeries will likely have physically discontiguous zones and so it won't
be able to use an address-calculation-based page_address() either.
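The address-calculation variant of page_address() being contrasted here can be sketched as follows (structures are simplified stand-ins; the ARCH_WANT_VIRTUAL alternative would instead read a stored field in struct page):

```c
#include <assert.h>

#define PAGE_SHIFT 12

struct page {
    void *virtual;    /* used only by the stored-address variant */
};

/* cut-down zone: first struct page of the zone plus the kernel
 * virtual address of the zone's first page */
struct zone {
    struct page *zone_mem_map;
    unsigned long zone_start_vaddr;
};

/* calculation-based page_address(): valid only while the zone is
 * physically contiguous and direct-mapped, since it derives the
 * address purely from the page's position in the zone's mem_map */
static unsigned long page_address_calc(struct zone *z, struct page *p)
{
    return z->zone_start_vaddr +
           ((unsigned long)(p - z->zone_mem_map) << PAGE_SHIFT);
}

/* small demonstration: third page of a zone mapped at 3G */
static unsigned long demo(void)
{
    static struct page pages[4];
    struct zone z = { pages, 0xC0000000UL };
    return page_address_calc(&z, &pages[2]);
}
```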


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:21                     ` Dave Engebretsen
@ 2002-05-02 17:28                       ` William Lee Irwin III
  0 siblings, 0 replies; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 17:28 UTC (permalink / raw)
  To: Dave Engebretsen; +Cc: linux-kernel

William Lee Irwin III wrote:
>> Would the flip side of that coin perhaps be implementing a way to be a
>> good logically partitioned citizen and cooperatively offline memory?
>> Cheers,
>> Bill

On Thu, May 02, 2002 at 11:21:59AM -0500, Dave Engebretsen wrote:
> Yes, both add and remove are needed to be a good citizen.  One could
> spend all kinds of time coming up with good heuristics to do that
> automatically :)
> At a minimum, manual offlining of memory would be nice.
> Dave.

I have a particular interest in the implementation of at least one
mechanism in the kernel (rmap) which could be exploited to assist
in this. If there are other efforts in progress toward this end I'd
be happy to investigate methods of using the additional machinery
provided by rmap to assist in this.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-01 14:08                         ` Daniel Phillips
@ 2002-05-02 17:56                           ` Roman Zippel
  2002-05-01 17:59                             ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-02 17:56 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Daniel Phillips wrote:

> Maybe this is a good place to try out a hash table variant of
> config_nonlinear.  It's got to be more efficient than searching all the
> nodes, don't you think?

Most of the time there are only a few nodes, I just don't know where and
how big they are, so I don't think a hash based approach will be a lot
faster. When I'm going to change this, I'd rather try the dynamic table
approach.

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:58                           ` Gerrit Huizenga
@ 2002-05-02 18:10                             ` Andrea Arcangeli
  2002-05-02 19:28                               ` Gerrit Huizenga
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 18:10 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:58:02AM -0700, Gerrit Huizenga wrote:
> In message <3971861785.1020330424@[10.10.2.3]>, "Martin J. Bligh" writes:
> > > With numa-q there's a 512M hole in each node IIRC. That's a fine
> > > configuration, similar to the wildfire btw.
> > 
> > There's 2 different memory models - the NT mode we use currently
> > is contiguous, the PTX mode is discontiguous. I don't think it's
> > as simple as a 512Mb fixed size hole, though I'd have to look it
> > up to be sure.
> 
> No - it definitely isn't as simple as a 512 MB hole.  Depends on how much

I meant that as an example; I recall that was a valid config: 512M of ram
and a 512M hole, then the next node 512M ram and a 512M hole, etc. Of course it
must be possible to vary the mem size if you want more or less ram in
each node, but still it doesn't generate a problematic layout for
discontigmem (i.e. not 256 discontiguous chunks or something of that
order).

> memory is in each node, holes could be all kinds of sizes.  You could,
> in theory, have had 128 MB in one node and 8 GB in another node.  I don't
> think we had holes within the node from the software side - I think the
> requirement was that all DIMMS were added in low to high memory slots.
> Not sure what forced that requirement - could have been PTX, BIOS,
> cache controllers, etc.
> 
> gerrit


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:40                               ` Andrea Arcangeli
  2002-05-02 17:16                                 ` William Lee Irwin III
@ 2002-05-02 18:25                                 ` Daniel Phillips
  2002-05-02 18:44                                   ` Andrea Arcangeli
  2002-05-02 19:31                                 ` Martin J. Bligh
  2 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 18:25 UTC (permalink / raw)
  To: Andrea Arcangeli, Martin J. Bligh; +Cc: Russell King, linux-kernel

On Thursday 02 May 2002 18:40, Andrea Arcangeli wrote:
> On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
> > > You can trivially map the phys mem between 1G and 1G+256M to be in a
> > > direct mapping between 3G+256M and 3G+512M, then you can put such 256M
> > > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.
> > > 
> > > The constraints you have on the normal memory are only two:
> > > 
> > > 1) direct mapping
> > > 2) DMA
> > > 
> > > so as long as the ram is capable of 32bit DMA with pci32 and it's mapped
> > > in the direct mapping you can put it into the normal zone. There is no
> > > difference at all between discontigmem and nonlinear in this sense.
> > 
> > Now imagine an 8 node system, with 4Gb of memory in each node.
> > First 4Gb is in node 0, second 4Gb is in node 1, etc.
> > 
> > Even with 64 bit DMA, the real problem is breaking the assumption
> > that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
> > That's 90% of the difficulty of what Dan's doing anyway, as I
> > see it.
> 
> You don't need any additional common code abstraction to make virtual
> address 3G+256G to point to physical address 1G as in my example above,
          M ----^
> after that you're free to put the physical ram between 1G and 1G+256M
> into the zone normal of node 1 and the stuff should keep working but
> with zone-normal spread in more than one node.

I don't see that you accomplished that at all, with config_discontig.
How can you address the memory at 3G+256M?  That looks like highmem to
me.  No good at all for kmem caches, buffers, struct pages, etc.
Without config_nonlinear, those structures will all have to be off-node
for most nodes.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-01 17:59                             ` Daniel Phillips
@ 2002-05-02 18:26                               ` Roman Zippel
  2002-05-02 18:32                                 ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-02 18:26 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Hi,

Daniel Phillips wrote:

> > Most of the time there are only a few nodes, I just don't know where and
> > how big they are, so I don't think a hash based approach will be a lot
> > faster. When I'm going to change this, I'd rather try the dynamic table
> > approach.
> 
> Which dynamic table approach is that?

I mean calculating the lookup table and patching the kernel at startup.
Anyway, I agree with Andrea that another mapping isn't really needed.
Clever use of the mmu should give you almost the same result.

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02 18:26                               ` Roman Zippel
@ 2002-05-02 18:32                                 ` Daniel Phillips
  2002-05-02 19:40                                   ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 18:32 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 20:26, Roman Zippel wrote:
> Hi,
> 
> Daniel Phillips wrote:
> 
> > > Most of the time there are only a few nodes, I just don't know where and
> > > how big they are, so I don't think a hash based approach will be a lot
> > > faster. When I'm going to change this, I'd rather try the dynamic table
> > > approach.
> > 
> > Which dynamic table approach is that?
> 
> I mean calculating the lookup table and patching the kernel at startup.

Patching the kernel how, and where?

Calculating the lookup table automatically at startup is definitely planned,
and yes, essential to avoid an unmanageble proliferation of configuration
files.  It's also possible to pass the configuration as a list of
mem=size@physaddr kernel command line entries, which is a pragmatic solution
for configurations with unusual memory mappings, but not too many of them.

> Anyway, I agree with Andrea, that another mapping isn't really needed.
> Clever use of the mmu should give you almost the same result.

We *are* making clever use of the mmu in config_nonlinear, it is doing the
nonlinear kernel virtual mapping for us.  Did you have something more clever
in mind?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02  8:50                   ` Roman Zippel
  2002-05-01 13:21                     ` Daniel Phillips
@ 2002-05-02 18:35                     ` Geert Uytterhoeven
  2002-05-02 18:39                       ` Daniel Phillips
  1 sibling, 1 reply; 165+ messages in thread
From: Geert Uytterhoeven @ 2002-05-02 18:35 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Andrea Arcangeli, Ralf Baechle, Daniel Phillips, Russell King,
	Linux Kernel Development

On Thu, 2 May 2002, Roman Zippel wrote:
> On Thu, 2 May 2002, Andrea Arcangeli wrote:
> > What I
> > care about is not to clobber the common code with additional overlapping
> > common code abstractions.
> 
> Just to throw in an alternative: On m68k we currently map everything
> together into a single virtual area. This means the virtual<->physical
> conversion is a bit more expensive and mem_map is simply indexed by
> the virtual address.
> It works nicely, it just needs two small patches in the initialization
> code, which aren't integrated yet. I think it's very close to what Daniel
> wants, only that the logical and virtual address are identical.

I also want to add that the order (by address) of the virtual chunks is not
necessarily the same as the order (by address) of the physical chunks.

So it's perfectly possible to put the kernel in the second physical chunk, in
which case the first physical chunk (with a lower physical address) ends up in
the virtual list behind the second physical chunk.

IIRC (/me no Linux mm wizard), the above was the main reason why the
current zone system doesn't work well for m68k boxes (mainly talking about
Amiga).

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02 18:35                     ` Geert Uytterhoeven
@ 2002-05-02 18:39                       ` Daniel Phillips
  0 siblings, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 18:39 UTC (permalink / raw)
  To: Geert Uytterhoeven, Roman Zippel
  Cc: Andrea Arcangeli, Ralf Baechle, Russell King, Linux Kernel Development

On Thursday 02 May 2002 20:35, Geert Uytterhoeven wrote:
> On Thu, 2 May 2002, Roman Zippel wrote:
> > On Thu, 2 May 2002, Andrea Arcangeli wrote:
> > > What I
> > > care about is not to clobber the common code with additional overlapping
> > > common code abstractions.
> > 
> > Just to throw in an alternative: On m68k we currently map everything
> > together into a single virtual area. This means the virtual<->physical
> > conversion is a bit more expensive and mem_map is simply indexed by
> > the virtual address.
> > It works nicely, it just needs two small patches in the initialization
> > code, which aren't integrated yet. I think it's very close to what Daniel
> > wants, only that the logical and virtual address are identical.
> 
> I also want to add that the order (by address) of the virtual chunk is not
> necessarily the same as the order (by address) of the physical chunks.
> 
> So it's perfect possible to put the kernel in the second physical chunk, in
> which case the first physical chunk (with a lower physical address) ends up in
> the virtual list behind the first physical chunk.
> 
> IIRC (/me no Linux mm whizard), the above reason was the main reason why the
> current zone system doesn't work well for m68k boxes (mainly talking about
> Amiga).

Config_nonlinear will handle this situation without difficulty.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 17:16                                 ` William Lee Irwin III
@ 2002-05-02 18:41                                   ` Andrea Arcangeli
  2002-05-02 19:19                                     ` William Lee Irwin III
  2002-05-02 19:22                                     ` Daniel Phillips
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 18:41 UTC (permalink / raw)
  To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips,
	Russell King, linux-kernel

On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
> On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
> >> Even with 64 bit DMA, the real problem is breaking the assumption
> >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
> >> That's 90% of the difficulty of what Dan's doing anyway, as I
> >> see it.
> 
> On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote:
> > control on virt_to_page, pci_map_single, __va.  Actually it may be as
> > well cleaner to just let the arch define page_address() when
> > discontigmem is enabled (instead of hacking on top of __va), that's a
> > few liner. (the only true limit you have is on the phys ram above 4G,
> > that cannot definitely go into zone-normal regardless if it belongs to a
> > direct mapping or not because of pci32 API)
> > Andrea
> 
> Being unable to have any ZONE_NORMAL above 4GB allows no change at all.

No change if your first node maps the whole first 4G of physical address
space, but in that case nonlinear cannot help you in any way either.
That you can make no change at all is purely down to the fact that
GFP_KERNEL must return memory accessible from a pci32 device.

I think most configurations have more than one node mapped into the
first 4G, and in those configurations you can make changes and spread
the direct mapping across all the nodes mapped in the first 4G phys.

Whether or not you can change something has nothing to do with
discontigmem or nonlinear; it's all about pci32.

> 32-bit PCI is not used on NUMA-Q AFAIK.

But you can plug 32-bit pci hardware into your 64bit-pci slots, right?
If not, and if you're also sure the linux drivers for your hardware are all
64bit-pci capable, then you can make the changes regardless of the 4G
limit; in that case you can spread the direct mapping over the whole
64G of physical ram, wherever you want, with no 4G constraint anymore.

> 
> So long as zones are physically contiguous and __va() does what its

Zones remain physically contiguous; it's the virtual address returned by
page_address that changes. The kmap header will also need some
modification: you should always check for PageHIGHMEM in all places to
know whether you must kmap or not; that's a few-liner.

> name implies, page_address() should operate properly aside from the
> sizeof(phys_addr) > sizeof(unsigned long) overflow issue (which I
> believe was recently resolved; if not I will do so myself shortly).
> With SGI's discontigmem, one would need an UNMAP_NR_DENSE() as the
> position in the mem_map array does not describe the offset into the region
> of physical memory occupied by the zone. UNMAP_NR_DENSE() may be
> expensive enough that architectures using MAP_NR_DENSE() may be better
> off using ARCH_WANT_VIRTUAL, as page_address() is a common operation. If

Yes, as an alternative to moving page_address into the arch code, you can
set WANT_PAGE_VIRTUAL, since as you say such a function is going to be
more expensive. (If it's only a few instructions, you can instead consider
moving page_address into the arch code, as said in the previous email,
instead of hacking on __va.)

> space conservation is as important a consideration for stability as it
> is on architectures with severely limited kernel virtual address spaces,
> it may be preferable to implement such despite the computational expense.
> iSeries will likely have physically discontiguous zones and so it won't
> be able to use an address calculation based page_address() either.

If you need to support a huge number of discontiguous zones then I'm the
first to agree you want nonlinear instead of discontigmem; I wasn't
aware that hardware exists that normally needs to support hundreds or
thousands of discontiguous zones, for which discontigmem is
prohibitive due to the O(N) complexity of some code paths. That's not the
case for NUMA-Q though, which also needs the different pgdat structures
for the numa optimizations anyway (and a physical memory partitioned
into hundreds of discontiguous zones still looks to me like a hard disk
partitioned into hundreds of different blkdevs).

BTW, about the pgdat loop optimizations: you misunderstood what I meant
in a previous email. By "removing them" I didn't mean removing them
in the discontigmem case, which would have to be done case by case;
I meant removing them only in mainline 2.4.19-pre7 when the kernel is
compiled for an x86 target, as 99% of the user base uses it. A
discontigmem using nonlinear also doesn't need to loop. It's a one-branch
removal optimization (it doesn't decrease the complexity of the algorithm
when discontigmem is enabled); it's all conditional on #ifndef
CONFIG_DISCONTIGMEM. Dropping the loop when discontigmem is enabled is a
much more interesting optimization, of course.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 18:25                                 ` Daniel Phillips
@ 2002-05-02 18:44                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 18:44 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, Russell King, linux-kernel

On Thu, May 02, 2002 at 08:25:35PM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 18:40, Andrea Arcangeli wrote:
> > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
> > > > You can trivially map the phys mem between 1G and 1G+256M to be in a
> > > > direct mapping between 3G+256M and 3G+512M, then you can put such 256M
> > > > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.
> > > > 
> > > > The constraints you have on the normal memory are only two:
> > > > 
> > > > 1) direct mapping
> > > > 2) DMA
> > > > 
> > > > so as far as the ram is capable of 32bit DMA with pci32 and it's mapped
> > > > in the direct mapping you can put it into the normal zone. There is no
> > > > difference at all between discontimem or nonlinear in this sense.
> > > 
> > > Now imagine an 8 node system, with 4Gb of memory in each node.
> > > First 4Gb is in node 0, second 4Gb is in node 1, etc.
> > > 
> > > Even with 64 bit DMA, the real problem is breaking the assumption
> > > that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
> > > That's 90% of the difficulty of what Dan's doing anyway, as I
> > > see it.
> > 
> > You don't need any additional common code abstraction to make virtual
								  ^^^^^^^
> > address 3G+256G to point to physical address 1G as in my example above,
>           M ----^

indeed

> > after that you're free to put the physical ram between 1G and 1G+256M
> > into the zone normal of node 1 and the stuff should keep working but
> > with zone-normal spread in more than one node.
> 
> I don't see that you accomplished that at all, with config_discontig.
> How can you address the memory at 3G+256M?  That looks like highmem to

That's virtual memory; to access it you only need to dereference the
address. To get the struct page * you can simply use virt_to_page(3G+256M),
and it will return a page at phys address 1G+256M.

> me.  No good at all for kmem caches, buffers, struct pages, etc.

It is good for kmem caches, buffers, struct pages, pci32: it's ZONE_NORMAL memory.

> Without config_nonlinear, those structures will all have to be off-node
> for most nodes.
> 
> -- 
> Daniel


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:31                                 ` Martin J. Bligh
@ 2002-05-02 18:57                                   ` Andrea Arcangeli
  2002-05-02 19:08                                     ` Daniel Phillips
  2002-05-02 22:39                                     ` Martin J. Bligh
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-02 18:57 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 12:31:52PM -0700, Martin J. Bligh wrote:
> between physical and virtual memory to a non-1-1 mapping.

correct. The direct mapping is nothing magic; it's like a big static
kmap area.  Everybody is required to use
virt_to_page/page_address/pci_map_single/... to switch between a virtual
address and mem_map anyway (thanks to the discontiguous mem_map), so you
can use this property by making the virtual space discontiguous as well,
not only the mem_map.  Discontigmem basically just allows that.

> No, you don't need to call changing that mapping "CONFIG_NONLINEAR",
> but that's basically what the bulk of Dan's patch does, so I think we should 
> steal it with impunity ;-) 

The difference is that if you use discontigmem you don't clobber the
common code in any way: there is no "logical/ordinal" abstraction,
there is no special table, it's all hidden in the arch section, and you
need the pgdats anyway to allocate node-affine memory with numa.

Actually the same mmu technique can be used to coalesce the
discontiguous chunks of iSeries in virtual memory; you're then left with
the lookup in the tree to resolve from mem_map to the right virtual
address and from the right virtual address back to mem_map (with
DISCONTIGMEM left disabled). I think it should be possible.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 18:57                                   ` Andrea Arcangeli
@ 2002-05-02 19:08                                     ` Daniel Phillips
  2002-05-03  5:15                                       ` Andrea Arcangeli
  2002-05-02 22:39                                     ` Martin J. Bligh
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 19:08 UTC (permalink / raw)
  To: Andrea Arcangeli, Martin J. Bligh; +Cc: Russell King, linux-kernel

On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote:
> On Thu, May 02, 2002 at 12:31:52PM -0700, Martin J. Bligh wrote:
> > between physical and virtual memory to a non-1-1 mapping.
> 
> correct. The direct mapping is nothing magic, it's like a big static
> kmap area.  Everybody is required to use
> virt_to_page/page_address/pci_map_single/... to switch between virtual
> address and mem_map anyways (thanks to the discontigous mem_map), so you
> can use this property by making discontigous the virtual space as well,
> not only the mem_map.  discontigmem basically just allows that.

And what if you don't have enough virtual space to fit all the memory you
need, plus the holes?  Config_nonlinear handles that, config_discontig
doesn't.

> > No, you don't need to call changing that mapping "CONFIG_NONLINEAR",
> > but that's basically what the bulk of Dan's patch does, so I think we should 
> > steal it with impunity ;-) 
> 
> The difference is that if you use discontigmem you don't clobber the
> common code in any way,

First that's wrong.  Look at _alloc_pages and tell me that config_discontig
doesn't impact the common code (in fact, it adds two extra subroutine
calls, including two loops, to every alloc_pages call).

Secondly, config_nonlinear does not clobber the common code.  If it does,
please show me where.

When config_nonlinear is not enabled, suitable stubs are provided to make it
transparent.

> Actually the same mmu technique can be used to coalesce in virtual
> memory the discontigous chunks of iSeries, then you left the lookup in
> the tree to resolve from mem_map to the right virtual address and from
> the right virtual address back to mem_map. (and you left DISCONTIGMEM
> disabled) I think it should be possible.

So you're proposing a new patch?  Have you chosen a name for it?  How
about 'config_nonlinear'? ;-)

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 18:41                                   ` Andrea Arcangeli
@ 2002-05-02 19:19                                     ` William Lee Irwin III
  2002-05-02 19:27                                       ` Daniel Phillips
                                                         ` (2 more replies)
  2002-05-02 19:22                                     ` Daniel Phillips
  1 sibling, 3 replies; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 19:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
>> Being unable to have any ZONE_NORMAL above 4GB allows no change at all.

On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> No change if your first node maps the whole first 4G of physical address
> space, but in that case nonlinear cannot help you in any way either.
> That you can make no change at all is purely down to the fact that
> GFP_KERNEL must return memory accessible from a pci32 device.

Without relaxing this invariant for this architecture there is no hope
that NUMA-Q can ever be efficiently operated by this kernel.


On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> I think most configurations have more than one node mapped into the
> first 4G, and so in those configurations you can do changes and spread
> the direct mapping across all the nodes mapped in the first 4G phys.

These would be partially-populated nodes. There may be up to 16 nodes.
Some unusual management of the interrupt controllers is required to get
the last 4 cpus. Those who know how tend to disavow the knowledge. =)


On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> the fact you can or you can't have something to change with discontigmem
> or nonlinear, it's all about pci32.

Artificially tying together the device-addressability of memory and the
virtual addressability of memory is a fundamental design decision which
seems to behave poorly for NUMA-Q, though in general it seems to work okay.


On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
>> 32-bit PCI is not used on NUMA-Q AFAIK.

On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> but can you plugin 32bit pci hardware into your 64bit-pci slots, right?
> If not, and if you're also sure the linux drivers for your hardware are all
> 64bit-pci capable then you can do the changes regardless of the 4G
> limit, in such case you can spread the direct mapping all over the whole
> 64G physical ram, whereever you want, no 4G constraint anymore.

I believe 64-bit PCI is pretty much taken to be a requirement; if it
weren't, the 4GB limit would once again apply and we'd be in much
trouble, or we'd have to implement a different method of accommodating
limited device addressing capabilities and would be in trouble again.


On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
>> So long as zones are physically contiguous and __va() does what its

On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> Zones remain physically contiguous; it's the virtual address returned by
> page_address that changes. The kmap header will also need some
> modification: you should always check for PageHIGHMEM in all places to
> know whether you must kmap or not; that's a few-liner.

I've not been using the generic page_address() in conjunction with
highmem, but this sounds like a very natural thing to do when the need
arises; arranging for storage of the virtual address sounds trickier,
though doable. I'm not sure whether mainline would want it, and I don't
feel a pressing need to implement it yet, but then again, I've not yet
been parked in front of a 64GB x86 machine...


On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> BTW, about the pgdat loop optimizations: you misunderstood what I meant
> in a previous email. By "removing them" I didn't mean removing them
> in the discontigmem case, which would have to be done case by case;
> I meant removing them only in mainline 2.4.19-pre7 when the kernel is
> compiled for an x86 target, as 99% of the user base uses it. A
> discontigmem using nonlinear also doesn't need to loop. It's a one-branch
> removal optimization (it doesn't decrease the complexity of the algorithm
> when discontigmem is enabled). It's all conditional on
> #ifndef CONFIG_DISCONTIGMEM.

From my point of view this would be totally uncontroversial. Some arch
maintainers might want a different #ifdef condition, but it's fairly
trivial to adjust it to their needs when they speak up.


On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> Dropping the loop when discontigmem is enabled is much more interesting
> optimization of course.
> Andrea

Absolutely; I'd be very supportive of improvements for this case as well.
Many of the systems with the need for discontiguous memory support will
also benefit from parallelizations or other methods of avoiding references
to remote nodes/zones or iterations over all nodes/zones.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 18:41                                   ` Andrea Arcangeli
  2002-05-02 19:19                                     ` William Lee Irwin III
@ 2002-05-02 19:22                                     ` Daniel Phillips
  2002-05-03  6:06                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 19:22 UTC (permalink / raw)
  To: Andrea Arcangeli, William Lee Irwin III, Martin J. Bligh,
	Russell King, linux-kernel

On Thursday 02 May 2002 20:41, Andrea Arcangeli wrote:
> On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
> > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
> > >> Even with 64 bit DMA, the real problem is breaking the assumption
> > >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
> > >> That's 90% of the difficulty of what Dan's doing anyway, as I
> > >> see it.
> > 
> > On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote:
> > > control on virt_to_page, pci_map_single, __va.  Actually it may be as
> > > well cleaner to just let the arch define page_address() when
> > > discontigmem is enabled (instead of hacking on top of __va), that's a
> > > few liner. (the only true limit you have is on the phys ram above 4G,
> > > that cannot definitely go into zone-normal regardless if it belongs to a
> > > direct mapping or not because of pci32 API)
> > > Andrea
> > 
> > Being unable to have any ZONE_NORMAL above 4GB allows no change at all.
> 
> No change if your first node maps the whole first 4G of physical address
> space, but in such case nonlinear cannot help you in any way anyways.

You *still don't have a clue what config_nonlinear does*.

It doesn't matter if the first 4G of physical memory belongs to node zero.
Config_nonlinear allows you to map only part of that to the kernel virtual
space, and the rest would be mapped to highmem.  The next node will map part
of its local memory (perhaps the next 4 gig of physical memory) to a different
part of the kernel virtual space, and so on, so that in the end, all nodes
have at least *some* zone_normal memory.

Do you now see why config_nonlinear is needed in this case?  Are you
willing to recognize the possibility that you might have missed some other
cases where config_nonlinear is needed, and config_discontigmem won't do
the job?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:19                                     ` William Lee Irwin III
@ 2002-05-02 19:27                                       ` Daniel Phillips
  2002-05-02 19:38                                         ` William Lee Irwin III
  2002-05-03  6:10                                         ` Andrea Arcangeli
  2002-05-02 22:20                                       ` Martin J. Bligh
  2002-05-03  6:04                                       ` Andrea Arcangeli
  2 siblings, 2 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 19:27 UTC (permalink / raw)
  To: William Lee Irwin III, Andrea Arcangeli
  Cc: Martin J. Bligh, Russell King, linux-kernel

On Thursday 02 May 2002 21:19, William Lee Irwin III wrote:
> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > Dropping the loop when discontigmem is enabled is much more interesting
> > optimization of course.
> > Andrea
> 
> Absolutely; I'd be very supportive of improvements for this case as well.
> Many of the systems with the need for discontiguous memory support will
> also benefit from parallelizations or other methods of avoiding references
> to remote nodes/zones or iterations over all nodes/zones.

Which loop in which function are we talking about?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 18:10                             ` Andrea Arcangeli
@ 2002-05-02 19:28                               ` Gerrit Huizenga
  2002-05-02 22:23                                 ` Martin J. Bligh
  2002-05-03  6:20                                 ` Andrea Arcangeli
  0 siblings, 2 replies; 165+ messages in thread
From: Gerrit Huizenga @ 2002-05-02 19:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel

In message <20020502201043.L11414@dualathlon.random>, > : Andrea Arcangeli writes:
> On Thu, May 02, 2002 at 09:58:02AM -0700, Gerrit Huizenga wrote:
> > In message <3971861785.1020330424@[10.10.2.3]>, > : "Martin J. Bligh" writes:
> > > > With numa-q there's a 512M hole in each node IIRC. that's fine
> > > > configuration, similar to the wildfire btw.
> > > 
> > > There's 2 different memory models - the NT mode we use currently
> > > is contiguous, the PTX mode is discontiguous. I don't think it's
> > > as simple as a 512Mb fixed size hole, though I'd have to look it
> > > up to be sure.
> > 
> > No - it definitely isn't as simple as a 512 MB hole.  Depends on how much
> 
> I meant that as an example; I recall that was a valid config: 512M of ram
> and a 512M hole, then the next node 512M of ram and a 512M hole, etc. Of
> course it must be possible to vary the mem size if you want more or less
> ram in each node, but it still doesn't generate a problematic layout for
> discontigmem (i.e. not 256 discontiguous chunks or something of that
> order).

I *think* the ranges were typically aligned to 4 GB, although with 8 GB
in a single node, I don't remember what the mapping layout looked like.

That alignment made everything but node 0 into HIGHMEM.

With the "flat" addressing mode that Martin has been using (the
dummied down for NT version) everything is squished together.  That
makes it a bit harder to do node local data structures, although he
may have enough data from the MPS table to split memory appropriately.

gerrit

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:40                               ` Andrea Arcangeli
  2002-05-02 17:16                                 ` William Lee Irwin III
  2002-05-02 18:25                                 ` Daniel Phillips
@ 2002-05-02 19:31                                 ` Martin J. Bligh
  2002-05-02 18:57                                   ` Andrea Arcangeli
  2 siblings, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-02 19:31 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

> You don't need any additional common code abstraction to make virtual
> address 3G+256G to point to physical address 1G as in my example above,
> after that you're free to put the physical ram between 1G and 1G+256M
> into the zone normal of node 1 and the stuff should keep working but
> with zone-normal spread in more than one node.  You just have full
> control on virt_to_page, pci_map_single, __va.  Actually it may be as
> well cleaner to just let the arch define page_address() when
> discontigmem is enabled (instead of hacking on top of __va), that's a
> few liner. (the only true limit you have is on the phys ram above 4G,
> that cannot definitely go into zone-normal regardless if it belongs to a
> direct mapping or not because of pci32 API)

The thing that's special about ZONE_NORMAL is that it's permanently
mapped into kernel virtual address space, so you *cannot* put memory
in other nodes into ZONE_NORMAL without changing the mapping
between physical and virtual memory to a non-1-1 mapping.

No, you don't need to call changing that mapping "CONFIG_NONLINEAR",
but that's basically what the bulk of Dan's patch does, so I think we should 
steal it with impunity ;-) 

M.



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:27                                       ` Daniel Phillips
@ 2002-05-02 19:38                                         ` William Lee Irwin III
  2002-05-02 19:58                                           ` Daniel Phillips
  2002-05-03  6:28                                           ` Andrea Arcangeli
  2002-05-03  6:10                                         ` Andrea Arcangeli
  1 sibling, 2 replies; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 19:38 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrea Arcangeli, Martin J. Bligh, Russell King, linux-kernel

On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
>>> Dropping the loop when discontigmem is enabled is much more interesting
>>> optimization of course.

On Thursday 02 May 2002 21:19, William Lee Irwin III wrote:
>> Absolutely; I'd be very supportive of improvements for this case as well.
>> Many of the systems with the need for discontiguous memory support will
>> also benefit from parallelizations or other methods of avoiding references
>> to remote nodes/zones or iterations over all nodes/zones.

On Thu, May 02, 2002 at 09:27:00PM +0200, Daniel Phillips wrote:
> Which loop in which function are we talking about?

I believe it's just for_each_zone() and for_each_pgdat(), and their
usage in general. I brewed them up to keep things clean (and by and
large they produced code equivalent to what preceded them), but
there's no harm in conditionally defining them. I think it's even
beneficial, since things can use them without concerning themselves
with "will this be inefficient for the common case of UP single-node
x86?" and might also have the potential to remove some other #ifdefs.

In the more general case, avoiding an O(fragments) (or sometimes even
O(mem)) iteration in favor of, say, O(lg(fragments)) or O(cpus)
iteration when fragments is very large would be an excellent optimization.

Andrea, if the definitions of these helpers start getting large, I think
it would help to move them to a separate header. akpm has already done so
with page->flags manipulations in 2.5, and it seems like it wouldn't
do any harm to do something similar in 2.4 either. Does that sound good?

Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02 18:32                                 ` Daniel Phillips
@ 2002-05-02 19:40                                   ` Roman Zippel
  2002-05-02 20:14                                     ` Daniel Phillips
  2002-05-03  6:30                                     ` Andrea Arcangeli
  0 siblings, 2 replies; 165+ messages in thread
From: Roman Zippel @ 2002-05-02 19:40 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Daniel Phillips wrote:

> Patching the kernel how, and where?

Check for example in asm-ppc/page.h the __va/__pa functions.

> > Anyway, I agree with Andrea, that another mapping isn't really needed.
> > Clever use of the mmu should give you almost the same result.
> 
> We *are* making clever use of the mmu in config_nonlinear, it is doing the
> nonlinear kernel virtual mapping for us.  Did you have something more clever
> in mind?

I mean to map the memory where you need it. The physical<->virtual
mapping won't be one to one, but you won't need another abstraction and
the current vm is already basically able to handle it.

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:38                                         ` William Lee Irwin III
@ 2002-05-02 19:58                                           ` Daniel Phillips
  2002-05-03  6:28                                           ` Andrea Arcangeli
  1 sibling, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 19:58 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Thursday 02 May 2002 21:38, William Lee Irwin III wrote:
> In the more general case, avoiding an O(fragments) (or sometimes even
> O(mem)) iteration in favor of, say, O(lg(fragments)) or O(cpus)
> iteration when fragments is very large would be an excellent optimization.

In general, config_nonlinear gets it down to O(NR_ZONES), i.e., O(1), by
eliminating the loops across nodes in the non-numa case.

Yes, teaching for_each_* about the 'list length equals one' case would be
worthwhile.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02 19:40                                   ` Roman Zippel
@ 2002-05-02 20:14                                     ` Daniel Phillips
  2002-05-03  6:34                                       ` Andrea Arcangeli
  2002-05-03  9:33                                       ` Roman Zippel
  2002-05-03  6:30                                     ` Andrea Arcangeli
  1 sibling, 2 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 20:14 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

On Thursday 02 May 2002 21:40, Roman Zippel wrote:
> Daniel Phillips wrote:
> 
> > Patching the kernel how, and where?
> 
> Check for example in asm-ppc/page.h the __va/__pa functions.

OK, by 'patching the kernel' you must mean 'initialize the m68k_memory array'.

The loop you use does have one advantage: it can handle size variation,
versus a shift-lookup strategy.  It's a lot more expensive though, and
these are heavily used operations.

> > > Anyway, I agree with Andrea, that another mapping isn't really needed.
> > > Clever use of the mmu should give you almost the same result.
> > 
> > We *are* making clever use of the mmu in config_nonlinear, it is doing the
> > nonlinear kernel virtual mapping for us.  Did you have something more clever
> > in mind?
> 
> I mean to map the memory where you need it. The physical<->virtual
> mapping won't be one to one, but you won't need another abstraction and
> the current vm is already basically able to handle it.

I'll accept 'not needed for 68K', though I guess config_nonlinear will work
perfectly well for you and be faster than the loops.  However, some of the
problems that config_nonlinear solves cannot be solved by any existing kernel
mechanism.  We've been over the NUMA-Q and mips32 cases in detail, so I won't
reiterate.

Thanks for your input.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 22:20                                       ` Martin J. Bligh
@ 2002-05-02 21:28                                         ` William Lee Irwin III
  2002-05-02 21:52                                           ` Kurt Ferreira
  2002-05-03  6:38                                         ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 21:28 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
>> I believe 64-bit PCI is pretty much taken to be a requirement; if it
>> weren't the 4GB limit would once again apply and we'd be in much
>> trouble, or we'd have to implement a different method of accommodating
>> limited device addressing capabilities and would be in trouble again.

On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote:
> IIRC, there are some funny games you can play with 32bit PCI DMA.
> You're not necessarily restricted to the bottom 4Gb of phys addr space, 
> you're restricted to a 4Gb window, which you can shift by programming 
> a register on the card. Fixing that register to point to a window for the 
> node in question allows you to allocate from a node's pg_data_t and 
> assure DMAable RAM is returned.
> M.


Woops, I forgot about the BAR, thanks. Heck, IIRC you were even the one
who told me about this trick.

Thanks,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 21:28                                         ` William Lee Irwin III
@ 2002-05-02 21:52                                           ` Kurt Ferreira
  2002-05-02 21:55                                             ` William Lee Irwin III
  0 siblings, 1 reply; 165+ messages in thread
From: Kurt Ferreira @ 2002-05-02 21:52 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Martin J. Bligh, linux-kernel

> On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote:
> > IIRC, there are some funny games you can play with 32bit PCI DMA.
> > You're not necessarily restricted to the bottom 4Gb of phys addr space,
> > you're restricted to a 4Gb window, which you can shift by programming
> > a register on the card. Fixing that register to point to a window for the
> > node in question allows you to allocate from a node's pg_data_t and
> > assure DMAable RAM is returned.
> > M.
>
>
> Woops, I forgot about the BAR, thanks. Heck, IIRC you were even the one
> who told me about this trick.
>
By this do you mean setting bits BAR[2:1]=b'10?  Just making sure I get
it.

Thanks
Kurt



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 21:52                                           ` Kurt Ferreira
@ 2002-05-02 21:55                                             ` William Lee Irwin III
  0 siblings, 0 replies; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-02 21:55 UTC (permalink / raw)
  To: Kurt Ferreira; +Cc: Martin J. Bligh, linux-kernel

On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote:
>> Woops, I forgot about the BAR, thanks. Heck, IIRC you were even the one
>> who told me about this trick.

On Thu, May 02, 2002 at 03:52:53PM -0600, Kurt Ferreira wrote:
> By this do you mean setting bits BAR[2:1]=b'10?  Just making sure I get
> it.
> Thanks
> Kurt

I'm not that well-versed in PCI programming; I don't believe I was told
in any greater level of detail than has already crossed this list.

Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:19                                     ` William Lee Irwin III
  2002-05-02 19:27                                       ` Daniel Phillips
@ 2002-05-02 22:20                                       ` Martin J. Bligh
  2002-05-02 21:28                                         ` William Lee Irwin III
  2002-05-03  6:38                                         ` Andrea Arcangeli
  2002-05-03  6:04                                       ` Andrea Arcangeli
  2 siblings, 2 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-02 22:20 UTC (permalink / raw)
  To: William Lee Irwin III, Andrea Arcangeli
  Cc: Daniel Phillips, Russell King, linux-kernel

> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
>> but can you plugin 32bit pci hardware into your 64bit-pci slots, right?
>> If not, and if you're also sure the linux drivers for your hardware are all
>> 64bit-pci capable then you can do the changes regardless of the 4G
>> limit, in such case you can spread the direct mapping all over the whole
>> 64G physical ram, whereever you want, no 4G constraint anymore.
> 
> I believe 64-bit PCI is pretty much taken to be a requirement; if it
> weren't the 4GB limit would once again apply and we'd be in much
> trouble, or we'd have to implement a different method of accommodating
> limited device addressing capabilities and would be in trouble again.

IIRC, there are some funny games you can play with 32bit PCI DMA.
You're not necessarily restricted to the bottom 4Gb of phys addr space, 
you're restricted to a 4Gb window, which you can shift by programming 
a register on the card. Fixing that register to point to a window for the 
node in question allows you to allocate from a node's pg_data_t and 
assure DMAable RAM is returned.
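A rough sketch of the address arithmetic behind that trick. The window-base "register" and its semantics are invented for illustration; real cards differ:

```c
#include <stdint.h>

#define DMA_WINDOW_SIZE	(1ULL << 32)	/* a 4 GB window */

/* Stand-in for the hypothetical per-card window-base register. */
static uint64_t window_base;

/* Point the window at the start of a node's memory (real hardware
 * would impose alignment requirements; we just assume they're met). */
static void set_dma_window(uint64_t node_mem_base)
{
	window_base = node_mem_base;
}

/* Translate a physical address to the 32-bit bus address the card
 * would use, or return -1 if it falls outside the current window.
 * The unsigned subtraction also rejects addresses below the base. */
static int64_t phys_to_bus32(uint64_t phys)
{
	if (phys - window_base >= DMA_WINDOW_SIZE)
		return -1;
	return (int64_t)(phys - window_base);
}
```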

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:28                               ` Gerrit Huizenga
@ 2002-05-02 22:23                                 ` Martin J. Bligh
  2002-05-03  6:20                                 ` Andrea Arcangeli
  1 sibling, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-02 22:23 UTC (permalink / raw)
  To: Gerrit Huizenga, Andrea Arcangeli
  Cc: Daniel Phillips, Russell King, linux-kernel

> I *think* the ranges were typically aligned to 4 GB, although with 8 GB
> in a single node, I don't remember what the mapping layout looked like.
> 
> Which made everything but node 0 into HIGHMEM.
> 
> With the "flat" addressing mode that Martin has been using (the
> dummied down for NT version) everything is squished together.  That
> makes it a bit harder to do node local data structures, although he
> may have enough data from the MPS table to split memory appropriately.

I have enough data, I know which phys mem ranges are in each node,
but I still need to change the virtual <-> physical mapping in order to
spread ZONE_NORMAL around. Pat has already spread the high memory
around into specific pg_data_t's per node.

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 18:57                                   ` Andrea Arcangeli
  2002-05-02 19:08                                     ` Daniel Phillips
@ 2002-05-02 22:39                                     ` Martin J. Bligh
  2002-05-03  7:04                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-02 22:39 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Daniel Phillips, Russell King, linux-kernel

> The difference is that if you use discontigmem you don't clobber the
> common code in any way, there is no "logical/ordinal" abstraction,
> there is no special table, it's all hidden in the arch section, and the
> pgdat you need them anyways to allocate from affine memory with numa.

I *want* the logical / ordinal abstraction. That's not a negative thing -
it reduces the number of complicated things I have to think about,
allowing me to think more clearly, and write correct code ;-)

Not having a multitude of zones to balance in the normal discontigmem
case also seems like a powerful argument to me ...

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  0:20             ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
  2002-05-01  1:35               ` Daniel Phillips
  2002-05-02  1:01               ` Andrea Arcangeli
@ 2002-05-02 23:05               ` Daniel Phillips
  2002-05-03  0:05                 ` William Lee Irwin III
  2002-05-03 23:52               ` David Mosberger
  3 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 23:05 UTC (permalink / raw)
  To: Anton Blanchard, Andrea Arcangeli
  Cc: Russell King, linux-kernel, Jesse Barnes

On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
> > so ia64 is one of those archs with a ram layout with huge holes in the
> > middle of the ram of the nodes? I'd be curious to know what's the
> > hardware advantage of designing the ram layout in such a way, compared
> > to all other numa archs that I deal with. Also if you know other archs
> > with huge holes in the middle of the ram of the nodes I'd be curious to
> > know about them too. thanks for the interesting info!
> 
> From arch/ppc64/kernel/iSeries_setup.c:
> 
>  * The iSeries may have very large memories ( > 128 GB ) and a partition
>  * may get memory in "chunks" that may be anywhere in the 2**52 real
>  * address space.  The chunks are 256K in size.
> 
> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
> solution to this problem.

Hmm, I just re-read your numbers above.  Supposing you have 256 GB of
'installed' memory, divided into 256K chunks at random places in the 52
bit address space, a hash table with 1M entries could map all that
physical memory.  You'd need 16 bytes or so per hash table entry, making
the table 16MB in size.  This would be about .006% of memory.
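A quick check of that arithmetic, using the assumed figures (256 GB installed, 256 KB chunks, a generous 16 bytes per entry):

```c
/* Back-of-envelope sizing for the chunk hash discussed above.  All
 * figures are the assumed ones from the mail, not measurements. */

static unsigned long long hash_entries(unsigned long long installed,
				       unsigned long long chunk_size)
{
	return installed / chunk_size;
}

static unsigned long long hash_table_bytes(unsigned long long installed,
					   unsigned long long chunk_size,
					   unsigned long long entry_size)
{
	return hash_entries(installed, chunk_size) * entry_size;
}
```

256 GB / 256 KB gives 1M entries; at 16 bytes each that is a 16 MB table, about 16 MB / 256 GB = .006% of memory.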

More-or-less equivalently, a tree could be used, with the tradeoff being
a little better locality of reference vs more search steps.  The hash
structure can also be tweaked to improve locality by making each hash
entry map several adjacent memory chunks, and hoping that the chunks tend
to occur in groups, which they most probably do.

I'm offering the hash table, combined with config_nonlinear as a generic
solution.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 16:06                           ` Andrea Arcangeli
  2002-05-02 16:10                             ` Martin J. Bligh
@ 2002-05-02 23:42                             ` Daniel Phillips
  2002-05-03  7:45                               ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-02 23:42 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel

On Thursday 02 May 2002 18:06, Andrea Arcangeli wrote:
> On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote:
> > On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote:
> > > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote:
> > > > At the moment I use the contig memory model (so we only use discontig for
> > > > NUMA support) but I may need to change that in the future.
> > > 
> > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into
> > > the current discontigmem-numa model too as far I can see.
> > 
> > No it doesn't.  The config_discontigmem model forces all zone_normal memory
> > to be on node zero, so all the remaining nodes can only have highmem locally.
> 
> You can trivially map the phys mem between 1G and 1G+256M to be in a
> direct mapping between 3G+256M and 3G+512M, then you can put such 256M
> at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.

Andrea, I'm re-reading this and I'm guilty of misreading your 3G+512M, what
you meant is PAGE_OFFSET+512M.  Yes, in fact this is exactly what
config_nonlinear does.  Config_discontigmem does not do this, not without
your 'trivial map', and that's all config_nonlinear is: a trivial map done
in an organized way.  This same trivial mapping is capable of replacing all
known non-numa uses of config_discontigmem.
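A minimal sketch of such an organized trivial map, using Andrea's example layout (phys 1G..1G+256M appearing at PAGE_OFFSET+256M..PAGE_OFFSET+512M). The 256 MB section size, the table names, and the populated entries are illustrative only:

```c
/* Two small tables translate between physical sections and logical
 * (virtually contiguous) sections.  Hypothetical layout. */
#define SECTION_SHIFT	28		/* 256 MB sections */
#define PAGE_OFFSET	0xc0000000UL

static int phys_to_logical[16] = {	/* indexed by phys >> SECTION_SHIFT */
	[0] = 0,	/* node 0: phys 0..256M      -> logical section 0 */
	[4] = 1,	/* node 1: phys 1G..1G+256M  -> logical section 1 */
};

static int logical_to_phys[16] = {	/* the inverse table */
	[0] = 0,
	[1] = 4,
};

static unsigned long pa_to_va(unsigned long pa)
{
	int lsec = phys_to_logical[pa >> SECTION_SHIFT];

	return PAGE_OFFSET + ((unsigned long)lsec << SECTION_SHIFT)
			   + (pa & ((1UL << SECTION_SHIFT) - 1));
}

static unsigned long va_to_pa(unsigned long va)
{
	unsigned long off = va - PAGE_OFFSET;
	int psec = logical_to_phys[off >> SECTION_SHIFT];

	return ((unsigned long)psec << SECTION_SHIFT)
	       + (off & ((1UL << SECTION_SHIFT) - 1));
}
```

With those tables, phys 1G lands at PAGE_OFFSET+256M even though 768M..1G of physical space is a hole, which is the whole trick.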

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 23:05               ` Daniel Phillips
@ 2002-05-03  0:05                 ` William Lee Irwin III
  2002-05-03  1:19                   ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-03  0:05 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Anton Blanchard, Andrea Arcangeli, Russell King, linux-kernel,
	Jesse Barnes

On Thursday 02 May 2002 02:20, Anton Blanchard wrote:
>> From arch/ppc64/kernel/iSeries_setup.c:
>>  * The iSeries may have very large memories ( > 128 GB ) and a partition
>>  * may get memory in "chunks" that may be anywhere in the 2**52 real
>>  * address space.  The chunks are 256K in size.
>> Also check out CONFIG_MSCHUNKS code and see why I'd love to see a generic
>> solution to this problem.

On Fri, May 03, 2002 at 01:05:45AM +0200, Daniel Phillips wrote:
> Hmm, I just re-read your numbers above.  Supposing you have 256 GB of
> 'installed' memory, divided into 256K chunks at random places in the 52
> bit address space, a hash table with 1M entries could map all that
> physical memory.  You'd need 16 bytes or so per hash table entry, making
> the table 16MB in size.  This would be about .0006% of memory.

Doh! I made all that noise about "contiguously allocated" and the
relaxation of the contiguous allocation requirement on the aggregate
was the whole reason I liked trees so much! Regardless, if there's
virtual contiguity the table can work, and what can I say, it's not my
patch, and there probably isn't a real difference given that your
ratio to memory size is probably small enough to cope.


On Fri, May 03, 2002 at 01:05:45AM +0200, Daniel Phillips wrote:
> More-or-less equivalently, a tree could be used, with the tradeoff being
> a little better locality of reference vs more search steps.  The hash
> structure can also be tweaked to improve locality by making each hash
> entry map several adjacent memory chunks, and hoping that the chunks tend
> to occur in groups, which they most probably do.
> I'm offering the hash table, combined with config_nonlinear as a generic
> solution.

Is the virtual remapping for virtual contiguity available at the time
this remapping table is set up? A 1M-entry table is larger than the
largest available fragment of physically contiguous memory even at
1B/entry. If it's used to direct the virtual remapping you might need
to perform some arch-specific bootstrapping phases. Also, what is the
recourse of a boot-time allocated table when it overflows due to the
onlining of sufficient physical memory? Or are there pointer links
within the table entries so as to provide collision chains? If so,
then the memory requirements are even larger... If you limit the size
of the table to consume an entire hypervisor-allocated memory fragment
would that not require boot-time allocation of a fresh chunk from the
hypervisor and virtually mapping the new chunk? How do you know what
the size of the table should be if the number of chunks varies dramatically?


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  0:05                 ` William Lee Irwin III
@ 2002-05-03  1:19                   ` Daniel Phillips
  2002-05-03 19:47                     ` Dave Engebretsen
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-03  1:19 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Anton Blanchard, Andrea Arcangeli, linux-kernel, Jesse Barnes

On Friday 03 May 2002 02:05, William Lee Irwin III wrote:
> On Fri, May 03, 2002 at 01:05:45AM +0200, Daniel Phillips wrote:
> > More-or-less equivalently, a tree could be used, with the tradeoff being
> > a little better locality of reference vs more search steps.  The hash
> > structure can also be tweaked to improve locality by making each hash
> > entry map several adjacent memory chunks, and hoping that the chunks tend
> > to occur in groups, which they most probably do.
> > I'm offering the hash table, combined with config_nonlinear as a generic
> > solution.
> 
> Is the virtual remapping for virtual contiguity available at the time
> this remapping table is set up? A 1M-entry table is larger than the
> largest available fragment of physically contiguous memory even at
> 1B/entry. If it's used to direct the virtual remapping you might need
> to perform some arch-specific bootstrapping phases.

Interesting point.  Fortunately, the logical_to_phys table doesn't have to
be a hash, making it considerably smaller.  Then we get to the interesting
part: allocating the phys_to_logical hash table.

The boot loader must have provided at least some contiguous physical
memory in order to load the kernel, the compressed disk image and give
us a little working memory.  (For practical purposes, we're most likely to
have been provided with a full gig, or whatever is appropriate according
to the mem= command line setting, but let's pretend it's a lot less than
that.)  Now, the first thing we need to do is fill in enough of the
vsection table to allocate the table itself.  Fortunately, the bottom part
of the table is the part we need to fill in, and we surely have enough
memory to do that.  We just have to be sure that the process of filling
it in doesn't require any bootmem allocations, which is not so hard -
the existing memory initialization code already has to obey that
requirement.

Naturally, during initialization of the hash table, we want to be sure
not to perform any phys_to_logical translations, as would be required to 
read values from the page tables during swap-out for example.  Probably
there's already no possibility of that, but it needs a comment at least.

I can't provide any more details than that, because I'm not familiar
with the way the iSeries boots.  Anton is the man there.

> Also, what is the
> recourse of a boot-time allocated table when it overflows due to the
> onlining of sufficient physical memory?

We ought to have some clue about the maximum number of physical memory
chunks available to us.  I doubt *every* partition is going to be
provided 256 GB of memory.  In fact, the real amount we need will be
considerably less, and the phys_to_logical table will be smaller than
16 MB, say, 1 MB.  Just allocate the whole thing and be done with it.

> Or are there pointer links
> within the table entries so as to provide collision chains?

For this one I'd think a classic, nonlist, hash table is the way to go.

At 16 bytes, I overestimated the per-entry size, really 8 bytes is more
realistic.  We need 34 bits for the key field (52 bit physical range,
less 18 bits chunk size) and considerably less than 32 bits for the
value field (a logical section) so it works out nicely.
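One way those 8-byte entries might be packed, purely as a sketch: a 34-bit chunk key and a logical chunk number in each slot of an open-addressed table. The names, the probe scheme, and the bit layout are all invented:

```c
#include <stdint.h>

#define CHUNK_SHIFT	18			/* 256 KB chunks */
#define TABLE_ORDER	20			/* 1M slots */
#define TABLE_SLOTS	(1UL << TABLE_ORDER)
/* All-ones is reserved as the empty marker, so the one entry whose key
 * and value are both all-ones cannot be stored; fine for a sketch. */
#define EMPTY		UINT64_MAX

static uint64_t table[TABLE_SLOTS];	/* key in bits 63..30, value below */

static void chunk_table_init(void)
{
	unsigned long i;

	for (i = 0; i < TABLE_SLOTS; i++)
		table[i] = EMPTY;
}

/* Fibonacci-style multiplicative hash down to a slot index. */
static unsigned long hash_key(uint64_t key)
{
	return (key * 0x9e3779b97f4a7c15ULL) >> (64 - TABLE_ORDER);
}

static void chunk_insert(uint64_t phys, uint32_t logical)
{
	uint64_t key = phys >> CHUNK_SHIFT;	/* 52 - 18 = 34-bit key */
	unsigned long slot = hash_key(key);

	while (table[slot] != EMPTY)		/* linear probing */
		slot = (slot + 1) & (TABLE_SLOTS - 1);
	table[slot] = key << 30 | logical;
}

static int64_t chunk_lookup(uint64_t phys)
{
	uint64_t key = phys >> CHUNK_SHIFT;
	unsigned long slot = hash_key(key);

	while (table[slot] != EMPTY) {
		if (table[slot] >> 30 == key)
			return table[slot] & ((1UL << 30) - 1);
		slot = (slot + 1) & (TABLE_SLOTS - 1);
	}
	return -1;				/* chunk not present */
}
```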

> If so,
> then the memory requirements are even larger... If you limit the size
> of the table to consume an entire hypervisor-allocated memory fragment
> would that not require boot-time allocation of a fresh chunk from the
> hypervisor and virtually mapping the new chunk?

I think that the bootstrapping method described above is sufficiently
simple and robust to obviate this requirement.

> How do you know what
> the size of the table should be if the number of chunks varies
> dramatically?

The most obvious and practical approach is to have the boot loader tell
us, we allocate the maximum size needed, and won't worry about that
again.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:08                                     ` Daniel Phillips
@ 2002-05-03  5:15                                       ` Andrea Arcangeli
  2002-05-05 23:54                                         ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  5:15 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote:
> > On Thu, May 02, 2002 at 12:31:52PM -0700, Martin J. Bligh wrote:
> > > between physical to virtual memory to a non 1-1 mapping.
> > 
> > correct. The direct mapping is nothing magic, it's like a big static
> > kmap area.  Everybody is required to use
> > virt_to_page/page_address/pci_map_single/... to switch between virtual
> > address and mem_map anyways (thanks to the discontigous mem_map), so you
> > can use this property by making discontigous the virtual space as well,
> > not only the mem_map.  discontigmem basically just allows that.
> 
> And what if you don't have enough virtual space to fit all the memory you

ZONE_NORMAL is by definition limited by the direct mapping size, so if
you don't have enough virtual space you cannot enlarge the zone_normal
anyways. If you need more virtual space you can only do things like
CONFIG_2G.

> need, plus the holes?  Config_nonlinear handles that, config_discontig
> doesn't.
> 
> > > No, you don't need to call changing that mapping "CONFIG_NONLINEAR",
> > > but that's basically what the bulk of Dan's patch does, so I think we should 
> > > steal it with impunity ;-) 
> > 
> > The difference is that if you use discontigmem you don't clobber the
> > common code in any way,
> 
> First that's wrong.  Look at _alloc_pages and tell me that config_discontig
> doesn't impact the common code (in fact, it adds two extra subroutine
> calls, including two loops, to every alloc_pages call).

there are no two subroutines, check -aa. And the whole point is that we
need a topology description of the machine for numa, nonlinear or not;
what you're talking about is the whole numa concept in 2.4, and it is all
but superfluous, while the nonlinear implications in common code are
superfluous just to provide ZONE_NORMAL in more than one node in numa-q.

> 
> Secondly, config_nonlinear does not clobber the common code.  If it does,
> please show me where.
> 
> When config_nonlinear is not enabled, suitable stubs are provided to make it
> transparent.

it's the stubs that are visible to the common code and that are
superfluous.

> > Actually the same mmu technique can be used to coalesce in virtual
> > memory the discontigous chunks of iSeries, then you left the lookup in
> > the tree to resolve from mem_map to the right virtual address and from
> > the right virtual address back to mem_map. (and you left DISCONTIGMEM
> > disabled) I think it should be possible.
> 
> So you're proposing a new patch?  Have you chosen a name for it?  How
> about 'config_nonlinear'? ;-)

They're called CONFIG_MULTIQUAD and CONFIG_MSCHUNKS.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:19                                     ` William Lee Irwin III
  2002-05-02 19:27                                       ` Daniel Phillips
  2002-05-02 22:20                                       ` Martin J. Bligh
@ 2002-05-03  6:04                                       ` Andrea Arcangeli
  2002-05-03  6:33                                         ` Martin J. Bligh
  2002-05-03  9:24                                         ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III
  2 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:04 UTC (permalink / raw)
  To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips,
	Russell King, linux-kernel

On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote:
> On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
> >> Being unable to have any ZONE_NORMAL above 4GB allows no change at all.
> 
> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > No change if your first node maps the whole first 4G of physical address
> > space, but in such case nonlinear cannot help you in any way anyways.
> > The fact you can make no change at all has only to do with the fact
> > GFP_KERNEL must return memory accessible from a pci32 device.
> 
> Without relaxing this invariant for this architecture there is no hope
> that NUMA-Q can ever be efficiently operated by this kernel.

I don't think it make sense to attempt breaking GFP_KERNEL semantics in
2.4 but for 2.5 we can change stuff so that all non-DMA users can ask
for ZONE_NORMAL that will be backed by physical memory over 4G (that's
fine for all inodes, dcache, files, buffer heads, kiobufs, vmas and many other
in-core data structures never accessed by hardware via DMA, it's ok even
for the buffer cache because the lowlevel layer has the bounce buffer
layer that is smart enough to understand when bounce buffers are needed
on top of the physical address space pagecache).

> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > I think most configurations have more than one node mapped into the
> > first 4G, and so in those configurations you can do changes and spread
> > the direct mapping across all the nodes mapped in the first 4G phys.
> 
> These would be partially-populated nodes. There may be up to 16 nodes.
> Some unusual management of the interrupt controllers is required to get
> the last 4 cpus. Those who know how tend to disavow the knowledge. =)
> 
> 
> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > the fact you can or you can't have something to change with discontigmem
> > or nonlinear, it's all about pci32.
> 
> Artificially tying together the device-addressibility of memory and
> virtual addressibility of memory is a fundamental design decision which
> seems to behave poorly for NUMA-Q, though general it seems to work okay.

Yes, you know since a few months ago we weren't even capable of skipping
the bounce buffers for the memory between 1G and 4G and for the memory
above 4G with pci-64, now we can, and in the future we can be more
fine-grained if there's the need to.

Again note that nonlinear can do nothing to help you there, the
limitation you deal with is pci32 and the GFP API, not at all about
discontigmem or nonlinear. we basically changed topic from here.

> On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
> >> 32-bit PCI is not used on NUMA-Q AFAIK.
> 
> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > but can you plugin 32bit pci hardware into your 64bit-pci slots, right?
> > If not, and if you're also sure the linux drivers for your hardware are all
> > 64bit-pci capable then you can do the changes regardless of the 4G
> > limit, in such case you can spread the direct mapping all over the whole
> > 64G physical ram, whereever you want, no 4G constraint anymore.
> 
> I believe 64-bit PCI is pretty much taken to be a requirement; if it
> weren't the 4GB limit would once again apply and we'd be in much
> trouble, or we'd have to implement a different method of accommodating
> limited device addressing capabilities and would be in trouble again.

If you're sure all the hw device are pci64 and the device drivers are
using DAC to submit the bus addresses, then you're just fine and you can
use pages over 4G for the ZONE_NORMAL too. and yes, if you add an IOMMU
unit like the GART then you can fill the zone_normal with phys pages
over 4G too because then the bus address won't be an identity anymore
with the phys addr, I just assumed it wasn't the case because most x86
doesn't have that capability besides the GART that isn't currently used
by the kernel as an iommu but is instead left for building contiguous
ram for the AGP cards (and also not all x86 have an AGP so we couldn't
use it by default on x86 even assuming the graphics card doesn't need
it).

> On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
> >> So long as zones are physically contiguous and __va() does what its
> 
> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > zones remains physically contigous, it's the virtual address returned by
> > page_address that changes. Also the kmap header will need some
> > modification, you should always check for PageHIGHMEM in all places to
> > know if you must kmap or not, that's a few liner.
> 
> I've not been using the generic page_address() in conjunction with
> highmem, but this sounds like a very natural thing to do when the need
> to do so arises; arranging for storage of the virtual address sounds
> trickier, though doable. I'm not sure if mainline would want it, and
> I don't feel a pressing need to implement it yet, but then again, I've
not been parked in front of a 64GB x86 machine yet...

Personally I always had the hope to never need to see a 64G 32bit
machine 8). I mean, even if you manage to solve the pci32bit problem
with GFP_KERNEL, then you still have to share 800M across 16 nodes with
4G each. So by striping zone_normal over all the nodes to have numa-local
data structures with fast slab allocations will get at most 50mbyte per
node of which around 90% of this 50M will be eaten by the mem_map array
for those 50M plus the other 4G-50M. So at the end you'll be left with
only say 5/10M per node of zone_normal that will be filled immediately as
soon as you start reading some directory from disk. A few hundred mbyte
of vfs cache is the minimum for those machines, this doesn't even take
into account bh headers for the pagecache, physical address space
pagecache for the buffercache, kiobufs, vma, etc... Even ignoring the fact
it's NUMA a 64G machine will boot fine (thanks to your 2.4.19pre work
that shrinks each page structure by some bytes) but still it will work well
only depending on what you're doing, for example it's fine for number
crunching but it will be bad for most other important workloads. And
this is only because of the 32bit address space, it doesn't have anything
to do with nonlinear/numa/discontigmem or pci32.  It's just that 1G of
virtual address space reserved for kernel is too low to handle
efficiently 64G of physical ram, this is a fact and you can't workaround
it. every workaround will add a penality here or there. The workaround
you will be mostly forced to take is CONFIG_2G, after that the userspace
will be limited to less than 1G per task returned by malloc (from over
1G to below 2G) and that will be a showstopper again for most userspace
apps that wants to run on a 64G box like a DBMS that wants almost 2G of
SGA. I'm glad we're finally going to migrate all to 64bit, just in time
not to see a relevant number of 32bit 64G boxes.

And of course, I don't mean a 64G 32bit machine doesn't make sense, it
can make perfect sense for a certain number of users with specific needs
of lots of ram and with very few kernel data structures, if you do that
that's because you know what you're doing and you know you can tweak
linux for your own workload and that's fine as far it's not supposed to
be a general purpose machine (with general purpose I mean pretending to
run a DBMS with a 1.7G SGA requiring CONFIG_3G, plus a cvs [or bk if
you're a bk fan] server dealing with huge vfs metadata at the same time,
for instance the cvs workload would run faster booting with mem=32G :)

> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > BTW, about the pgdat loops optimizations, you misunderstood what I meant
> > in some previous email: with "removing them" I didn't mean to remove
> > them in the discontigmem case, that would have to be done case by case;
> > with removing them I meant to remove them only for mainline 2.4.19-pre7
> > when the kernel is compiled for an x86 target, as 99% of the userbase
> > uses it. A discontigmem using nonlinear also doesn't need to loop. It's
> > a 1-branch removal optimization (it doesn't decrease the complexity of
> > the algorithm for discontigmem enabled). It's all a function of
> > #ifndef CONFIG_DISCONTIGMEM.
> 
> From my point of view this would be totally uncontroversial. Some arch
> maintainers might want a different #ifdef condition but it's fairly
> trivial to adjust that to their needs when they speak up.

Yep. Nobody did it, probably just to keep the code clean and because it
would only remove a branch that wouldn't be measurable in any benchmark.
In fact I'm not doing it either; I raised it just as a more
worthwhile improvement compared to sharing a cacheline between the last
page of a mem_map and the first page of the mem_map in the next pgdat
(again assuming a sane number of discontig chunks, say 16 with 32/64G of
ram globally, not hundreds of chunks).

> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > Dropping the loop when discontigmem is enabled is much more interesting
> > optimization of course.
> > Andrea
> 
> Absolutely; I'd be very supportive of improvements for this case as well.
> Many of the systems with the need for discontiguous memory support will
> also benefit from parallelizations or other methods of avoiding references
> to remote nodes/zones or iterations over all nodes/zones.

I would suggest starting on a case-by-case basis looking at the profiling,
so we make more complex only what is worth optimizing.  For example
I guess nr_free_buffer_pages() will show up because it is used quite
frequently.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:22                                     ` Daniel Phillips
@ 2002-05-03  6:06                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:06 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: William Lee Irwin III, Martin J. Bligh, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:22:07PM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 20:41, Andrea Arcangeli wrote:
> > On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
> > > On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
> > > >> Even with 64 bit DMA, the real problem is breaking the assumption
> > > >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
> > > >> That's 90% of the difficulty of what Dan's doing anyway, as I
> > > >> see it.
> > > 
> > > On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote:
> > > > control on virt_to_page, pci_map_single, __va.  Actually it may be as
> > > > well cleaner to just let the arch define page_address() when
> > > > discontigmem is enabled (instead of hacking on top of __va), that's a
> > > > few liner. (the only true limit you have is on the phys ram above 4G,
> > > > that cannot definitely go into zone-normal regardless if it belongs to a
> > > > direct mapping or not because of pci32 API)
> > > > Andrea
> > > 
> > > Being unable to have any ZONE_NORMAL above 4GB allows no change at all.
> > 
> > No change if your first node maps the whole first 4G of physical address
> > space, but in such case nonlinear cannot help you in any way anyways.
> 
> You *still don't have a clue what config_nonlinear does*.
> 
> It doesn't matter if the first 4G of physical memory belongs to node zero.
> Config_nonlinear allows you to map only part of that to the kernel virtual
> space, and the rest would be mapped to highmem.  The next node will map part
> of its local memory (perhaps the next 4 gig of physical memory) to a different
> part of the kernel virtual space, and so on, so that in the end, all nodes
> have at least *some* zone_normal memory.

You are the one that has no clue what I'm talking about. Go ahead, do
that and you'll see the corruption you get after the first vmalloc32 or
similar.

This has nothing to do with nonlinear or with discontigmem/numa.
This is all about the GFP kernel API with pci32.

> 
> Do you now see why config_nonlinear is needed in this case?  Are you
> willing to recognize the possibility that you might have missed some other
> cases where config_nonlinear is needed, and config_discontigmem won't do
> the job?
> 
> -- 
> Daniel


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:27                                       ` Daniel Phillips
  2002-05-02 19:38                                         ` William Lee Irwin III
@ 2002-05-03  6:10                                         ` Andrea Arcangeli
  1 sibling, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:10 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: William Lee Irwin III, Martin J. Bligh, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:27:00PM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 21:19, William Lee Irwin III wrote:
> > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> > > Dropping the loop when discontigmem is enabled is much more interesting
> > > optimization of course.
> > > Andrea
> > 
> > Absolutely; I'd be very supportive of improvements for this case as well.
> > Many of the systems with the need for discontiguous memory support will
> > also benefit from parallelizations or other methods of avoiding references
> > to remote nodes/zones or iterations over all nodes/zones.
> 
> Which loop in which function are we talking about?

The pgdat loops. For example, this one could be optimized for the 99% of the userbase to:

	do {
                zonelist_t *zonelist = pgdat->node_zonelists + (GFP_USER & GFP_ZONEMASK);
                zone_t **zonep = zonelist->zones;
                zone_t *zone;

                for (zone = *zonep++; zone; zone = *zonep++) {
                        unsigned long size = zone->size;
                        unsigned long high = zone->pages_high;
                        if (size > high)
                                sum += size - high;
                }
	
#ifdef CONFIG_DISCONTIGMEM
		pgdat = pgdat->node_next;
	} while (pgdat);
#else
	} while (0);
#endif

so allowing the compiler to remove a branch and a few instructions from
the asm. But it would be a micro-optimization not visible in benchmarks,
so I'm not actually suggesting it, mostly for the sake of code clarity. Branch
prediction should also get it right if it starts to be executed
frequently (hopefully the asm is large enough that the predictor doesn't
get confused by the inner loop that is quite near).
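Wrapped in a macro, the same conditional becomes reusable at every pgdat loop site; a minimal sketch with mock types (the real pg_data_t and pgdat list live in the kernel headers, the names here are invented):

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the kernel's pg_data_t list, just enough to show the macro. */
typedef struct mock_pgdat {
	struct mock_pgdat *node_next;
	unsigned long nr_pages;
} mock_pgdat_t;

#ifdef CONFIG_DISCONTIGMEM
/* real case: walk the whole pgdat list */
#define for_each_pgdat(p, list) \
	for ((p) = (list); (p); (p) = (p)->node_next)
#else
/* single contiguous node: the back-branch is statically dead, so the
 * compiler drops it -- the micro-optimization discussed above */
#define for_each_pgdat(p, list) \
	for ((p) = (list); (p); (p) = NULL)
#endif

static unsigned long total_pages(mock_pgdat_t *list)
{
	mock_pgdat_t *pgdat;
	unsigned long sum = 0;

	for_each_pgdat(pgdat, list)
		sum += pgdat->nr_pages;
	return sum;
}
```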

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:28                               ` Gerrit Huizenga
  2002-05-02 22:23                                 ` Martin J. Bligh
@ 2002-05-03  6:20                                 ` Andrea Arcangeli
  2002-05-03  6:39                                   ` Martin J. Bligh
  1 sibling, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:20 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 12:28:52PM -0700, Gerrit Huizenga wrote:
> In message <20020502201043.L11414@dualathlon.random>, Andrea Arcangeli writes:
> > On Thu, May 02, 2002 at 09:58:02AM -0700, Gerrit Huizenga wrote:
> > > In message <3971861785.1020330424@[10.10.2.3]>, "Martin J. Bligh" writes:
> > > > > With numa-q there's a 512M hole in each node IIRC. that's fine
> > > > > configuration, similar to the wildfire btw.
> > > > 
> > > > There's 2 different memory models - the NT mode we use currently
> > > > is contiguous, the PTX mode is discontiguous. I don't think it's
> > > > as simple as a 512Mb fixed size hole, though I'd have to look it
> > > > up to be sure.
> > > 
> > > No - it definitely isn't as simple as a 512 MB hole.  Depends on how much
> > 
> > I meant that as an example; I recall that was a valid config, 512M of ram
> > and a 512M hole, then the next node 512M ram and 512M hole, etc. Of course it
> > must be possible to vary the mem size if you want more or less ram in
> > each node, but still it doesn't generate a problematic layout for
> > discontigmem (i.e. not 256 discontiguous chunks or something of that
> > order).
> 
> I *think* the ranges were typically aligned to 4 GB, although with 8 GB
> in a single node, I don't remember what the mapping layout looked like.
> 
> Which made everything but node 0 into HIGHMEM.

ok.

> 
> With the "flat" addressing mode that Martin has been using (the
> dummied down for NT version) everything is squished together.  That
> makes it a bit harder to do node local data structures, although he
> may have enough data from the MPS table to split memory appropriately.

sure, the only issue is the API that the hardware provides to advertise
the start/end of the memory for each node. It doesn't matter if it's
squashed or not as long as you still know the start/end of the phys ram
per node. It also won't make any difference with nonlinear or
discontigmem because you need to fill the pgdat anyways to enable the
numa heuristics (node-affine-allocations being the most sensible etc..).

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 19:38                                         ` William Lee Irwin III
  2002-05-02 19:58                                           ` Daniel Phillips
@ 2002-05-03  6:28                                           ` Andrea Arcangeli
  1 sibling, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:28 UTC (permalink / raw)
  To: William Lee Irwin III, Daniel Phillips, Martin J. Bligh,
	Russell King, linux-kernel

On Thu, May 02, 2002 at 12:38:47PM -0700, William Lee Irwin III wrote:
> On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> >>> Dropping the loop when discontigmem is enabled is much more interesting
> >>> optimization of course.
> 
> On Thursday 02 May 2002 21:19, William Lee Irwin III wrote:
> >> Absolutely; I'd be very supportive of improvements for this case as well.
> >> Many of the systems with the need for discontiguous memory support will
> >> also benefit from parallelizations or other methods of avoiding references
> >> to remote nodes/zones or iterations over all nodes/zones.
> 
> On Thu, May 02, 2002 at 09:27:00PM +0200, Daniel Phillips wrote:
> > Which loop in which function are we talking about?
> 
> I believe it's just for_each_zone() and for_each_pgdat(), and their
> usage in general. I brewed them up to keep things clean (and by and
> large they produced largely equivalent code to what preceded it), but
> there's no harm in conditionally defining them. I think it's even
> beneficial, since things can use them without concerning themselves
> about "will this be inefficient for the common case of UP single-node
> x86?" and might also have the potential to remove some other #ifdefs.
> 
> In the more general case, avoiding an O(fragments) (or sometimes even
> O(mem)) iteration in favor of, say, O(lg(fragments)) or O(cpus)
> iteration when fragments is very large would be an excellent optimization.
> 
> Andrea, if the definitions of these helpers start getting large, I think
> it would help to move them to a separate header. akpm has already done so

sure for 2.5. However for 2.4 I'm still not very excited about those
optimizations getting in now, at least until some of the other more
important pending patches are included. I don't care if those
optimizations are obvious or not, it's just more work for Marcelo and I
prefer him to spend all his cpu time on things that matter, not on
unnecessary cleanups (at least while there are pending things that
matter; if there aren't, it's fine to work on
microoptimizations/cleanups). Nevertheless it would generate rejects
and more work for me too, but I have ways to reduce my overhead,
so my reject-solving work would really be my last concern.

> with page->flags manipulations in 2.5, and it seems like it wouldn't
> do any harm to do something similar in 2.4 either. Does that sound good?
> 
> Cheers,
> Bill


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02 19:40                                   ` Roman Zippel
  2002-05-02 20:14                                     ` Daniel Phillips
@ 2002-05-03  6:30                                     ` Andrea Arcangeli
  1 sibling, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:30 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Daniel Phillips, Ralf Baechle, Russell King, linux-kernel

On Thu, May 02, 2002 at 09:40:48PM +0200, Roman Zippel wrote:
> mapping won't be one to one, but you won't need another abstraction and
> the current vm is already basically able to handle it.

this was basically my whole point, agreed.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  6:04                                       ` Andrea Arcangeli
@ 2002-05-03  6:33                                         ` Martin J. Bligh
  2002-05-03  8:38                                           ` Andrea Arcangeli
  2002-05-03  9:24                                         ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III
  1 sibling, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03  6:33 UTC (permalink / raw)
  To: Andrea Arcangeli, William Lee Irwin III, Daniel Phillips,
	Russell King, linux-kernel

FYI, whilst we've mentioned NUMA-Q in these arguments, much of this
is generic to any 32 bit NUMA machine, the new x440 for example.

> I don't think it make sense to attempt breaking GFP_KERNEL semantics in
> 2.4 but for 2.5 we can change stuff so that all non-DMA users can ask
> for ZONE_NORMAL that will be backed by physical memory over 4G (that's
> fine for all inodes,dcache,files,bufferheader,kiobuf,vma and many other
> in-core data structures never accessed by hardware via DMA, it's ok even
> for the buffer cache because the lowlevel layer has the bounce buffer
> layer that is smart enough to understand when bounce buffers are needed
> on top of the physical address space pagecache).

Sounds good. Hopefully we can kill off ZONE_DMA for the old ISA stuff
at the same time except as a backwards compatibility config option that
you'd have to explicitly enable ...
 
> Again note that nonlinear can do nothing to help you there, the
> limitation you deal with is pci32 and the GFP API, not at all about
> discontigmem or nonlinear. we basically changed topic from here.

There are several different problems we seem to be discussing here:

1. Cleaning up discontig mem alloc for UMA machines.
2. Having a non-contiguous ZONE_NORMAL across NUMA nodes.
3. DMA addressability of memory.

(and probably others I've missed). Nonlinear is more about the
first two, and not the third, at least to my mind.

> Personally I always had the hope of never needing to see a 64G 32bit
> machine 8). I mean, even if you manage to solve the pci32bit problem
> with GFP_KERNEL, then you still have to share 800M across 16 nodes with
> 4G each. So striping zone_normal over all the nodes to have numa-local
> data structures with fast slab allocations will get at most 50mbyte per
> node, of which around 90% will be eaten by the mem_map array
> for those 50M plus the other 4G-50M. 

You're assuming we're always going to globally map every struct page
into kernel address space forever. That's a fundamental scalability
problem for a 32 bit machine, and I think we need to fix it. If we
map only the pages the process is using into the user-kernel address
space area, rather than the global KVA, we get rid of some of these
problems. Not that that plan doesn't have its own problems, but ... ;-)

Bear in mind that we've successfully used 64Gb of ram in a 32 bit 
virtual addr space a long time ago with Dynix/PTX.

> So at the end you'll be left with
> only say 5/10M per node of zone_normal that will be filled immediately as
> soon as you start reading some directory from disk. a few hundred mbyte
> of vfs cache is the minimum for those machines, this doesn't even take
> into account bh headers for the pagecache, physical address space
> pagecache for the buffercache, kiobufs, vma, etc... 

Bufferheads are another huge problem right now. For a P4 machine, they
round off to 128 bytes per data structure. I was just looking at a 16Gb
machine that had completely wedged itself by filling ZONE_NORMAL with 
unfreeable overhead - 440Mb of bufferheads alone. Globally mapping the
bufferheads is probably another thing that'll have to go.
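The numbers are easy to reproduce (a sketch; one buffer_head per 4K block and the rounded 128-byte P4 size mentioned above are the assumptions):

```c
#include <assert.h>

/* ZONE_NORMAL cost of the buffer_heads covering `cached_bytes` of data,
 * one buffer_head per `block_size` block. */
static unsigned long long bh_overhead(unsigned long long cached_bytes,
                                      unsigned long block_size,
                                      unsigned long sizeof_bh)
{
	return (cached_bytes / block_size) * sizeof_bh;
}

/* Fully covering 16G at 4K/block costs 4M buffer_heads * 128 bytes =
 * 512M -- the same order as the 440Mb observed on the wedged machine. */
```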

> It's just that 1G of
> virtual address space reserved for kernel is too low to handle
> efficiently 64G of physical ram, this is a fact and you can't 
> workaround it. 

Death to global mappings! ;-)

I'd agree that a 64 bit vaddr space makes much more sense, but we're
stuck with the chips we've got for a little while yet. AMD were a few
years too late for the bleeding edge Intel arch people amongst us.

M.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02 20:14                                     ` Daniel Phillips
@ 2002-05-03  6:34                                       ` Andrea Arcangeli
  2002-05-03  9:33                                       ` Roman Zippel
  1 sibling, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:34 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Roman Zippel, Ralf Baechle, Russell King, linux-kernel

On Thu, May 02, 2002 at 10:14:02PM +0200, Daniel Phillips wrote:
> mechanism.  We've been over the NUMA-Q and mips32 cases in detail, so I won't

I didn't hear the mips32 argument, but for NUMA-Q nonlinear is
definitely the last thing you want; there is no discontinuity in the ram
in each node. Nonlinear can make sense _only_ when there are ram holes
in the middle of the per-numa-node mem_map. NUMA-Q has the need of
spreading the zone_normal over the different nodes, and nonlinear is
definitely not needed and won't help in achieving that object; NUMA-Q
in fact needs the discontigmem topology description to allow the numa
optimizations, so it cannot use nonlinear anyways to handle the holes
between the numa-nodes.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 22:20                                       ` Martin J. Bligh
  2002-05-02 21:28                                         ` William Lee Irwin III
@ 2002-05-03  6:38                                         ` Andrea Arcangeli
  2002-05-03  6:58                                           ` Martin J. Bligh
  1 sibling, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  6:38 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 03:20:39PM -0700, Martin J. Bligh wrote:
> > On Thu, May 02, 2002 at 08:41:36PM +0200, Andrea Arcangeli wrote:
> >> but you can plug 32bit pci hardware into your 64bit-pci slots, right?
> >> If not, and if you're also sure the linux drivers for your hardware are all
> >> 64bit-pci capable then you can do the changes regardless of the 4G
> >> limit, in such case you can spread the direct mapping all over the whole
> >> 64G physical ram, whereever you want, no 4G constraint anymore.
> > 
> > I believe 64-bit PCI is pretty much taken to be a requirement; if it
> > weren't the 4GB limit would once again apply and we'd be in much
> > trouble, or we'd have to implement a different method of accommodating
> > limited device addressing capabilities and would be in trouble again.
> 
> IIRC, there are some funny games you can play with 32bit PCI DMA.
> You're not necessarily restricted to the bottom 4Gb of phys addr space, 
> you're restricted to a 4Gb window, which you can shift by programming 
> a register on the card. Fixing that register to point to a window for the 
> node in question allows you to allocate from a node's pg_data_t and 
> assure DMAable RAM is returned.

if you've got as many windows as the number of nodes then you're just fine
in all cases.  You only need to teach pci_map_single and friends to
return the right bus address, which won't be an identity mapping of the
phys addr anymore; then you can forget the >4G phys constraint on the
pages returned by zone_normal :).
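What "teaching pci_map_single and friends" amounts to, schematically (everything below is a hypothetical sketch; only pci_map_single itself is a real 2.4 API name):

```c
#include <assert.h>

/* Hypothetical per-node descriptor of the 4G window programmed into the
 * 32bit card for that node. */
struct node_dma_window {
	unsigned long long phys_start; /* node's first physical address */
	unsigned long long bus_base;   /* bus address the window maps it to */
};

/* The core of a window-aware pci_map_single: the returned bus address
 * is no longer an identity map of the physical address. */
static unsigned long long phys_to_bus(unsigned long long phys,
                                      const struct node_dma_window *win)
{
	return win->bus_base + (phys - win->phys_start);
}
```

pci_unmap and the scatter-gather variants would apply the inverse translation.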

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  6:20                                 ` Andrea Arcangeli
@ 2002-05-03  6:39                                   ` Martin J. Bligh
  0 siblings, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03  6:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Gerrit Huizenga
  Cc: Daniel Phillips, Russell King, linux-kernel

>> With the "flat" addressing mode that Martin has been using (the
>> dummied down for NT version) everything is squished together.  That
>> makes it a bit harder to do node local data structures, although he
>> may have enough data from the MPS table to split memory appropriately.
> 
> sure, the only issue is the API that the hardware provides to advertise
> the start/end of the memory for each node. It doesn't matter if it's
> squashed or not as long as you still know the start/end of the phys ram
> per node. It also won't make any difference with nonlinear or
> discontigmem because you need to fill the pgdat anyways to enable the
> numa heuristics (node-affine-allocations being the most sensible etc..).

Yup, we can grab that info from the BIOS generated tables - see 
Pat Gaughen's patch posted here a few days ago that parses those
tables and feeds the pgdats if you want the gory details.

Martin.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  6:38                                         ` Andrea Arcangeli
@ 2002-05-03  6:58                                           ` Martin J. Bligh
  0 siblings, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03  6:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel

>> IIRC, there are some funny games you can play with 32bit PCI DMA.
>> You're not necessarily restricted to the bottom 4Gb of phys addr space, 
>> you're restricted to a 4Gb window, which you can shift by programming 
>> a register on the card. Fixing that register to point to a window for the 
>> node in question allows you to allocate from a node's pg_data_t and 
>> assure DMAable RAM is returned.
> 
> if you've got as many windows as the number of nodes then you're just fine
> in all cases.  You only need to teach pci_map_single and friends to
> return the right bus address, which won't be an identity mapping of the
> phys addr anymore; then you can forget the >4G phys constraint on the
> pages returned by zone_normal :).

I only have third-hand information, rather than real experience in
this particular area, but I believe the general idea was to map
the window for any given card onto its own node's physaddr space.

For a general dirty kludge, we could allocate DMAable memory by
simply doing an alloc_pages_node from node 0 (assuming a max of
4Gb in the first node ... if we really want a bounce buffer *and*
we have more than 4Gb in the first node *and* we have a 32 bit
DMA card, we can always alloc from ZONE_NORMAL on node 0 ... yes,
that's pretty disgusting ... but 99% of things will have 64 bit
DMA).
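The kludge in the last paragraph is a one-line policy; a sketch with alloc_pages_node mocked out (the real alloc_pages_node is the 2.4 NUMA allocator entry point; the wrapper name and the mock are made up for illustration):

```c
#include <assert.h>

struct page { int nid; };

/* Mock of alloc_pages_node(nid, gfp_mask, order): records which node
 * the allocation was directed to. */
static struct page *alloc_pages_node(int nid, unsigned int gfp_mask,
                                     unsigned int order)
{
	static struct page p;
	(void)gfp_mask;
	(void)order;
	p.nid = nid;
	return &p;
}

/* The dirty kludge: 32bit-DMA-capable memory always comes from node 0,
 * assumed to hold all physical ram below 4G. */
static struct page *alloc_dma32able_pages(unsigned int gfp_mask,
                                          unsigned int order)
{
	return alloc_pages_node(0, gfp_mask, order);
}
```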

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 22:39                                     ` Martin J. Bligh
@ 2002-05-03  7:04                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  7:04 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 03:39:54PM -0700, Martin J. Bligh wrote:
> > The difference is that if you use discontigmem you don't clobber the
> > common code in any way, there is no "logical/ordinal" abstraction,
> > there is no special table, it's all hidden in the arch section, and the
> > pgdat you need them anyways to allocate from affine memory with numa.
> 
> I *want* the logical / ordinal abstraction. That's not a negative thing -
> it reduces the number of complicated things I have to think about,
> allowing me to think more clearly, and write correct code ;-)

That's just overhead. You don't need an additional table for
ordinal/logical translation.

The only case where nonlinear will pay off is when you have to deal with
a single pgdat with huge physical holes in the middle of its per-node
mem_map. You don't have those holes in the middle of the mem_map of each
node, so it's cleaner and faster for you to avoid nonlinear; it's just
overhead.

Nonlinear instead definitely pays off with the origin 2k layout shown by
Ralf, or with the iseries machine if the partitioning mandates a huge
number of discontiguous chunks.

> 
> Not having a multitude of zones to balance in the normal discontigmem
> case also seems like a powerful argument to me ...
> 
> M.


Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02 23:42                             ` Daniel Phillips
@ 2002-05-03  7:45                               ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  7:45 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel

On Fri, May 03, 2002 at 01:42:56AM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 18:06, Andrea Arcangeli wrote:
> > On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote:
> > > On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote:
> > > > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote:
> > > > > At the moment I use the contig memory model (so we only use discontig for
> > > > > NUMA support) but I may need to change that in the future.
> > > > 
> > > > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into
> > > > the current discontigmem-numa model too as far I can see.
> > > 
> > > No it doesn't.  The config_discontigmem model forces all zone_normal memory
> > > to be on node zero, so all the remaining nodes can only have highmem locally.
> > 
> > You can trivially map the phys mem between 1G and 1G+256M to be in a
> > direct mapping between 3G+256M and 3G+512M, then you can put such 256M
> > at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.
> 
> Andrea, I'm re-reading this and I'm guilty of misreading your 3G+512M, what
> you meant is PAGE_OFFSET+512M.  Yes, in fact this is exactly what

yes, I was short in explaining it; 3G == PAGE_OFFSET for 99% of the
userbase, but it wasn't obvious.

> config_nonlinear does.  Config_discontigmem does not do this, not without
> your 'trivial map', and that's all config_nonlinear is: a trivial map done
> in an organized way.  This same trivial mapping is capable of replacing all
> known non-numa uses of config_discontigmem.

You add a table lookup, and the lookup on such a table or data structure
is pure overhead if your ram is contiguous.  NUMA-Q has contiguous ram
within each node, so it doesn't make sense to add the nonlinear overhead;
to provide normal memory from the other nodes they only need to change
virt_to_page and page_address, plus of course the initialization of the
direct mapping (and the window initialization of the pci32
BAR windows/pci_map_single, but this latter pci part is independent
of the discontigmem/nonlinear issue).

Nonlinear makes sense, and is not pure overhead, _only_ when the
mem_map has holes: instead of wasting ram on the mem_map you pay a
CPU hit with the nonlinear lookup, and so it can pay off. If there's no
hole in the per-node mem_map pointed to by the pgdat then nonlinear
cannot pay off. At the start of the thread I had never heard of
configurations with huge ram holes in the middle of the nodes, I thought
it had to be misconfigured hardware; origin 2k and iseries fall into this
category, so for them nonlinear can pay off (but if I had an iSeries I
would know how to partition it efficiently and I would be fine with
discontigmem, be sure; the other is a fascinating but slow dinosaur anyways).
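The "table lookup" overhead being argued about can be made concrete; a mock sketch (section size, table layout and every name here are invented for illustration) contrasting a shift-only discontigmem-style virt_to_page with a nonlinear-style one:

```c
#include <assert.h>
#include <stddef.h>

#define SECTION_SHIFT 28 /* assumed 256M sections */

struct mock_page { int dummy; };
static struct mock_page fake_map[2];          /* stand-in mem_map pieces */
static struct mock_page *section_mem_map[16]; /* per-section mem_map base */
static int section_table[16];                 /* nonlinear's extra table */

/* discontigmem-style: the section/node id falls straight out of the
 * physical address with one shift (cf. Alpha's ADDR_TO_MAPBASE quoted
 * at the top of the thread). */
static struct mock_page *vtp_discontig(unsigned long long phys)
{
	return section_mem_map[phys >> SECTION_SHIFT];
}

/* nonlinear-style: one extra memory reference through the
 * logical->ordinal table before reaching the mem_map; pure overhead
 * when each node's ram is contiguous, worthwhile only when a node's
 * mem_map would otherwise have big holes. */
static struct mock_page *vtp_nonlinear(unsigned long long phys)
{
	int sec = section_table[phys >> SECTION_SHIFT]; /* the lookup */
	return section_mem_map[sec];
}
```

(Both return only the mem_map base; the in-section index would be added identically in either scheme.)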

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  6:33                                         ` Martin J. Bligh
@ 2002-05-03  8:38                                           ` Andrea Arcangeli
  2002-05-03  9:26                                             ` William Lee Irwin III
  2002-05-03 15:17                                             ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
  0 siblings, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  8:38 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel

On Thu, May 02, 2002 at 11:33:43PM -0700, Martin J. Bligh wrote:
> into kernel address space for ever. That's a fundamental scalability
> problem for a 32 bit machine, and I think we need to fix it. If we
> map only the pages the process is using into the user-kernel address
> space area, rather than the global KVA, we get rid of some of these
> problems. Not that that plan doesn't have its own problems, but ... ;-)

:) As said, every workaround has a significant drawback at this point.
Starting to flood the tlb with invlpg and pagetable walks every time
we need to do a set_bit, clear_bit, test_bit or an unlock_page is both
overkill at runtime and overcomplex on the software side, having to
manage those kernel pools in user memory.

Just assume we do that and that you're ok with paying the hit in general
purpose usage; then next year how will you work around the
limitation of 64G of physical ram? Are you going to multiplex another
64G of ram via a pci register so you can handle 128G of ram on x86, just
not simultaneously? (But that's ok in theory: the cpu won't notice
you're swapping the ram under it, and you cannot keep more than 4G
mapped in virtual mem simultaneously anyway, so it doesn't matter if
some ram isn't visible on the physical side either.)

I mean, in theory there's no limit, but in practice there is. 64G
is just over the limit for general purpose x86 IMHO; it's at a point
where every workaround for something has a significant performance (or
memory) drawback. It's still very fine for custom apps that need that
much ram, but 32G is the practical limit of general purpose x86.

Ah, and of course you could also use 2M pagetables by default to make it
more usable, but still you would run into some huge ram wastage in certain
usages with small files, huge pageins and reads, swapouts and swapins,
plus it wouldn't be guaranteed to be transparent to the userspace
binaries (for instance mmap offset fields would break backwards
compatibility on the required alignment; that's probably the least of
the problems though). Despite its also-significant drawbacks and the
complexity of the change, the large pagetables would probably be the
saner approach to managing 64G more efficiently with only an 800M kernel
window.

> Bear in mind that we've sucessfully used 64Gb of ram in a 32 bit 
> virtual addr space a long time ago with Dynix/PTX.

You can use 64G "successfully" right now with 2.4.19pre8 too; as said
in the earlier email there are many applications that don't care if
there's only a few meg of zone_normal, and for them 2.4.19pre8 is just
fine (actually -aa is much better for the bounce buffers and other vm
fixes in that area). If all the load is in userspace, current 2.4 is just
optimal and you'll take advantage of all the ram without problems (let's
assume it's not a numa machine; with numa you'd be better off with the
fixes I included in my tree).  But if you need the kernel to do some
amount of work, like vfs caching, blkdev cache, lots of bh on pagecache,
lots of vma, lots of kiobufs, skb etc., then you'd probably be faster if
you boot with mem=32G, or at least you should take actions like
recompiling the kernel as CONFIG_2G, which would then break a large 1.7G
SGA etc...

> > So at the end you'll be left with
> > only say 5-10M per node of zone_normal that will be filled immediately as
> > soon as you start reading some directory from disk. a few hundred mbyte
> > of vfs cache is the minimum for those machines, this doesn't even take
> > into account bh headers for the pagecache, physical address space
> > pagecache for the buffercache, kiobufs, vma, etc... 
> 
> Bufferheads are another huge problem right now. For a P4 machine, they
> round off to 128 bytes per data structure. I was just looking at a 16Gb
> machine that had completely wedged itself by filling ZONE_NORMAL with 

Go ahead, use -aa or the vm-33 update; I fixed that problem a few days
after hearing about it the first time (with due credit to Rik in a
comment for showing me the problem, btw; I never noticed it before).

> unfreeable overhead - 440Mb of bufferheads alone. Globally mapping the
> bufferheads is probably another thing that'll have to go.
> 
> > It's just that 1G of
> > virtual address space reserved for kernel is too low to handle
> > efficiently 64G of physical ram, this is a fact and you can't 
> > workaround it. 
> 
> Death to global mappings! ;-)
> 
> I'd agree that a 64 bit vaddr space makes much more sense, but we're

This is my whole point yes :)

> stuck with the chips we've got for a little while yet. AMD were a few
> years too late for the bleeding edge Intel arch people amongst us.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  6:04                                       ` Andrea Arcangeli
  2002-05-03  6:33                                         ` Martin J. Bligh
@ 2002-05-03  9:24                                         ` William Lee Irwin III
  2002-05-03 10:30                                           ` Andrea Arcangeli
  2002-05-03 15:32                                           ` Martin J. Bligh
  1 sibling, 2 replies; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-03  9:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel

I apologize in advance for the untimeliness of this response; I took
perhaps more time than necessary to consider the contents thereof.

On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote:
>> Without relaxing this invariant for this architecture there is no hope
>> that NUMA-Q can ever be efficiently operated by this kernel.

On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote:
> I don't think it makes sense to attempt breaking GFP_KERNEL semantics in
> 2.4, but for 2.5 we can change stuff so that all non-DMA users can ask
> for ZONE_NORMAL that will be backed by physical memory over 4G (that's
> fine for all inodes, dcache, files, bufferheads, kiobufs, vmas and many other
> in-core data structures never accessed by hardware via DMA; it's ok even
> for the buffer cache because the lowlevel layer has the bounce buffer
> layer that is smart enough to understand when bounce buffers are needed
> on top of the physical address space pagecache).

Well, in a sense, they're already facing some problems from the
progressively stranger hardware people have been porting Linux to. For
instance, suppose there were a machine whose buses were only capable
of addressing memory on nodes local to them... The assumption that
membership within a single address region suffices to ensure that
devices are capable of addressing it then breaks down.
(The workaround was to IPI and issue the command from another node.)


On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote:
>> Artificially tying together the device-addressability of memory and
>> virtual addressability of memory is a fundamental design decision which
>> seems to behave poorly for NUMA-Q, though in general it seems to work okay.

On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote:
> Yes, you know since a few months ago we weren't even capable of skipping
> the bounce buffers for the memory between 1G and 4G and for the memory
> above 4G with pci-64, now we can, in the future we can be more
> finegrined if there's the need to.
> Again note that nonlinear can do nothing to help you there, the
> limitation you deal with is pci32 and the GFP API, not at all about
> discontigmem or nonlinear. we basically changed topic from here.

Given the amount of traffic that's already happened for that thread,
I'd be glad to change subjects. =)

While I don't have a particular plan to address what changes to the
GFP API might be required to make these scenarios work, a quick thought
is to pass in indices into a table of zones corresponding to regions of
memory addressable by some devices and not others. It'd give rise to a
partition like what is already present with foreknowledge of ISA DMA
and 32-bit PCI, but there would be strange corner cases, for instance,
devices claiming to be 32-bit PCI that don't wire all the address lines.
I'm not entirely sure how smoothly these cases are now handled anyway.


On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote:
>> I believe 64-bit PCI is pretty much taken to be a requirement; if it
>> weren't the 4GB limit would once again apply and we'd be in much
>> trouble, or we'd have to implement a different method of accommodating
>> limited device addressing capabilities and would be in trouble again.

On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote:
> If you're sure all the hw device are pci64 and the device drivers are
> using DAC to submit the bus addresses, then you're just fine and you can
> use pages over 4G for the ZONE_NORMAL too. and yes, if you add an IOMMU
> unit like the GART then you can fill the zone_normal with phys pages
> over 4G too because then the bus address won't be an identity anymore
> with the phys addr, I just assumed it wasn't the case because most x86
> doesn't have that capability besides the GART that isn't currently used
> by the kernel as an iommu but that it's left to use to build contigous
> ram for the AGP cards (and also not all x86 have an AGP so we couldn't
> use it by default on x86 even assuming the graphics card doesn't need
> it).

That sounds a bit painful; digging through drivers to check if any are
missing DAC support is not my idea of a good time.


On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote:
>> I've not been using the generic page_address() in conjunction with
>> highmem, but this sounds like a very natural thing to do when the need
>> to do so arises; arranging for storage of the virtual address sounds
>> trickier, though doable. I'm not sure if mainline would want it, and
>> I don't feel a pressing need to implement it yet, but then again, I've
>> not yet been parked in front of a 64GB x86 machine...

On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote:
> Personally I always had the hope to never need to see a 64G 32bit
> machine 8). I mean, even if you manage to solve the pci32bit problem
> with GFP_KERNEL, then you still have to share 800M across 16 nodes with
> 4G each. So striping zone_normal over all the nodes to have numa-local
> data structures with fast slab allocations will get at most 50 mbyte per
> node, of which around 90% will be eaten by the mem_map array
> for those 50M plus the other 4G-50M. So at the end you'll be left with
> only say 5-10M per node of zone_normal that will be filled immediately as
> soon as you start reading some directory from disk. A few hundred mbyte
> of vfs cache is the minimum for those machines, this doesn't even take
> into account bh headers for the pagecache, physical address space
> pagecache for the buffercache, kiobufs, vmas, etc... Even ignoring the fact
> it's NUMA, a 64G machine will boot fine (thanks to your 2.4.19pre work
> that shrinks each page structure by some bytes) but it will still work well
> only depending on what you're doing; for example it's fine for number
> crunching but it will be bad for most other important workloads. And
> this is only because of the 32bit address space, it doesn't have anything
> to do with nonlinear/numa/discontigmem or pci32.  It's just that 1G of
> virtual address space reserved for kernel is too low to handle
> 64G of physical ram efficiently; this is a fact and you can't work around
> it. Every workaround will add a penalty here or there. The workaround
> you will be mostly forced to take is CONFIG_2G, after that the userspace
> will be limited to less than 1G per task returned by malloc (from over
> 1G to below 2G) and that will be a showstopper again for most userspace
> apps that want to run on a 64G box, like a DBMS that wants almost 2G of
> SGA. I'm glad we're finally going to migrate all to 64bit, just in time
> not to see a relevant number of 32bit 64G boxes.

64GB machines are not new. NUMA-Q's original OS (DYNIX/ptx) must have
been doing something radically different, for it appeared to run well
there, and it did so years ago. The amount of data actually required to
be globally mapped should in principle be no larger than the kernel's
loaded image, and everything else can be dynamically mapped by mapping
pages as pointers into them are followed. The practical reality of
getting Linux to do this for a significant fraction of its globally-mapped
structures (or anyone accepting a patch to make it do so) is another
matter entirely.  Optional facilities for the worst offenders might be
more practical, for instance:

(1) Given per-zone kswapd's, i.e. separate process contexts for each
	large fragment of mem_map, it should be possible to reserve
	a large portion of the process' address space for mapping in
	its local mem_map. Algorithms allowing sufficient locality
	of reference (e.g. reverse-mappings) would be required for
	this to be effective.

(2) Various large boot-time allocated structures (think big hash
	tables) could be changed so that either the algorithm only
	requires a small root to be globally mapped in the kernel
	virtual address space (trees), localized on a per-object basis
	if there is an object to hang them off of (e.g. ratcache), or
	highmem allocate the table with a globally-mapped physical
	address available for mapping in the needed portions on-demand
	(like the above mem_map suggestion but without any way to
	give process contexts the ability to restrict themselves
	to orthogonal subsets of the structure).

(3) In order to accommodate the sheer number of dynamic mappings
	going on a large process/mmu-context-local cache of virtual
	address space for mapping them in would be needed for
	efficiency, changing the memory map of Linux/i386 as well
	as adding another kind of (address-space local) kmapping.

(4) The bootstrap sequence would need to be altered so that dynamic
	mappings of boot-time allocated structures residing outside
	the direct-mapped portion of the kernel virtual address space
	are possible, as well as the usual sprinkling of small chunks
	of ZONE_NORMAL across nodes so that something is possible.

Almost anything could exhaust the kernel virtual address space if
left permanently mapped. And worse yet, there are some DBMS's that want
3.5GB, not just 3GB. These potentially very time-consuming changes
basically kmap everything, including the larger portions of mem_map.
The answer I seem to hear most often is "get a 64-bit CPU".

But I believe it's fully possible to get the larger highmem systems to
what is very near a sane working state and feed back to mainline a good
portion of the less invasive patches required to address fundamental
stability issues associated with highmem, and I welcome any assistance
toward that end.

What is likely the more widely beneficial aspect of this work is that
it can expose the fundamental stability issues of the highmem
implementation very readily, and so provide users of more common 32-bit
highmem systems a greater degree of stability than they have previously
enjoyed, given the kva exhaustion issues.


On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote:
> And of course, I don't mean a 64G 32bit machine doesn't make sense; it
> can make perfect sense for a certain number of users with specific needs
> for lots of ram and with very few kernel data structures. If you do that,
> it's because you know what you're doing and you know you can tweak
> linux for your own workload, and that's fine as long as it's not supposed to
> be a general purpose machine (by general purpose I mean pretending to
> run a DBMS with a 1.7G SGA requiring CONFIG_3G, plus a cvs [or bk if
> you're a bk fan] server dealing with huge vfs metadata at the same time;
> for instance the cvs workload would run faster booting with mem=32G :)

Well, this is certainly not the case with other OS's. The design
limitations of Linux' i386 memory layout, while they now severely
hamper performance on NUMA-Q, are a tradeoff that has proved
advantageous on other platforms, and should be approached with some
degree of caution even while Martin Bligh (truly above all others),
myself, and others attempt to address the issues raised by it on NUMA-Q.
But I believe it is possible to achieve a good degree of virtual
address space conservation without compromising the general design,
and if I may be so bold as to speak on behalf of my friends, I believe
we are willing to, capable of, and now exercising that caution.


On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote:
>> Absolutely; I'd be very supportive of improvements for this case as well.
>> Many of the systems with the need for discontiguous memory support will
>> also benefit from parallelizations or other methods of avoiding references
>> to remote nodes/zones or iterations over all nodes/zones.

On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote:
> I would suggest starting on a case-by-case basis looking at the profiling,
> so we make more complex only what is worth optimizing.  For example
> nr_free_buffer_pages() I guess will show up because it is used quite
> frequently.

I think I see nr_free_pages(), but nr_free_buffer_pages() sounds very
likely as well. Both of these would likely benefit from per-cpu
counters.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  8:38                                           ` Andrea Arcangeli
@ 2002-05-03  9:26                                             ` William Lee Irwin III
  2002-05-03 15:38                                               ` Martin J. Bligh
  2002-05-03 15:17                                             ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
  1 sibling, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-03  9:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin J. Bligh, Daniel Phillips, Russell King, linux-kernel

On Fri, May 03, 2002 at 10:38:13AM +0200, Andrea Arcangeli wrote:
> You can use 64G "successfully" right now with 2.4.19pre8; as said
> in the earlier email, there are many applications that don't care if
> there's only a few megs of zone_normal, and for them 2.4.19pre8 is just
> fine (actually -aa is much better for the bounce buffers and other vm
> fixes in that area). If all the load is in userspace, current 2.4 is just

Have you done testing with 64GB? What sort of failure modes are you
seeing with it? I've been hearing about more severe failure modes in
practice on 32GB; Martin, could you comment on this?


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: discontiguous memory platforms
  2002-05-02 20:14                                     ` Daniel Phillips
  2002-05-03  6:34                                       ` Andrea Arcangeli
@ 2002-05-03  9:33                                       ` Roman Zippel
  1 sibling, 0 replies; 165+ messages in thread
From: Roman Zippel @ 2002-05-03  9:33 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Andrea Arcangeli, Ralf Baechle, Russell King, linux-kernel

Hi,

On Thu, 2 May 2002, Daniel Phillips wrote:

> I'll accept 'not needed for 68K', though I guess config_nonlinear will work
> perfectly well for you and be faster than the loops.  However, some of the
> problems that config_nonlinear solves cannot be solved by any existing kernel
> mechanism.  We've been over the NUMA-Q and mips32 cases in detail, so I won't
> reiterate.

Maybe I missed that, but could you give me an example of a memory
configuration which would be difficult to handle with the current
vm? Could you describe how, in your model, the physical address space would
be mapped to the logical and virtual address space, and how they are mapped
into the pgdat nodes?
Some real numbers would help me a lot to understand what you have in
mind. I have a rough idea of it, but I want to be sure we are talking
about the same thing.

bye, Roman


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  9:24                                         ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III
@ 2002-05-03 10:30                                           ` Andrea Arcangeli
  2002-05-03 11:09                                             ` William Lee Irwin III
  2002-05-03 15:42                                             ` Martin J. Bligh
  2002-05-03 15:32                                           ` Martin J. Bligh
  1 sibling, 2 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 10:30 UTC (permalink / raw)
  To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips,
	Russell King, linux-kernel

On Fri, May 03, 2002 at 02:24:26AM -0700, William Lee Irwin III wrote:
> 64GB machines are not new. NUMA-Q's original OS (DYNIX/ptx) must have
> been doing something radically different, for it appeared to run well
> there, and it did so years ago. The amount of data actually required to

Did you ever benchmark DYNIX/ptx against Linux on a 64bit machine or
on a 4G x86 machine? Special changes to deal with the small KVA, as said,
are possible, but they will have to affect performance somehow. One way
to reduce the regression on normal 32bit machines could be to take
the special actions, like putting the mem_map in highmem, only depending
on the amount of ram (there would still be the branches for every access
of a page structure, at least unless you take the messy self-modifying
code route).

> The answer I seem to hear most often is "get a 64-bit CPU".
> 
> But I believe it's fully possible to get the larger highmem systems to
> what is very near a sane working state and feed back to mainline a good
> portion of the less invasive patches required to address fundamental
> stability issues associated with highmem, and I welcome any assistance
> toward that end.

The stability should be just complete in current -aa; it's just the
performance that won't be ok. If you want more cache, larger hashes,
more skbs etc., you'll need to pay with something else that would then
hurt on a 64bit arch or on a smaller box.

> What is likely the more widely beneficial aspect of this work is that
> it can expose the fundamental stability issues of the highmem
> implementation very readily, and so provide users of more common 32-bit
> highmem systems a greater degree of stability than they have previously
> enjoyed, given the kva exhaustion issues.

Agreed, in fact if somebody can test current -aa on a 64G x86 box I'd be
glad to hear the results. It should just work stably, at least as far as
the VM is concerned (mainline should have some problems instead), except
it will probably return -ENOMEM on mmap/open/etc. after you exhaust the
normal zone, and there can be packet loss too, but that's expected
(CONFIG_2G will make it almost completely usable on the kernel side, but
reduces userspace). The important thing is that it never deadlocks or
malfunctions with CONFIG_3G.

> Well, this is certainly not the case with other OS's. The design
> limitations of Linux' i386 memory layout, while they now severely

I see it's limited for your needs on a 64G box, but "limited" sounds like
"weak", while it's really the optimal design for 64bit archs and normal
32bit machines.

> hamper performance on NUMA-Q, are a tradeoff that has proved
> advantageous on other platforms, and should be approached with some
> degree of caution even while Martin Bligh (truly above all others),
> myself, and others attempt to address the issues raised by it on NUMA-Q.
> But I believe it is possible to achieve a good degree of virtual
> address space conservation without compromising the general design,
> and if I may be so bold as to speak on behalf of my friends, I believe
> we are willing to, capable of, and now exercising that caution.

Putting the mem_map in highmem would be the first step; after that you'd
be at about 90% of the work needed to make it general purpose. You'd
have to wrap most actions on the page struct with wrappers, and it
will be quite an invasive change (much more invasive than pte-highmem),
but it could be done. For this one (unlike pte-highmem) you definitely
need a config option to select it; most people don't need this feature
enabled because they have less than 8G of ram, and it will also
have a significant runtime cost.

> On Thu, May 02, 2002 at 12:19:03PM -0700, William Lee Irwin III wrote:
> >> Absolutely; I'd be very supportive of improvements for this case as well.
> >> Many of the systems with the need for discontiguous memory support will
> >> also benefit from parallelizations or other methods of avoiding references
> >> to remote nodes/zones or iterations over all nodes/zones.
> 
> On Fri, May 03, 2002 at 08:04:33AM +0200, Andrea Arcangeli wrote:
> > I would suggest starting on a case-by-case basis looking at the profiling,
> > so we make more complex only what is worth optimizing.  For example
> > nr_free_buffer_pages() I guess will show up because it is used quite
> > frequently.
> 
> I think I see nr_free_pages(), but nr_free_buffer_pages() sounds very
> likely as well. Both of these would likely benefit from per-cpu
> counters.

nr_free_pages() actually could be mostly optimized out by setting
overcommit to 1 :); beyond that it is used basically only for /proc
stats. But yes, with overcommit at 0 (the default) every mmap will take
the hit in nr_free_pages(), so in most workloads it would be even more
frequent than nr_free_buffer_pages() (with the difference that
nr_free_buffer_pages cannot be avoided).

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03 10:30                                           ` Andrea Arcangeli
@ 2002-05-03 11:09                                             ` William Lee Irwin III
  2002-05-03 11:27                                               ` Andrea Arcangeli
  2002-05-03 15:42                                             ` Martin J. Bligh
  1 sibling, 1 reply; 165+ messages in thread
From: William Lee Irwin III @ 2002-05-03 11:09 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, Daniel Phillips, linux-kernel

On Fri, May 03, 2002 at 12:30:09PM +0200, Andrea Arcangeli wrote:
> Putting the mem_map in highmem would be the first step; after that you'd
> be at about 90% of the work needed to make it general purpose. You'd
> have to wrap most actions on the page struct with wrappers, and it
> will be quite an invasive change (much more invasive than pte-highmem),
> but it could be done. For this one (unlike pte-highmem) you definitely
> need a config option to select it; most people don't need this feature
> enabled because they have less than 8G of ram, and it will also
> have a significant runtime cost.

Invasive or not, if running is impossible without it, it must be done.

This is a probable first order of business, given that mem_map is the
single largest consumer of KVA, with only really enough mitigation for
bootability provided by my prior efforts at reducing the size of struct
page. A clean, perhaps even mergeable design for this would be a great
boon to all users of larger highmem systems. IIRC buffer_heads were the
specific reported problem, and though they themselves consume excessive
KVA only under some circumstances, they present a much greater danger
in combination with the excessively large boot-time KVA allocation.

Martin, can you take over? I've got plenty of ideas about what to code
up, but you've actually got your hands on the machine and are knee-deep
in the issues. I'm getting hit up for specifics I can't answer.

Andrea, it might also be helpful to hear your input during the LSE
conference call tomorrow. The topic is KVA exhaustion scenarios, which
seem to be of interest to you as well.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03 11:09                                             ` William Lee Irwin III
@ 2002-05-03 11:27                                               ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 11:27 UTC (permalink / raw)
  To: William Lee Irwin III, Martin J. Bligh, Daniel Phillips, linux-kernel

On Fri, May 03, 2002 at 04:09:51AM -0700, William Lee Irwin III wrote:
> page. A clean, perhaps even mergeable design for this would be a great
> boon to all users of larger highmem systems. IIRC buffer_heads were the
> specific reported problem, and though they themselves consume excessive

bh problems should be fixed with my latest vm updates; while it's nice
to cache the bh across multiple writes, it's not a big problem having to ask
the fs again to translate from logical to physical, so dropping bhs
aggressively when needed is ok and the right thing to do.

> Andrea, it might also be helpful to hear your input during the LSE
> conference call tomorrow. The topic is KVA exhaustion scenarios, which
> seem to be of interest to you as well.

ok.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03  8:38                                           ` Andrea Arcangeli
  2002-05-03  9:26                                             ` William Lee Irwin III
@ 2002-05-03 15:17                                             ` Martin J. Bligh
  2002-05-03 15:58                                               ` Andrea Arcangeli
  2002-05-03 16:02                                               ` Daniel Phillips
  1 sibling, 2 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03 15:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel

> On Thu, May 02, 2002 at 11:33:43PM -0700, Martin J. Bligh wrote:
>> into kernel address space for ever. That's a fundamental scalability
>> problem for a 32 bit machine, and I think we need to fix it. If we
>> map only the pages the process is using into the user-kernel address
>> space area, rather than the global KVA, we get rid of some of these
>> problems. Not that that plan doesn't have its own problems, but ... ;-)
> 
> :) As said every workaround has a significant drawback at this point.
> Starting to flood the TLB with invlpg and page-table walks every time
> we need to do a set_bit, clear_bit, test_bit or an unlock_page is both
> overkill at runtime and overly complex on the software side when it
> comes to managing those kernel pools in user memory.

Whilst I take your point in principle, and acknowledge that there is
some cost to pay, I don't believe that the working set of one task is
all that dynamic (see also the second paragraph below). Some stuff really
is global data that's used by a lot of processes, but lots of other
things really are per task. If only one process has a given file open,
that's the only process that needs to see the pagecache control
structures for that file.

We don't have to TLB-flush every time we map something in, only when
we delete it. For the sake of illustration, imagine a huge kmap pool
for each task: we just map things in as we need them (say some pagecache
structures when we open a file that's already partly in cache), and
use lazy TLB flushing to tear down those mappings for free when we
context switch. If we run out of virtual space, yes, we'll have to
flush, but I suspect that won't be too bad (for most workloads) if
we're careful how we flush.
 
> Just assume we do that and that you're ok with paying the hit in general
> purpose usage; then next year how do you plan to work around the
> limitation of 64G of physical ram,

;-) No, I agree we're pushing the limits here, and I don't want to be
fighting this too much longer. The next generation of machines will 
all have larger virtual address spaces, and I'll be happy when they
arrive. For now, we have to deal with what we have, and support the
machines that are in the marketplace, and ia32 is (to my mind) still
faster than ia64. 

I'm really looking forward to AMD's Hammer architecture, but it's 
simply not here right now, and even when it is, there will be these 
older 32 bit machines in the field for a few years yet to come, and
we have to cope with them as best we can.


> Ah, and of course you could also use 2M pagetables by default to make it
> more usable, but you would still run into huge ram wastage in certain
> usages with small files, and huge pageins and reads, swapouts and swapins;
> plus it wouldn't be guaranteed to be transparent to userspace
> binaries (for instance mmap offset fields would break backwards
> compatibility on the required alignment; that's probably the last
> problem though). Despite its own significant drawbacks and the
> complexity of the change, the 4M pagetables would probably be the saner
> approach to managing 64G more efficiently with only an 800M kernel window.

Though that'd reduce the size of some of the structures, I'd still
have other concerns (such as the large-page TLB size, which is something
stupid like 4 entries, IIRC), and the space wastage you mentioned. Page
clustering is probably a more useful technique: letting the existing
control structures control groups of pages. For example, one struct
page could control an aligned group of 4 4K pages, giving us an
effective page size of 16K from the management-overhead point of
view (swap in and out in 4-page chunks, etc).

>> Bear in mind that we've successfully used 64Gb of ram in a 32 bit 
>> virtual addr space a long time ago with Dynix/PTX.
> 
> You can use 64G "successfully" right now with 2.4.19pre8; as said

I said *used*, not *booted* ;-) There's a whole host of problems
we still have to fix, and some tradeoffs to be made; we just
have to make those without affecting the people that don't need
them. It won't be easy, but I don't think it'll be impossible either.

>> Bufferheads are another huge problem right now. For a P4 machine, they
>> round off to 128 bytes per data structure. I was just looking at a 16Gb
>> machine that had completely wedged itself by filling ZONE_NORMAL with 
> 
> Go ahead, use -aa or the vm-33 update, I fixed that problem a few days
> after hearing about it the first time (with the due credit to Rik in a
> comment for showing me such problem btw, I never noticed it before).

Thanks - I'll have a close look at that ... I didn't know you'd
already fixed that one.

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  9:24                                         ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III
  2002-05-03 10:30                                           ` Andrea Arcangeli
@ 2002-05-03 15:32                                           ` Martin J. Bligh
  1 sibling, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03 15:32 UTC (permalink / raw)
  To: William Lee Irwin III, Andrea Arcangeli
  Cc: Daniel Phillips, Russell King, linux-kernel

>> Again note that nonlinear can do nothing to help you there, the
>> limitation you deal with is pci32 and the GFP API, not at all about
>> discontigmem or nonlinear. 
>
> While I don't have a particular plan to address what changes to the
> GFP API might be required to make these scenarios work, a quick thought
> is to pass in indices into a table of zones corresponding to regions of
> memory addressable by some devices and not others. It'd give rise to a
> partition like what is already present with foreknowledge of ISA DMA
> and 32-bit PCI, but there would be strange corner cases, for instance,
> devices claiming to be 32-bit PCI that don't wire all the address lines.
> I'm not entirely sure how smoothly these cases are now handled anyway.

In my mind, one possibility for a powerful API would be to specify a
mask of acceptable physical addresses, and a "state" for what kind of
mapping you wanted - global kernel permanently mapped address, unmapped
address, per-task kernel mapped address, per-address space kernel
mapped address, etc.

Without thinking about it too much (aka I'm sticking my neck out and
am going to get shot down ;-)) it would seem possible to do the phys
mask idea inside the current buddy system without too much problem
if the mask was aligned on 2^MAX_ORDER * sizeof(struct page) boundaries?
I need to think about that one some more, but I thought I'd throw it
out to see what people think ...
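The shape of the proposed interface could be sketched like this (purely hypothetical; nothing like this exists in 2.4, and all names and semantics are invented for illustration):

```c
#include <assert.h>

/* The caller passes a mask of acceptable physical addresses plus the
 * kind of mapping wanted back. */
enum map_state {
	MAP_KERNEL_GLOBAL,	/* permanently mapped, global kernel address */
	MAP_UNMAPPED,		/* frame only, no kernel mapping */
	MAP_KERNEL_PER_TASK,	/* per-task kernel mapped address */
};

/* a frame is acceptable if none of its physical address bits fall
 * outside the caller's mask (same idea as a pci dma mask) */
static int frame_satisfies_mask(unsigned long long phys,
				unsigned long long phys_mask)
{
	return (phys & ~phys_mask) == 0;
}
```

A 32-bit PCI device would then pass a 0xFFFFFFFF mask, while a device that doesn't wire all its address lines could pass something narrower.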

M.



* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  9:26                                             ` William Lee Irwin III
@ 2002-05-03 15:38                                               ` Martin J. Bligh
  0 siblings, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03 15:38 UTC (permalink / raw)
  To: William Lee Irwin III, Andrea Arcangeli
  Cc: Daniel Phillips, Russell King, linux-kernel

> Have you done testing with 64GB? What sort of failure modes are you
> seeing with it? I've been hearing about more severe failure modes in
> practice on 32GB, Martin, could you comment on this?

I've never gone above 32Gb (yet ;-)). We don't have an SMP platform
that I know of that'll support 64Gb, only the NUMA platforms.

32Gb will boot and work with 1GB KVA, but if you actually want to
use the memory for something, a 2GB KVA seems imperative. It depends
on the workload you're using, but the things we tend to see are:

1. struct page.
2. buffer heads (will look at -aa tree)
3. user page tables (need highpte)
4. LDTs for threads filling the vmalloc space (seems to be fixed in 2.5)

I think the whole struct page issue needs some (complex, hard) work,
but in general, we're getting there. Fast ;-)

M.

PS. BTW, Andrea, your latest highpte looks like you obliterated the
kmap problem I was complaining of, but I've been having massive problems 
with other things which are blocking much of the real testing ... 
sorry about the time lag ;-)


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03 10:30                                           ` Andrea Arcangeli
  2002-05-03 11:09                                             ` William Lee Irwin III
@ 2002-05-03 15:42                                             ` Martin J. Bligh
  1 sibling, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03 15:42 UTC (permalink / raw)
  To: Andrea Arcangeli, William Lee Irwin III, Daniel Phillips,
	Russell King, linux-kernel

> Putting the mem_map in highmem would be the first step, after that you
> should be just at the 90% of work done to make it general purpose,
> you should wrap most actions on the page struct with wrappers and it
> will be quite an invasive change (much more invasive than pte-highmem),
> but it could be done. For this one (unlike pte-highmem) you definitely
> need a config option to select it, most people don't need this feature
> enabled because they've less than 8G of ram and also considering it will
> have a significant runtime cost.

Absolutely agree making it an option - other people with smaller memory
configs may also find this useful for enlarging the user address space 
to 3.5Gb for databases et al. with an 8Gb or 16Gb machine.
 
M.




* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 15:17                                             ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
@ 2002-05-03 15:58                                               ` Andrea Arcangeli
  2002-05-03 16:10                                                 ` Martin J. Bligh
  2002-05-03 16:02                                               ` Daniel Phillips
  1 sibling, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 15:58 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Daniel Phillips, Russell King, linux-kernel

On Fri, May 03, 2002 at 08:17:23AM -0700, Martin J. Bligh wrote:
> We don't have to tlb flush every time we map something in, only when
> we delete it. For the sake of illustration, imagine a huge kmap pool
> for each task, we just map things in as we need them (say some pagecache

yes, the pool will "cache" the mem_map virtual window for a while, but
the complexity of the pool management isn't trivial, in the page
structure you won't find the associated per-task cached virtual address,
you will need something like a lookup on a data structure associated
with the task struct to find if you just have it in cache or not in the
per-process userspace kmap pool. The current kmap pool is an order of
magnitude simpler thanks to page->virtual but you cannot have a
page->virtual[nr_tasks] array.
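The per-task lookup Andrea is describing could be modelled as a small pfn -> kernel-virtual cache hanging off each task (a toy direct-mapped hash; every name here is hypothetical, invented only to show why this is heavier than a single page->virtual field):

```c
#include <assert.h>
#include <string.h>

#define KMAP_CACHE_SLOTS 64

struct task_kmap_cache {
	unsigned long pfn[KMAP_CACHE_SLOTS];	/* 0 marks an empty slot */
	unsigned long vaddr[KMAP_CACHE_SLOTS];
};

/* returns the cached kernel virtual address, or 0 if the caller must
 * go map the frame into its pool */
static unsigned long kmap_cache_lookup(struct task_kmap_cache *c,
				       unsigned long pfn)
{
	unsigned int slot = pfn % KMAP_CACHE_SLOTS;
	return c->pfn[slot] == pfn ? c->vaddr[slot] : 0;
}

static void kmap_cache_insert(struct task_kmap_cache *c,
			      unsigned long pfn, unsigned long vaddr)
{
	unsigned int slot = pfn % KMAP_CACHE_SLOTS;
	c->pfn[slot] = pfn;	/* silently evicts any previous occupant */
	c->vaddr[slot] = vaddr;
}
```

Every lookup, insert, and eviction here is per-task state that page->virtual gives the current kmap pool for free.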

Another interesting problem is that 'struct page *' will be at best a
cookie, not a valid pointer anymore, not sure what's the best way to
handle that. Working with pfn would be cleaner rather than working with
a cookie (somebody could dereference the cookie by mistake thinking it's
a page structure old style), but if __alloc_pages returns a pfn a whole
lot of kernel code will break.

> older 32 bit machines in the field for a few years yet to come, and
> we have to cope with them as best we can.

Sure.

> Though that'd reduce the size of some of the structures, I'd still
> have other concerns (such as tlb size, which is something stupid
> like 4 pages, IIRC), and the space wastage you mentioned. Page 

it has 8 pages for data and 2 for instructions, that's 16M data and 4M
of instructions with PAE.  4k pages can be cached with at most 64 slots
for data and 32 entries for instructions, that means 256K of data and
128k of instructions. The main disadvantage is that we basically would
waste the 4k tlb slots, and we'd share the same slots with the kernel.
It mostly depends on the workload, but in theory the 8 pages for data
could reduce the pte walking (not to mention that one layer less of pte
would make the pte walking faster too). So I think 2M pages could
speed up some applications, but the main advantage remains that you
wouldn't need to change the page structure handling.

Andrea


* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 15:17                                             ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
  2002-05-03 15:58                                               ` Andrea Arcangeli
@ 2002-05-03 16:02                                               ` Daniel Phillips
  2002-05-03 16:20                                                 ` Andrea Arcangeli
  1 sibling, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-03 16:02 UTC (permalink / raw)
  To: Martin J. Bligh, Andrea Arcangeli; +Cc: William Lee Irwin III, linux-kernel

On Friday 03 May 2002 17:17, Martin J. Bligh wrote:
> Andrea apparently wrote:
> > Ah, and of course you could also use 2M pagetables by default to make it
> > more usable but still you would run into some huge ram wastage in certain
> > usages with small files, huge pageins and reads, swapouts and swapins,
> > plus it wouldn't be guaranteed to be transparent to the userspace
> > binaries (for instance mmap offset fields would break backwards
> > compatibility on the required alignment, that's probably the last
> > problem though). Despite its also significant drawbacks and the
> > complexity of the change, probably the 4M pagetables would be the saner
> > approach to manage more efficiently 64G with only a 800M kernel window.
> 
> Though that'd reduce the size of some of the structures, I'd still
> have other concerns (such as tlb size, which is something stupid
> like 4 pages, IIRC), and the space wastage you mentioned. Page 
> clustering is probably a more useful technique - letting the existing
> control structures control groups of pages. For example, one struct
> page could control aligned groups of 4 4K pages, giving us an 
> effective page size of 16K from the management overhead point of
> view (swap in and out in 4 page chunks, etc).

IMHO, this will be a much easier change than storing mem_map in highmem,
and solves 75% of the problem.  It's not just ia32 numa that will benefit
from it.  For example, MIPS supports 16K pages in software, which will
take a lot of load off the tlb.  According to Ralf, there are benefits
re virtual aliasing as well.

-- 
Daniel


* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 15:58                                               ` Andrea Arcangeli
@ 2002-05-03 16:10                                                 ` Martin J. Bligh
  2002-05-03 16:25                                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03 16:10 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: William Lee Irwin III, Daniel Phillips, linux-kernel

> Another interesting problem is that 'struct page *' will be at best a
> cookie, not a valid pointer anymore, not sure what's the best way to
> handle that. Working with pfn would be cleaner rather than working with
> a cookie (somebody could dereference the cookie by mistake thinking it's
> a page structure old style), but if __alloc_pages returns a pfn a whole
> lot of kernel code will break.

Yup, a physical address pfn would probably be best.

(such as tlb size, which is something stupid like 4 pages, IIRC)

> it has 8 pages for data and 2 for instructions, that's 16M data and 4M
> of instructions with PAE

What is "it", a P4? I think the sizes are dependent on which chip you're
using. The x440 has the P4 chips, but the NUMA-Q is P2 or P3 (even
PPro for the oldest ones, but those don't work at the moment with Linux
on multiquad).

M.



* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:02                                               ` Daniel Phillips
@ 2002-05-03 16:20                                                 ` Andrea Arcangeli
  2002-05-03 16:41                                                   ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 16:20 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> and solves 75% of the problem.  It's not just ia32 numa that will benefit
> from it.  For example, MIPS supports 16K pages in software, which will

the whole change would be specific to ia32, I don't see the connection
with mips. There would be nothing to share between ia32 2M pages and
mips 16K pages. You can do mips 16K just now independently from the
page_size of ia32. 16K should work without surprises because other archs
have pages of this size and even bigger. Nobody has pages as large as
2M yet, that's two orders of magnitude bigger. 16K for example is just
fine for the read()/pagein/pageout I/O, DMA is usually done in larger
chunks anyways with readahead and async-flushing to be faster (but never
as big as 2M, the highest limit is 512k per scsi command).

Andrea


* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:10                                                 ` Martin J. Bligh
@ 2002-05-03 16:25                                                   ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 16:25 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: William Lee Irwin III, Daniel Phillips, linux-kernel

On Fri, May 03, 2002 at 09:10:46AM -0700, Martin J. Bligh wrote:
> > Another interesting problem is that 'struct page *' will be at best a
> > cookie, not a valid pointer anymore, not sure what's the best way to
> > handle that. Working with pfn would be cleaner rather than working with
> > a cookie (somebody could dereference the cookie by mistake thinking it's
> > a page structure old style), but if __alloc_pages returns a pfn a whole
> > lot of kernel code will break.
> 
> Yup, a physical address pfn would probably be best.
> 
> (such as tlb size, which is something stupid like 4 pages, IIRC)

you recall the mean correctly :), it's 8 for data and 2 for instructions.
But I don't think the tlb is the problem, potentially it's a big win for
the big apps like databases, more ram addressed via tlb and faster
pagetable lookups; it's the I/O granularity for the pageins that is
probably the most annoying part. Even if you've a fast disk, 2M instead
of kbytes is going to make a difference, as well as the fact that 4M per page
and the bh on the pagecache would waste quite a lot of ram with small
files.

> > it has 8 pages for data and 2 for instructions, that's 16M data and 4M
> > of instructions with PAE
> 
> What is "it", a P4? I think the sizes are dependant on which chip you're

I didn't read whether P4 changes that, nor have I checked the athlon yet; I read
it in the usual (and a bit old) system programming manual 3.

> using. The x440 has the P4 chips, but the NUMA-Q is is P2 or P3 (even
> PPro for the oldest ones, but those don't work at the moment with Linux
> on multiquad).

that's the P6 family, so PPro, P2 and P3 are all included (only P5 excluded).

Andrea


* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:20                                                 ` Andrea Arcangeli
@ 2002-05-03 16:41                                                   ` Daniel Phillips
  2002-05-03 16:58                                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-03 16:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Friday 03 May 2002 18:20, Andrea Arcangeli wrote:
> On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> > and solves 75% of the problem.  It's not just ia32 numa that will benefit
> > from it.  For example, MIPS supports 16K pages in software, which will
> 
> the whole change would be specific to ia32, I don't see the connection
> with mips. There would be nothing to share between ia32 2M pages and
> mips 16K pages.

The topic here is 'page clustering'.  The idea is to use one struct page for
every four 4K page frames on ia32.

-- 
Daniel


* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:41                                                   ` Daniel Phillips
@ 2002-05-03 16:58                                                     ` Andrea Arcangeli
  2002-05-03 18:08                                                       ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 16:58 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Fri, May 03, 2002 at 06:41:15PM +0200, Daniel Phillips wrote:
> On Friday 03 May 2002 18:20, Andrea Arcangeli wrote:
> > On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> > > and solves 75% of the problem.  It's not just ia32 numa that will benefit
> > > from it.  For example, MIPS supports 16K pages in software, which will
> > 
> > the whole change would be specific to ia32, I don't see the connection
> > with mips. There would be nothing to share between ia32 2M pages and
> > mips 16K pages.
> 
> The topic here is 'page clustering'.  The idea is to use one struct page for
> every four 4K page frames on ia32.

ah ok, I meant physical hardware pages. physical hardware pages should
be doable without common code changes, a software PAGE_SIZE or the
PAGE_CACHE_SIZE raises non-trivial problems instead.

Andrea


* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:58                                                     ` Andrea Arcangeli
@ 2002-05-03 18:08                                                       ` Daniel Phillips
  0 siblings, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-03 18:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Friday 03 May 2002 18:58, Andrea Arcangeli wrote:
> On Fri, May 03, 2002 at 06:41:15PM +0200, Daniel Phillips wrote:
> > On Friday 03 May 2002 18:20, Andrea Arcangeli wrote:
> > > On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> > > > and solves 75% of the problem.  It's not just ia32 numa that will benefit
> > > > from it.  For example, MIPS supports 16K pages in software, which will
> > > 
> > > the whole change would be specific to ia32, I don't see the connection
> > > with mips. There would be nothing to share between ia32 2M pages and
> > > mips 16K pages.
> > 
> > The topic here is 'page clustering'.  The idea is to use one struct page for
> > every four 4K page frames on ia32.
> 
> ah ok, I meant physical hardware pages. physical hardware pages should
> be doable without common code changes, a software PAGE_SIZE or the
> PAGE_CACHE_SIZE raises non-trivial problems instead.

Yes, it's not too bad though.  In the swap-in path, the locking would be against
mem_map + (pfn >> 2).  The four pages don't have to be read in and valid all at
the same time - it's ok to take multiple faults on the cluster, not recommended,
but ok.  In the swap-out path, all four page frames have to be swapped out and
invalidated at the same time.
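The locking rule above can be sketched directly (struct page is reduced to a stub here, and the helper name is invented for illustration):

```c
#include <assert.h>

/* With 4-frame clustering, all four frames of a cluster resolve to the
 * same struct page, so locking "against mem_map + (pfn >> 2)"
 * serializes faults on the whole cluster. */
struct page { unsigned long flags; };

static struct page *pfn_to_cluster_page(struct page *mem_map,
					unsigned long pfn)
{
	return mem_map + (pfn >> 2);
}
```

Two faults on frames 0 and 3 would contend on the same struct page, which is exactly what makes the multiple-fault case safe.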

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  1:19                   ` Daniel Phillips
@ 2002-05-03 19:47                     ` Dave Engebretsen
  2002-05-03 22:06                       ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Dave Engebretsen @ 2002-05-03 19:47 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: William Lee Irwin III, Andrea Arcangeli, linux-kernel

Daniel Phillips wrote:
> 
> The boot loader must have provided at least some contiguous physical
> memory in order to load the kernel, the compressed disk image and give
> us a little working memory.  (For practical purposes, we're most likely to
> have been provided with a full gig, or whatever is appropriate according
> to the mem= command line setting, but let's pretend it's a lot less than
> that.)  Now, the first thing we need to do is fill in enough of the
...
> 
> Naturally, during initialization of the hash table, we want to be sure
> not to perform any phys_to_logical translations, as would be required to
> read values from the page tables during swap-out for example.  Probably
> there's already no possibility of that, but it needs a comment at least.
> 
> I can't provide any more details than that, because I'm not familiar
> with the way the iseries boots.  Anton is the man there.

The way it works on iSeries is the HV provides a 64MB physically
contiguous load area.  The kernel & working storage, including the
logical->physical (or absolute, in our terms) map, must fit in this space.
Even with a 256KB chunk size and a simple array for translation, the
memory consumption is not excessive.  Each array entry is 32 bits,
allowing 32+12 (page offset) = 44 bits of physical addressability, and a 1MB
array allows 32GB of translations.

We don't need the reverse translation on iSeries as the kernel never
knows about the actual hardware address, other than when putting an
entry in the hardware page tables (processor and I/O).  One other thing
to note is that Linux _always_ runs with relocation enabled on iSeries, so
there is never a point, other than the one I mention above, when the
hardware address matters.
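The flat translation array Dave describes could look roughly like this (a sketch only; the field widths and names are illustrative, not the actual iSeries layout):

```c
#include <assert.h>

/* The partition's zero-based "logical" space is handed out in 256K
 * chunks that can sit anywhere in the 2**52 real address space, so a
 * simple array maps a logical chunk number to its real chunk number. */
#define CHUNK_SHIFT 18			/* 256K chunks */
#define CHUNK_MASK  ((1UL << CHUNK_SHIFT) - 1)

static unsigned long long logical_to_real(const unsigned int *chunk_map,
					  unsigned long logical)
{
	unsigned long chunk = logical >> CHUNK_SHIFT;	/* array index */
	unsigned long off = logical & CHUNK_MASK;	/* offset in chunk */
	return ((unsigned long long)chunk_map[chunk] << CHUNK_SHIFT) | off;
}
```

The lookup is a single shift, index, and or, which is why the memory consumption of the array, rather than the translation cost, is the main concern.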

> We ought to have some clue about the maximum number of physical memory
> chunks available to us.  I doubt *every* partition is going to be
> provided 256 GB of memory.  In fact, the real amount we need will be
> considerably less, and the phys_to_logical table will be smaller than
> 16 MB, say, 1 MB.  Just allocate the whole thing and be done with it.

... 

> 
> > How do you know what
> > the size of the table should be if the number of chunks varies
> > dramatically?
> 
> The most obvious and practical approach is to have the boot loader tell
> us, we allocate the maximum size needed, and won't worry about that
> again.
> 

Yes, we do know the maximum memory possible, both system wide, and more
importantly within a partition.  In fact the way it works today, a
partition is defined to the hypervisor with a current memory size to use
and a max memory size.  The max is required because the hardware page
table for PowerPC is allocated to the max size before the partition
boots.  Because the page table must be physically contiguous, it is
allocated for the partition when the system boots.  The size of the
Linux translation tables is a similar issue where the worst case should
just be considered and allocated at Linux boot time.

Dave Engebretsen


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03 19:47                     ` Dave Engebretsen
@ 2002-05-03 22:06                       ` Daniel Phillips
  0 siblings, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-03 22:06 UTC (permalink / raw)
  To: Dave Engebretsen; +Cc: William Lee Irwin III, Andrea Arcangeli, linux-kernel

On Friday 03 May 2002 21:47, Dave Engebretsen wrote:
> We don't need the reverse translation on iSeries as the kernel never
> knows about the actual hardware address, other than when putting an
> entry in the hardware page tables (processor and I/O).

So the kernel page tables are carrying what I'd call a logical address,
that is, zero-based, indexing your logical-to-physical table (physical
taken in a non-literal sense).

This would suggest that your current arrangement is a strict subset of
my current config_nonlinear design, flat table and all, but with
phys_to_pagenum defined as a compile-time error.

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-02  0:20             ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
                                 ` (2 preceding siblings ...)
  2002-05-02 23:05               ` Daniel Phillips
@ 2002-05-03 23:52               ` David Mosberger
  3 siblings, 0 replies; 165+ messages in thread
From: David Mosberger @ 2002-05-03 23:52 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Andrea Arcangeli, Daniel Phillips, Russell King, linux-kernel,
	Jesse Barnes

[Looks like this buffer was lying dormant in my Emacs and never sent.
 Hence the delay... ;-) ]

>>>>> On Thu, 2 May 2002 10:20:11 +1000, Anton Blanchard <anton@samba.org> said:

  >> so ia64 is one of those archs with a ram layout with huge holes
  >> in the middle of the ram of the nodes? I'd be curious to know
  >> what's the hardware advantage of designing the ram layout in such
  >> a way, compared to all other numa archs that I deal with. Also if
  >> you know other archs with huge holes in the middle of the ram of
  >> the nodes I'd be curious to know about them too. thanks for the
  >> interesting info!

  >> From arch/ppc64/kernel/iSeries_setup.c:

  Anton>  * The iSeries may have very large memories ( > 128 GB ) and
  Anton>  * a partition may get memory in "chunks" that may be anywhere
  Anton>  * in the 2**52 real address space.  The chunks are 256K in
  Anton>  * size.

  Anton> Also check out CONFIG_MSCHUNKS code and see why I'd love to
  Anton> see a generic solution to this problem.

Me too.  HP's zx1 platform also has a rather giant hole above the 1GB
boundary.  I don't know the exact reasons for this hole, but it's
related to the fact that (many) PCI devices need <4GB memory.

The current solution for zx1 is to place the mem_map in virtual
memory.  This obviously increases TLB pressure when touching lots of
mem_map[] entries randomly, but I haven't really seen any benchmarks
so far (real or artificial) where this has a significant performance
effect.  The nice part of this approach is that it is a rather general
solution, provided the kernel's page-table mapped address space is
sufficiently big.

	--david


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-03  5:15                                       ` Andrea Arcangeli
@ 2002-05-05 23:54                                         ` Daniel Phillips
  2002-05-06  0:28                                           ` Andrea Arcangeli
                                                             ` (2 more replies)
  0 siblings, 3 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-05 23:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel

On Friday 03 May 2002 07:15, Andrea Arcangeli wrote:
> On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote:
> > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote:
> > > 
> > > correct. The direct mapping is nothing magic, it's like a big static
> > > kmap area.  Everybody is required to use
> > > virt_to_page/page_address/pci_map_single/... to switch between virtual
> > > address and mem_map anyways (thanks to the discontiguous mem_map), so you
> > > can use this property by making the virtual space discontiguous as well,
> > > not only the mem_map.  discontigmem basically just allows that.
> > 
> > And what if you don't have enough virtual space to fit all the memory you
> 
> ZONE_NORMAL is by definition limited by the direct mapping size, so if
> you don't have enough virtual space you cannot enlarge the zone_normal
> anyways. If you need more virtual space you can only do things like
> CONFIG_2G.

I must be guilty of not explaining clearly.  Suppose you have the following
physical memory map:

	          0: 128 MB
	  8000,0000: 128 MB
	1,0000,0000: 128 MB
	1,8000,0000: 128 MB
	2,0000,0000: 128 MB
	2,8000,0000: 128 MB
	3,0000,0000: 128 MB
	3,8000,0000: 128 MB

The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space
can only handle 128 MB of it.  The rest falls out of the addressable range and
has to be handled as highmem, that is if you preserve the linear relationship
between kernel virtual memory and physical memory, as config_discontigmem does.
Even if you go to 2G of kernel memory (restricting user space to 2G of virtual)
you can only handle 256 MB.

By using config_nonlinear, the kernel can directly address all of that memory,
giving you the full 800MB or so to work with (leaving out the kmap regions etc)
as zone_normal.
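For the example layout above, a nonlinear __va() could be sketched as follows (PAGE_OFFSET and the 2G stride are taken from the example; the function name and packing scheme are invented for illustration, not Daniel's actual design):

```c
#include <assert.h>

/* Eight 128M chunks sit at 2G physical strides; pack them back-to-back
 * into the direct-mapped window so the whole 1G can be ZONE_NORMAL. */
#define CHUNK_SIZE	(128UL << 20)	/* 128 MB per chunk */
#define STRIDE_SHIFT	31		/* one chunk every 2 GB of physical */
#define PAGE_OFFSET_X	0xC0000000UL	/* illustrative 3G/1G split */

static unsigned long nonlinear_va(unsigned long long phys)
{
	unsigned long chunk = (unsigned long)(phys >> STRIDE_SHIFT);
	unsigned long off =
		(unsigned long)(phys & ((1ULL << STRIDE_SHIFT) - 1));
	return PAGE_OFFSET_X + chunk * CHUNK_SIZE + off;
}
```

The translation is still cheap, but it is no longer the single-addition __va() of the linear direct mapping, which is the price of addressing all eight chunks.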

-- 
Daniel


* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-05 23:54                                         ` Daniel Phillips
@ 2002-05-06  0:28                                           ` Andrea Arcangeli
  2002-05-06  0:34                                             ` Daniel Phillips
  2002-05-06  0:55                                           ` Russell King
  2002-05-06  8:54                                           ` Roman Zippel
  2 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-06  0:28 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> On Friday 03 May 2002 07:15, Andrea Arcangeli wrote:
> > On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote:
> > > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote:
> > > > 
> > > > correct. The direct mapping is nothing magic, it's like a big static
> > > > kmap area.  Everybody is required to use
> > > > virt_to_page/page_address/pci_map_single/... to switch between virtual
> > > > address and mem_map anyways (thanks to the discontiguous mem_map), so you
> > > > can use this property by making the virtual space discontiguous as well,
> > > > not only the mem_map.  discontigmem basically just allows that.
> > > 
> > > And what if you don't have enough virtual space to fit all the memory you
> > 
> > ZONE_NORMAL is by definition limited by the direct mapping size, so if
> > you don't have enough virtual space you cannot enlarge the zone_normal
> > anyways. If you need more virtual space you can only do things like
> > CONFIG_2G.
> 
> I must be guilty of not explaining clearly.  Suppose you have the following
> physical memory map:
> 
> 	          0: 128 MB
> 	  8000,0000: 128 MB
> 	1,0000,0000: 128 MB
> 	1,8000,0000: 128 MB
> 	2,0000,0000: 128 MB
> 	2,8000,0000: 128 MB
> 	3,0000,0000: 128 MB
> 	3,8000,0000: 128 MB
> 
> The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space
> can only handle 128 MB of it.  The rest falls out of the addressable range and
> has to be handled as highmem, that is if you preserve the linear relationship
> between kernel virtual memory and physical memory, as config_discontigmem does.
> Even if you go to 2G of kernel memory (restricting user space to 2G of virtual)
> you can only handle 256 MB.
> 
> By using config_nonlinear, the kernel can directly address all of that memory,
> giving you the full 800MB or so to work with (leaving out the kmap regions etc)
> as zone_normal.

If those different 128M chunks aren't in different numa nodes, that's
broken hardware that can be worked around just fine with discontigmem. If,
as expected, they are placed on different numa nodes (indeed similar to
numa-q), then they must go into pgdats regardless, so nonlinear or not
makes no difference with numa. Either way (whether it's broken hardware
worked around with discontigmem, or a proper numa architecture) there
will be no problem at all in coalescing the blocks below 4G into
ZONE_NORMAL (and for the blocks above 4G nonlinear can do nothing).

nonlinear is only needed with origin2k (and possibly iseries if the
partitioning is extremely inefficient), where discontigmem with hundreds
or thousands of pgdats would not be capable of working around the weird
physical memory layout because it would perform too poorly.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:28                                           ` Andrea Arcangeli
@ 2002-05-06  0:34                                             ` Daniel Phillips
  2002-05-06  1:01                                               ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06  0:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel

On Monday 06 May 2002 02:28, Andrea Arcangeli wrote:
> On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > On Friday 03 May 2002 07:15, Andrea Arcangeli wrote:
> > > On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote:
> > > > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote:
> > > > > 
> > > > > correct. The direct mapping is nothing magic, it's like a big static
> > > > > kmap area.  Everybody is required to use
> > > > > virt_to_page/page_address/pci_map_single/... to switch between virtual
> > > > > address and mem_map anyways (thanks to the discontigous mem_map), so you
> > > > > can use this property by making discontigous the virtual space as well,
> > > > > not only the mem_map.  discontigmem basically just allows that.
> > > > 
> > > > And what if you don't have enough virtual space to fit all the memory you
> > > 
> > > ZONE_NORMAL is by definition limited by the direct mapping size, so if
> > > you don't have enough virtual space you cannot enlarge the zone_normal
> > > anyways. If need more virtual space you can only do  things like
> > > CONFIG_2G.
> > 
> > I must be guilty of not explaining clearly.  Suppose you have the following
> > physical memory map:
> > 
> > 	          0: 128 MB
> > 	  8000,0000: 128 MB
> > 	1,0000,0000: 128 MB
> > 	1,8000,0000: 128 MB
> > 	2,0000,0000: 128 MB
> > 	2,8000,0000: 128 MB
> > 	3,0000,0000: 128 MB
> > 	3,8000,0000: 128 MB
> > 
> > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> > can only handle 128 MB of it.  The rest falls out of the addressable range and
> > has to be handled as highmem, that is if you preserve the linear relationship
> > between kernel virtual memory and physical memory, as config_discontigmem does.
> > Even if you go to 2G of kernel memory (restricting user space to 2G of virtual)
> > you can only handle 256 MB.
> > 
> > By using config_nonlinear, the kernel can directly address all of that memory,
> > giving you the full 800MB or so to work with (leaving out the kmap regions etc)
> > as zone_normal.
> 
> If those different 128M chunks aren't in different numa nodes, that's
> broken hardware that can be worked around just fine with discontigmem.

It's real hardware - broken operating system.  And no, it's not numa.

Could you please explain how to work around it with discontigmem?

> If,
> as expected, they are placed on different numa nodes (indeed similar to
> numa-q), then they must go into pgdats regardless, so nonlinear or not
> makes no difference with numa. Either way (whether it's broken hardware
> worked around with discontigmem, or a proper numa architecture) there
> will be no problem at all in coalescing the blocks below 4G into
> ZONE_NORMAL (and for the blocks above 4G nonlinear can do nothing).

Why can config_nonlinear do nothing with blocks above 4G physical?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-05 23:54                                         ` Daniel Phillips
  2002-05-06  0:28                                           ` Andrea Arcangeli
@ 2002-05-06  0:55                                           ` Russell King
  2002-05-06  1:07                                             ` Daniel Phillips
                                                               ` (3 more replies)
  2002-05-06  8:54                                           ` Roman Zippel
  2 siblings, 4 replies; 165+ messages in thread
From: Russell King @ 2002-05-06  0:55 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> I must be guilty of not explaining clearly.  Suppose you have the following
> physical memory map:
> 
> 	          0: 128 MB
> 	  8000,0000: 128 MB
> 	1,0000,0000: 128 MB
> 	1,8000,0000: 128 MB
> 	2,0000,0000: 128 MB
> 	2,8000,0000: 128 MB
> 	3,0000,0000: 128 MB
> 	3,8000,0000: 128 MB
> 
> The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> can only handle 128 MB of it.

I see no problem with the above with the existing discontigmem stuff.
discontigmem does *not* require a linear relationship between kernel
virtual and physical memory.  I've been running kernels for a while
on such systems.

Which was the reason for my comment at the start of this thread:
| On ARM, however, we have a cherry to add here.  __va() may alias certain
| physical memory addresses to the same virtual memory address, which
| makes:

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:34                                             ` Daniel Phillips
@ 2002-05-06  1:01                                               ` Andrea Arcangeli
  0 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-06  1:01 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 02:34:49AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 02:28, Andrea Arcangeli wrote:
> > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > > On Friday 03 May 2002 07:15, Andrea Arcangeli wrote:
> > > > On Thu, May 02, 2002 at 09:08:18PM +0200, Daniel Phillips wrote:
> > > > > On Thursday 02 May 2002 20:57, Andrea Arcangeli wrote:
> > > > > > 
> > > > > > correct. The direct mapping is nothing magic, it's like a big static
> > > > > > kmap area.  Everybody is required to use
> > > > > > virt_to_page/page_address/pci_map_single/... to switch between virtual
> > > > > > address and mem_map anyways (thanks to the discontigous mem_map), so you
> > > > > > can use this property by making discontigous the virtual space as well,
> > > > > > not only the mem_map.  discontigmem basically just allows that.
> > > > > 
> > > > > And what if you don't have enough virtual space to fit all the memory you
> > > > 
> > > > ZONE_NORMAL is by definition limited by the direct mapping size, so if
> > > > you don't have enough virtual space you cannot enlarge the zone_normal
> > > > anyways. If need more virtual space you can only do  things like
> > > > CONFIG_2G.
> > > 
> > > I must be guilty of not explaining clearly.  Suppose you have the following
> > > physical memory map:
> > > 
> > > 	          0: 128 MB
> > > 	  8000,0000: 128 MB
> > > 	1,0000,0000: 128 MB
> > > 	1,8000,0000: 128 MB
> > > 	2,0000,0000: 128 MB
> > > 	2,8000,0000: 128 MB
> > > 	3,0000,0000: 128 MB
> > > 	3,8000,0000: 128 MB
> > > 
> > > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> > > can only handle 128 MB of it.  The rest falls out of the addressable range and
> > > has to be handled as highmem, that is if you preserve the linear relationship
> > > between kernel virtual memory and physical memory, as config_discontigmem does.
> > > Even if you go to 2G of kernel memory (restricting user space to 2G of virtual)
> > > you can only handle 256 MB.
> > > 
> > > By using config_nonlinear, the kernel can directly address all of that memory,
> > > giving you the full 800MB or so to work with (leaving out the kmap regions etc)
> > > as zone_normal.
> > 
> > If those different 128M chunks aren't in different numa nodes, that's
> > broken hardware that can be worked around just fine with discontigmem.
> 
> It's real hardware - broken operating system.  And no, it's not numa.

The operating system can work around such a weird memory layout just fine
with discontigmem; there is no problem making such hardware work.

> Could you please explain how to work around it with discontigmem?

Are you serious? That's what ARM has been doing for ages in 2.4; I think
this part was obvious from the whole previous discussion. Just put each
discontiguous chunk into a separate pgdat and it will work flawlessly
(also make sure to apply all pending vm/numa fixes in -aa first; they are
needed for numa anyway). They will all be normal zones provided you
implement a static view of them in the kernel virtual address space, and
you also cover page_address/virt_to_page/pci_map* of course.

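To make that concrete, here is a hypothetical sketch (the bank size,
the 2GB physical stride, and every name here are taken from Daniel's
example layout and invented for illustration, not from any real port)
of such a static virtual view and the lookups it implies:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch for the 8 x 128MB layout above: bank n lives at
 * physical n * 2GB, and is given a static 128MB window in the kernel's
 * direct-mapped space starting at PAGE_OFFSET. */
#define PAGE_OFFSET 0xc0000000UL
#define BANK_SHIFT  27                            /* 128MB per bank */
#define BANK_MASK   ((1UL << BANK_SHIFT) - 1)

/* kernel virtual address -> pgdat (bank) index */
static unsigned int kvaddr_to_nid(unsigned long kva)
{
	return (unsigned int)((kva - PAGE_OFFSET) >> BANK_SHIFT);
}

/* kernel virtual address -> 64-bit physical address */
static uint64_t kvaddr_to_phys(unsigned long kva)
{
	uint64_t nid = kvaddr_to_nid(kva);
	/* banks sit at 2GB physical strides */
	return (nid << 31) + ((kva - PAGE_OFFSET) & BANK_MASK);
}
```

The eight 128MB banks pack into one contiguous gigabyte of virtual
space even though they sit at 2GB physical strides, which is exactly
the "static view" being described.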
Yes, nonlinear would be just a bit faster than discontigmem in the above
scenario (it's not numa, so you are not forced to describe the
discontigmem topology to common code, which would save a bit of runtime),
but avoiding nonlinear also leaves the common code quite a bit simpler,
without adding further mm abstractions.

With hundreds of pgdats the "discontigmem workaround" becomes
prohibitive, and so nonlinear becomes mandatory in a scenario like
origin2k. But in the above scenario, "nonlinear" would be just a minor
optimization that also leads to additional common code complexity.

> > If,
> > as expected, they are placed on different numa nodes (indeed similar to
> > numa-q), then they must go into pgdats regardless, so nonlinear or not
> > makes no difference with numa. Either way (whether it's broken hardware
> > worked around with discontigmem, or a proper numa architecture) there
> > will be no problem at all in coalescing the blocks below 4G into
> > ZONE_NORMAL (and for the blocks above 4G nonlinear can do nothing).
> 
> Why can config_nonlinear do nothing with blocks above 4G physical?

Just to be sure it's clear: by "do nothing" I mean it cannot put them
into zone_normal anyway. Putting the whole thing into zone_normal was the
whole point of your previous email: "By using config_nonlinear, the
kernel can directly address all of that memory...giving you the full
800MB...as zone_normal". I think I already told you why; grep for
vmalloc32 and see why it doesn't pass the __GFP_HIGHMEM flag to GFP.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:55                                           ` Russell King
@ 2002-05-06  1:07                                             ` Daniel Phillips
  2002-05-06  1:20                                               ` Andrea Arcangeli
  2002-05-06  1:09                                             ` Andrea Arcangeli
                                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06  1:07 UTC (permalink / raw)
  To: Russell King; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 02:55, Russell King wrote:
> On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > I must be guilty of not explaining clearly.  Suppose you have the following
> > physical memory map:
> > 
> > 	          0: 128 MB
> > 	  8000,0000: 128 MB
> > 	1,0000,0000: 128 MB
> > 	1,8000,0000: 128 MB
> > 	2,0000,0000: 128 MB
> > 	2,8000,0000: 128 MB
> > 	3,0000,0000: 128 MB
> > 	3,8000,0000: 128 MB
> > 
> > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> > can only handle 128 MB of it.
> 
> I see no problem with the above with the existing discontigmem stuff.
> discontigmem does *not* require a linear relationship between kernel
> virtual and physical memory.  I've been running kernels for a while
> on such systems.

Look, you've got this:

#define __phys_to_virt(ppage) ((unsigned long)(ppage) + PAGE_OFFSET - PHYS_OFFSET)

So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly
linear, the relation __pa(__va(phys)) == phys cannot hold for every
physical address in the map above.  Perhaps that doesn't bother you?

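The objection can be checked numerically. A minimal sketch, modelling
the 32-bit unsigned long with uint32_t so the wraparound is explicit
(PHYS_OFFSET is assumed to be 0 for simplicity; the helper names are
invented):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the linear ARM-style macro on a 32-bit kernel; uint32_t
 * stands in for the 32-bit unsigned long, so arithmetic wraps mod 2^32. */
#define PAGE_OFFSET 0xc0000000u

static uint32_t phys_to_virt32(uint32_t phys)
{
	return phys + PAGE_OFFSET;      /* linear: wraps modulo 2^32 */
}

static int is_kernel_address(uint32_t va)
{
	return va >= PAGE_OFFSET;       /* kernel's 1GB window at 3GB..4GB */
}
```

Only the first gigabyte of physical addresses lands inside the
3GB..4GB kernel window; every higher bank of the example layout wraps
into user space under a purely linear __va().
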
-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:55                                           ` Russell King
  2002-05-06  1:07                                             ` Daniel Phillips
@ 2002-05-06  1:09                                             ` Andrea Arcangeli
  2002-05-06  1:13                                             ` Daniel Phillips
  2002-05-06  2:03                                             ` Daniel Phillips
  3 siblings, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-06  1:09 UTC (permalink / raw)
  To: Russell King; +Cc: Daniel Phillips, Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 01:55:05AM +0100, Russell King wrote:
> I see no problem with the above with the existing discontigmem stuff.
> discontigmem does *not* require a linear relationship between kernel
> virtual and physical memory.  I've been running kernels for a while
> on such systems.

Indeed.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:55                                           ` Russell King
  2002-05-06  1:07                                             ` Daniel Phillips
  2002-05-06  1:09                                             ` Andrea Arcangeli
@ 2002-05-06  1:13                                             ` Daniel Phillips
  2002-05-06  2:03                                             ` Daniel Phillips
  3 siblings, 0 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06  1:13 UTC (permalink / raw)
  To: Russell King; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 02:55, Russell King wrote:
> On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > I must be guilty of not explaining clearly.  Suppose you have the following
> > physical memory map:
> > 
> > 	          0: 128 MB
> > 	  8000,0000: 128 MB
> > 	1,0000,0000: 128 MB
> > 	1,8000,0000: 128 MB
> > 	2,0000,0000: 128 MB
> > 	2,8000,0000: 128 MB
> > 	3,0000,0000: 128 MB
> > 	3,8000,0000: 128 MB
> > 
> > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> > can only handle 128 MB of it.
> 
>... I've been running kernels for a while on such systems.

Could you provide me with an example memory map, please?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  1:07                                             ` Daniel Phillips
@ 2002-05-06  1:20                                               ` Andrea Arcangeli
  2002-05-06  1:24                                                 ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-06  1:20 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 03:07:07AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 02:55, Russell King wrote:
> > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > > I must be guilty of not explaining clearly.  Suppose you have the following
> > > physical memory map:
> > > 
> > > 	          0: 128 MB
> > > 	  8000,0000: 128 MB
> > > 	1,0000,0000: 128 MB
> > > 	1,8000,0000: 128 MB
> > > 	2,0000,0000: 128 MB
> > > 	2,8000,0000: 128 MB
> > > 	3,0000,0000: 128 MB
> > > 	3,8000,0000: 128 MB
> > > 
> > > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> > > can only handle 128 MB of it.
> > 
> > I see no problem with the above with the existing discontigmem stuff.
> > discontigmem does *not* require a linear relationship between kernel
> > virtual and physical memory.  I've been running kernels for a while
> > on such systems.
> 
> Look, you've got this:
> 
> #define __phys_to_virt(ppage) ((unsigned long)(ppage) + PAGE_OFFSET - PHYS_OFFSET)
> 
> So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly linear, the
> relation __pa(__va(kva)) == kva cannot hold.  Perhaps that doesn't bother you?

Check my previous email:

	[..] They will all be normal zones provided you implement a static
	view of them in the kernel virtual address space, and you also
	cover page_address/virt_to_page [..]

Depending on how those chunks are coalesced in the direct mapping,
virt_to_page/page_address will vary. virt_to_page and page_address will
have all the necessary internal knowledge to make it all zone_normal.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  1:20                                               ` Andrea Arcangeli
@ 2002-05-06  1:24                                                 ` Daniel Phillips
  2002-05-06  1:42                                                   ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06  1:24 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 03:20, Andrea Arcangeli wrote:
> On Mon, May 06, 2002 at 03:07:07AM +0200, Daniel Phillips wrote:
> > On Monday 06 May 2002 02:55, Russell King wrote:
> > So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly linear, the
> > relation __pa(__va(kva)) == kva cannot hold.  Perhaps that doesn't bother you?
> 
> Check my previous email:
> 
> 	[..] They will all be normal zones provided you implement a static
> 	view of them in the kernel virtual address space, and you also
> 	cover page_address/virt_to_page [..]
> 
> Depending on the kind of coalescing of those chunks in the direct
> mapping virt_to_page/page_address will vary. virt_to_page and
> page_address will have all the necessary internal knowledge in order to
> make it all zone_normal.

What do you mean by 'implement a static view of them'?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  1:24                                                 ` Daniel Phillips
@ 2002-05-06  1:42                                                   ` Andrea Arcangeli
  2002-05-06  1:48                                                     ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-06  1:42 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1071 bytes --]

On Mon, May 06, 2002 at 03:24:58AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 03:20, Andrea Arcangeli wrote:
> > On Mon, May 06, 2002 at 03:07:07AM +0200, Daniel Phillips wrote:
> > > On Monday 06 May 2002 02:55, Russell King wrote:
> > > So, since __phys_to_virt (and hence phys_to_virt and __va) is clearly linear, the
> > > relation __pa(__va(kva)) == kva cannot hold.  Perhaps that doesn't bother you?
> > 
> > Check my previous email:
> > 
> > 	[..] They will all be normal zones provided you implement a static
> > 	view of them in the kernel virtual address space, and you also
> > 	cover page_address/virt_to_page [..]
> > 
> > Depending on the kind of coalescing of those chunks in the direct
> > mapping virt_to_page/page_address will vary. virt_to_page and
> > page_address will have all the necessary internal knowledge in order to
> > make it all zone_normal.
> 
> What do you mean by 'implement a static view of them'?

See the attached email. Assuming chunks of 256M of ram every 1G: 1G phys
goes at 3G+256M virt, 2G goes at 3G+512M, etc.

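The arithmetic of that static view can be sketched as follows (a
hypothetical helper, not kernel code; the name and constants are
invented to match the 256M-every-1G example):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "static view": chunk n is 256MB of ram at physical
 * n * 1GB, mapped at PAGE_OFFSET + n * 256MB. */
#define PAGE_OFFSET 0xc0000000UL
#define CHUNK_SHIFT 28                            /* 256MB virtual chunks */
#define CHUNK_MASK  ((1UL << CHUNK_SHIFT) - 1)

/* 64-bit physical address -> static kernel virtual address */
static unsigned long chunk_phys_to_virt(uint64_t phys)
{
	unsigned long n = (unsigned long)(phys >> 30);   /* 1GB physical stride */
	return PAGE_OFFSET + (n << CHUNK_SHIFT)
	                   + (unsigned long)(phys & CHUNK_MASK);
}
```

This reproduces the numbers above: physical 1G lands at 3G+256M
(0xd0000000), physical 2G at 3G+512M (0xe0000000).
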
Andrea

[-- Attachment #2: Type: message/rfc822, Size: 1991 bytes --]

From: Andrea Arcangeli <andrea@suse.de>
To: Daniel Phillips <phillips@bonn-fries.net>
Cc: "Martin J. Bligh" <Martin.Bligh@us.ibm.com>, Russell King <rmk@arm.linux.org.uk>, linux-kernel@vger.kernel.org
Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
Date: Thu, 2 May 2002 18:06:32 +0200
Message-ID: <20020502180632.I11414@dualathlon.random>

On Wed, May 01, 2002 at 05:42:40PM +0200, Daniel Phillips wrote:
> On Thursday 02 May 2002 17:35, Andrea Arcangeli wrote:
> > On Thu, May 02, 2002 at 08:18:33AM -0700, Martin J. Bligh wrote:
> > > At the moment I use the contig memory model (so we only use discontig for
> > > NUMA support) but I may need to change that in the future.
> > 
> > I wasn't thinking at numa-q, but regardless numa-Q fits perfectly into
> > the current discontigmem-numa model too as far I can see.
> 
> No it doesn't.  The config_discontigmem model forces all zone_normal memory
> to be on node zero, so all the remaining nodes can only have highmem locally.

You can trivially map the phys mem between 1G and 1G+256M to be in a
direct mapping between 3G+256M and 3G+512M, then you can put such 256M
at offset 1G into the ZONE_NORMAL of node-id 1 with discontigmem too.

The constraints you have on the normal memory are only two:

1) direct mapping
2) DMA

so as long as the ram is capable of 32bit DMA with pci32 and it's mapped
in the direct mapping, you can put it into the normal zone. There is no
difference at all between discontigmem and nonlinear in this sense.

> Even with good cache hardware, this has to hurt.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  1:42                                                   ` Andrea Arcangeli
@ 2002-05-06  1:48                                                     ` Daniel Phillips
  2002-05-06  2:06                                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06  1:48 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Russell King, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 03:42, Andrea Arcangeli wrote:
> On Mon, May 06, 2002 at 03:24:58AM +0200, Daniel Phillips wrote:
> > What do you mean by 'implement a static view of them'?
> 
> See the attached email. assuming chunks of 256M ram every 1G, 1G phys
> goes at 3G+256M virt, 2G goes at 3G+512M etc...

So with PAGE_OFFSET = 0xc0000000, __va(0x40000000) wraps to 0, i.e., not
a kernel address at all, because with config_discontigmem __va is a
simple linear relation.  What do you do about that?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  0:55                                           ` Russell King
                                                               ` (2 preceding siblings ...)
  2002-05-06  1:13                                             ` Daniel Phillips
@ 2002-05-06  2:03                                             ` Daniel Phillips
  2002-05-06  2:31                                               ` Andrea Arcangeli
  2002-05-06  8:57                                               ` Russell King
  3 siblings, 2 replies; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06  2:03 UTC (permalink / raw)
  To: Russell King; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 02:55, Russell King wrote:
> On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > I must be guilty of not explaining clearly.  Suppose you have the following
> > physical memory map:
> > 
> > 	          0: 128 MB
> > 	  8000,0000: 128 MB
> > 	1,0000,0000: 128 MB
> > 	1,8000,0000: 128 MB
> > 	2,0000,0000: 128 MB
> > 	2,8000,0000: 128 MB
> > 	3,0000,0000: 128 MB
> > 	3,8000,0000: 128 MB
> > 
> > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> > can only handle 128 MB of it.
> 
> I see no problem with the above with the existing discontigmem stuff.
> discontigmem does *not* require a linear relationship between kernel
> virtual and physical memory.  I've been running kernels for a while
> on such systems.

I just went through every variant of arm in the kernel tree, and I found
that *all* of them implement a simple linear relationship between kernel
virtual and physical memory, of the form:

   #define __virt_to_phys(vpage) ((vpage) - PAGE_OFFSET + PHYS_OFFSET)
   #define __phys_to_virt(ppage) ((ppage) + PAGE_OFFSET - PHYS_OFFSET)

With such a linear mapping you *cannot* map physical memory distributed across
more than one gig into one gig of kernel virtual memory.

Are you talking about code that isn't in the tree?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  1:48                                                     ` Daniel Phillips
@ 2002-05-06  2:06                                                       ` Andrea Arcangeli
  2002-05-06 17:40                                                         ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-06  2:06 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1333 bytes --]

On Mon, May 06, 2002 at 03:48:30AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 03:42, Andrea Arcangeli wrote:
> > On Mon, May 06, 2002 at 03:24:58AM +0200, Daniel Phillips wrote:
> > > What do you mean by 'implement a static view of them'?
> > 
> > See the attached email. assuming chunks of 256M ram every 1G, 1G phys
> > goes at 3G+256M virt, 2G goes at 3G+512M etc...
> 
> So, __va(0x40000000) = 0xc0000000, and __va(0x80000000) = 0, i.e., not a kernel

I said page_address(), not necessarily __va. Assume the arch specifies
WANT_PAGE_VIRTUAL, because such a page_address wouldn't be that cheap
anyway; see my discussion with William for reference.

> address at all, because with config_discontigmem __va is a simple linear
> relation.  What do you do about that?

You can implement __va however you want; it doesn't need to be a simple
linear relation (see also the attached email from Roman). But regardless,
what really matters is page_address and virt_to_page, not only __va:
just initialize page->virtual to the static kernel window at boot time
using the proper virtual address and you won't run into __va (or let the
arch code specify page_address if CONFIG_DISCONTIGMEM is defined; this
would require a two-liner in mm.h). This was also discussed a few days
ago with William; see the other attached email.

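A toy sketch of that page->virtual approach (the bank addresses, array
shapes, and names are invented for illustration; real struct page has
far more fields):

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of WANT_PAGE_VIRTUAL: each page's virtual address is
 * recorded once at boot, so page_address() never goes through __va(). */
#define PAGE_SHIFT 12
struct page { void *virtual; };

static struct page node_mem_map[2][8];     /* 2 toy banks, 8 pages each */
static const unsigned long node_kva[2] = { 0xc0000000UL, 0xc8000000UL };

/* boot-time: record the static kernel window each bank was mapped to */
static void init_page_virtual(void)
{
	for (int nid = 0; nid < 2; nid++)
		for (int i = 0; i < 8; i++)
			node_mem_map[nid][i].virtual = (void *)
				(node_kva[nid] + ((unsigned long)i << PAGE_SHIFT));
}

static void *page_address(const struct page *page)
{
	return page->virtual;              /* plain field read, no arithmetic */
}
```

The nonlinearity then lives entirely in the boot-time initialization;
page_address() itself stays O(1) and knows nothing about the layout.
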
Andrea

[-- Attachment #2: Type: message/rfc822, Size: 5614 bytes --]

From: Andrea Arcangeli <andrea@suse.de>
To: William Lee Irwin III <wli@holomorphy.com>, "Martin J. Bligh" <Martin.Bligh@us.ibm.com>, Daniel Phillips <phillips@bonn-fries.net>, Russell King <rmk@arm.linux.org.uk>, linux-kernel@vger.kernel.org
Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
Date: Thu, 2 May 2002 20:41:36 +0200
Message-ID: <20020502204136.M11414@dualathlon.random>

On Thu, May 02, 2002 at 10:16:55AM -0700, William Lee Irwin III wrote:
> On Thu, May 02, 2002 at 09:10:00AM -0700, Martin J. Bligh wrote:
> >> Even with 64 bit DMA, the real problem is breaking the assumption
> >> that mem between 0 and 896Mb phys maps 1-1 onto kernel space.
> >> That's 90% of the difficulty of what Dan's doing anyway, as I
> >> see it.
> 
> On Thu, May 02, 2002 at 06:40:37PM +0200, Andrea Arcangeli wrote:
> > control on virt_to_page, pci_map_single, __va.  Actually it may be as
> > well cleaner to just let the arch define page_address() when
> > discontigmem is enabled (instead of hacking on top of __va), that's a
> > few liner. (the only true limit you have is on the phys ram above 4G,
> > that cannot definitely go into zone-normal regardless if it belongs to a
> > direct mapping or not because of pci32 API)
> > Andrea
> 
> Being unable to have any ZONE_NORMAL above 4GB allows no change at all.

No change if your first node maps the whole first 4G of physical address
space, but in such a case nonlinear cannot help you in any way either.
The fact that you can make no change at all has only to do with the fact
that GFP_KERNEL must return memory accessible from a pci32 device.

I think most configurations have more than one node mapped into the
first 4G, and in those configurations you can make changes and spread
the direct mapping across all the nodes mapped in the first 4G phys.

Whether you can or can't change something is not a matter of
discontigmem versus nonlinear; it's all about pci32.

> 32-bit PCI is not used on NUMA-Q AFAIK.

But you can plug 32bit pci hardware into your 64bit-pci slots, right?
If not, and if you're also sure the Linux drivers for your hardware are
all 64bit-pci capable, then you can make the changes regardless of the
4G limit; in that case you can spread the direct mapping over the whole
64G of physical ram, wherever you want, with no 4G constraint anymore.

> 
> So long as zones are physically contiguous and __va() does what its

Zones remain physically contiguous; it's the virtual address returned by
page_address that changes. Also the kmap header will need some
modification: you should always check for PageHIGHMEM in all places to
know whether you must kmap or not; that's a few-liner.

> name implies, page_address() should operate properly aside from the
> sizeof(phys_addr) > sizeof(unsigned long) overflow issue (which I
> believe was recently resolved; if not I will do so myself shortly).
> With SGI's discontigmem, one would need an UNMAP_NR_DENSE() as the
> position in mem_map array does not describe the offset into the region
> of physical memory occupied by the zone. UNMAP_NR_DENSE() may be
> expensive enough architectures using MAP_NR_DENSE() may be better off
> using ARCH_WANT_VIRTUAL, as page_address() is a common operation. If

Yes, as an alternative to moving page_address into the arch code, you
can set WANT_PAGE_VIRTUAL since, as you say, such a function is going to
be more expensive (if it's only a few instructions you can instead
consider moving page_address into the arch code, as said in the previous
email, instead of hacking on __va).

> space conservation is as important a consideration for stability as it
> is on architectures with severely limited kernel virtual address spaces,
> it may be preferable to implement such despite the computational expense.
> iSeries will likely have physically discontiguous zones and so it won't
> be able to use an address calculation based page_address() either.

If you need to support a huge number of discontiguous zones then I'm the
first to agree you want nonlinear instead of discontigmem. I wasn't
aware that hardware exists that normally needs to support hundreds or
thousands of discontigmem zones; for it, discontigmem is prohibitive due
to the O(N) complexity of some code paths. That's not the case for
NUMA-Q though, which also needs the different pgdat structures for the
numa optimizations anyway (and a physical memory partitioned into
hundreds of discontiguous zones still looks to me like a harddisk
partitioned into hundreds of different blkdevs).

BTW, about the pgdat loop optimizations: you misunderstood what I meant
in a previous email. By "removing them" I didn't mean removing them in
the discontigmem case (that would have to be done case by case); I meant
removing them only for mainline 2.4.19-pre7 when the kernel is compiled
for an x86 target, as 99% of the userbase uses it. A discontigmem using
nonlinear also doesn't need to loop. It's a one-branch-removal
optimization (it doesn't decrease the complexity of the algorithm when
discontigmem is enabled). It's all under #ifndef CONFIG_DISCONTIGMEM.
Dropping the loop when discontigmem is enabled is a much more
interesting optimization, of course.

Andrea

[-- Attachment #3: Type: message/rfc822, Size: 2453 bytes --]

From: Roman Zippel <zippel@linux-m68k.org>
To: Daniel Phillips <phillips@bonn-fries.net>
Cc: Andrea Arcangeli <andrea@suse.de>, Ralf Baechle <ralf@uni-koblenz.de>, Russell King <rmk@arm.linux.org.uk>, linux-kernel@vger.kernel.org
Subject: Re: discontiguous memory platforms
Date: Thu, 02 May 2002 21:40:48 +0200
Message-ID: <3CD19640.3B85BF76@linux-m68k.org>

Daniel Phillips wrote:

> Patching the kernel how, and where?

Check, for example, the __va/__pa functions in asm-ppc/page.h.

> > Anyway, I agree with Andrea, that another mapping isn't really needed.
> > Clever use of the mmu should give you almost the same result.
> 
> We *are* making clever use of the mmu in config_nonlinear, it is doing the
> nonlinear kernel virtual mapping for us.  Did you have something more clever
> in mind?

I mean to map the memory where you need it. The physical<->virtual
mapping won't be one-to-one, but you won't need another abstraction, and
the current vm is already basically able to handle it.

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  2:03                                             ` Daniel Phillips
@ 2002-05-06  2:31                                               ` Andrea Arcangeli
  2002-05-06  8:57                                               ` Russell King
  1 sibling, 0 replies; 165+ messages in thread
From: Andrea Arcangeli @ 2002-05-06  2:31 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 04:03:15AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 02:55, Russell King wrote:
> > On Mon, May 06, 2002 at 01:54:52AM +0200, Daniel Phillips wrote:
> > > I must be guilty of not explaining clearly.  Suppose you have the following
> > > physical memory map:
> > > 
> > > 	          0: 128 MB
> > > 	  8000,0000: 128 MB
> > > 	1,0000,0000: 128 MB
> > > 	1,8000,0000: 128 MB
> > > 	2,0000,0000: 128 MB
> > > 	2,8000,0000: 128 MB
> > > 	3,0000,0000: 128 MB
> > > 	3,8000,0000: 128 MB
> > > 
> > > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space
> > > can only handle 128 MB of it.
> > 
> > I see no problem with the above with the existing discontigmem stuff.
> > discontigmem does *not* require a linear relationship between kernel
> > virtual and physical memory.  I've been running kernels for a while
> > on such systems.
> 
> I just went through every variant of arm in the kernel tree, and I found that
> *all* of them implement a simple linear relationship between kernel virtual and
> physical memory, of the form:
> 
>    #define __virt_to_phys(vpage) ((vpage) - PAGE_OFFSET + PHYS_OFFSET)
>    #define __phys_to_virt(ppage) ((ppage) + PAGE_OFFSET - PHYS_OFFSET)

ARM is an example that the pgdat way is fine. As an example of the other
part, about the zone_normal coalescing (page_address/__va/virt_to_page),
check ppc and m68k. ARM doesn't have highmem, so it's clearly not tight
on address space in the first place (remember, it's not a high-end
cpu; it pays off big time in other areas), and it couldn't take
advantage of making the kernel virtual address space a nonlinear
mapping of the physical address space. Did you actually read Roman's
email of a few days ago that shows you __va is even used nonlinearly?

> Are you talking about code that isn't in the tree?

First of all, it doesn't matter whether there is a nonlinear __va in the
ppc and m68k trees; whether something can be done doesn't depend on
whether somebody did it before. But in this case somebody did do it in
practice, too.

I have the feeling you reply too fast, ignoring previous emails, so please
try to ask specific questions about the non-obvious things you disagree
with in the past emails, or you'll waste resources. If you ask about
things that have just been discussed, ignoring the previous discussions
completely, I will probably not have time to answer next time (just as
I have no time for IRC, for similar reasons), sorry.

Andrea

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-05 23:54                                         ` Daniel Phillips
  2002-05-06  0:28                                           ` Andrea Arcangeli
  2002-05-06  0:55                                           ` Russell King
@ 2002-05-06  8:54                                           ` Roman Zippel
  2002-05-06 15:26                                             ` Daniel Phillips
  2 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-06  8:54 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

On Mon, 6 May 2002, Daniel Phillips wrote:

> I must be guilty of not explaining clearly.  Suppose you have the following
> physical memory map:
> 
> 	          0: 128 MB
> 	  8000,0000: 128 MB
> 	1,0000,0000: 128 MB
> 	1,8000,0000: 128 MB
> 	2,0000,0000: 128 MB
> 	2,8000,0000: 128 MB
> 	3,0000,0000: 128 MB
> 	3,8000,0000: 128 MB
> 
> The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space,
> can only handle 128 MB of it.  The rest falls out of the addressable range and
> has to be handled as highmem, that is if you preserve the linear relationship
> between kernel virtual memory and physical memory, as config_discontigmem does.
> Even if you go to 2G of kernel memory (restricting user space to 2G of virtual)
> you can only handle 256 MB.

Why do you want to preserve the linear relationship between virtual and
physical memory? There is little common code (and only during
initialization), which assumes a direct mapping. I can send you the
patches to fix this. Then you can map as much physical memory as you want
into a single virtual area and you only need a single pgdat.

bye, Roman


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  2:03                                             ` Daniel Phillips
  2002-05-06  2:31                                               ` Andrea Arcangeli
@ 2002-05-06  8:57                                               ` Russell King
  1 sibling, 0 replies; 165+ messages in thread
From: Russell King @ 2002-05-06  8:57 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Mon, May 06, 2002 at 04:03:15AM +0200, Daniel Phillips wrote:
> On Monday 06 May 2002 02:55, Russell King wrote:
> > I see no problem with the above with the existing discontigmem stuff.
> > discontigmem does *not* require a linear relationship between kernel
> > virtual and physical memory.  I've been running kernels for a while
> > on such systems.
> 
> I just went through every variant of arm in the kernel tree, and I found that
> *all* of them implement a simple linear relationship between kernel virtual
> and physical memory, of the form:

Whoops.  I didn't say _current_ kernels, did I? 8)  (Don't write mails at
2am...)

We got rid of it later as we cleaned up the kernel mappings to use ioremap
instead of static device mappings.  Hence 2.4/2.5 don't contain them any
more.  However, from 2.3.35:

diff -urN linux-orig/include/asm-arm/arch-sa1100/memory.h linux/include/asm-arm/arch-sa1100/memory.h
...
 /*
  * The following gives a maximum memory size of 128MB (32MB in each bank).
- *
- * Does this still need to be optimised for one bank machines?
  */
-#define __virt_to_phys(x)      (((x) & 0xe0ffffff) | ((x) & 0x06000000) << 2)
-#define __phys_to_virt(x)      (((x) & 0xe7ffffff) | ((x) & 0x30000000) >> 2)
+#define __virt_to_phys(x)      (((x) & 0xf9ffffff) | ((x) & 0x06000000) << 2)
+#define __phys_to_virt(x)      (((x) & 0xe7ffffff) | ((x) & 0x18000000) >> 2)

This type of mapping went away in 2.4.0-test9, which is after this
particular platform got discontig mem support in 2.3.99-pre2-rmk1.

An example that is right up to date, and was the subject of the first mail,
is:

+#define __virt_to_phys(vpage)   (((vpage) + ((vpage) & 0x18000000)) & \
+                                 ~0x40000000)
+
+#define __phys_to_virt(ppage)   (((ppage) & ~0x30000000) | \
+                                 (((ppage) & 0x30000000) >> 1) | \
+                                 0x40000000)

You won't find this one in my patches nor Linus' kernel tree though.

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  8:54                                           ` Roman Zippel
@ 2002-05-06 15:26                                             ` Daniel Phillips
  2002-05-06 19:07                                               ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06 15:26 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 10:54, Roman Zippel wrote:
> Hi,
> 
> On Mon, 6 May 2002, Daniel Phillips wrote:
> 
> > I must be guilty of not explaining clearly.  Suppose you have the following
> > physical memory map:
> > 
> > 	          0: 128 MB
> > 	  8000,0000: 128 MB
> > 	1,0000,0000: 128 MB
> > 	1,8000,0000: 128 MB
> > 	2,0000,0000: 128 MB
> > 	2,8000,0000: 128 MB
> > 	3,0000,0000: 128 MB
> > 	3,8000,0000: 128 MB
> > 
> > The total is 1 GB of installed ram.  Yet the kernel's 1G virtual space
> > can only handle 128 MB of it.  The rest falls out of the addressable range and
> > has to be handled as highmem, that is if you preserve the linear relationship
> > between kernel virtual memory and physical memory, as config_discontigmem does.
> > Even if you go to 2G of kernel memory (restricting user space to 2G of virtual)
> > you can only handle 256 MB.
> 
> Why do you want to preserve the linear relationship between virtual and
> physical memory?

I don't; I observed that in all known instances of config_discontigmem the
linear relationship is preserved.  Now, you and Andrea are suggesting that no
such linear relation is strictly necessary, and I believe it's worth
investigating further, to see how it would work and how it compares to
config_nonlinear.

> There is little common code (and only during
> initialization), which assumes a direct mapping. I can send you the
> patches to fix this.

I already have patches to do that, that is, config_nonlinear.  I'm interested in
looking at your patches though, because we might as well give all the different
approaches a fair examination.

> Then you can map as much physical memory as you want
> into a single virtual area and you only need a single pgdat.

You're talking about your 68K solution with the loops that search through
memory regions?  If so, I've already looked at it and understand it.  Or, if
it's a new approach, then naturally I'd be interested.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06  2:06                                                       ` Andrea Arcangeli
@ 2002-05-06 17:40                                                         ` Daniel Phillips
  2002-05-06 19:09                                                           ` Martin J. Bligh
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-06 17:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, linux-kernel

This thread is already long enough, I propose that after your response
to this we take it private.  The executive summary of this post is:
"show me the code".

On Monday 06 May 2002 04:06, Andrea Arcangeli wrote:

> You can implement __va as you want, it doesn't need ot be a simple
> linear relation (see also the attached email from Roman),

Here's the relevant comment from Roman:

> I mean to map the memory where you need it. The physical<->virtual
> mapping won't be one-to-one, but you won't need another abstraction, and
> the current vm is already basically able to handle it.
> 
> bye, Roman

Roman is talking about an implementation idea that so far hasn't been
presented in the form of working code.  I have already implemented __va
as I want: it works, it's efficient, it's simple, clean, powerful and
extensible.  If Roman has an alternative, I'd be interested in looking
at the patch.

> but regardless
> what matters really is page_address and virt_to_page, not only __va,
> just initialize page->virtual to the static kernel window at boot time

OK, so you want to tie things to page->address.  It's an interesting
proposition, I'd like to see your code.

Keep in mind that your new use of page->address conflicts with the
current move to get rid of it from mainline, except for highmem use.
I also have doubts about the efficiency and cleanliness of your proposal.
Your __pa and __va are going to get more expensive because they now
have to work through the struct page, requiring multiplies as well
as lookups.  I think you'll end up with something more complex and
less efficient than config_nonlinear - please prove me wrong by
showing me the code.

You also need some sort of structure that tells you how to set up your
static mapping in the kernel.  I already have that; you still need to
describe it.  In fact, config_nonlinear's way of doing the mem_map
initialization required no changes at all to the mem_map initialization
code.  Such results tend to suggest a particular design approach is
indeed correct.

Now, it would be interesting to see exactly what changes are required
to config_nonlinear to allow it to cover numa usage as well as
non-numa usage.  As far as I can see, I simply have to elaborate my
mapping between pagenum and struct page, i.e., I have to do what's
necessary to put the mem_map structure into the local node.  I
believe that's possible without requiring any double table lookups.

Note that for NUMA-Q, the ->lmem_map arrays are currently off-node for
all but node zero, so the per-node ->lmem_map is doing nothing for
NUMA-Q at the moment.  In order for this to make sense for NUMA-Q, I
really do have to provide a local mapping of a portion of zone_numa,
otherwise we might as well just use config_nonlinear in its current
form.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06 15:26                                             ` Daniel Phillips
@ 2002-05-06 19:07                                               ` Roman Zippel
  2002-05-08 15:57                                                 ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-06 19:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

On Mon, 6 May 2002, Daniel Phillips wrote:

> I don't; I observed that in all known instances of config_discontigmem the
> linear relationship is preserved.

That's true, but m68k isn't using config_discontigmem. :)

> > There is little common code (and only during
> > initialization), which assumes a direct mapping. I can send you the
> > patches to fix this.
> 
> I already have patches to do that, that is, config_nonlinear.  I'm interested in
> looking at your patches though, because we might as well give all the different
> approaches a fair examination.

See below; the patch is almost complete:
- the only other user of free_area_init_core() still needs to be updated
- the virt_to_page(phys_to_virt()) sequence could now be replaced with
  pfn_page()

> You're talking about your 68K solution with the loops that search through
> memory regions?  If so, I've already looked at it and understand it.

That's just how the virtual<->physical conversion is implemented.

>  Or, if
> it's a new approach, then naturally I'd be interested.

It's not really new; you only have to take care that you don't iterate
with the physical address over a pgdat. This is what the patch below
fixes; the rest can be hidden in the arch macros and no special config
option is needed.

bye, Roman

Index: mm/bootmem.c
===================================================================
RCS file: /home/linux-m68k/cvsroot/linux/mm/bootmem.c,v
retrieving revision 1.1.1.4
retrieving revision 1.5
diff -u -p -r1.1.1.4 -r1.5
--- mm/bootmem.c	11 Feb 2002 17:51:47 -0000	1.1.1.4
+++ mm/bootmem.c	11 Feb 2002 18:34:49 -0000	1.5
@@ -243,7 +243,7 @@ found:
 
 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
 {
-	struct page *page = pgdat->node_mem_map;
+	struct page *page;
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long i, count, total = 0;
 	unsigned long idx;
@@ -256,21 +256,22 @@ static unsigned long __init free_all_boo
 	map = bdata->node_bootmem_map;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
-		if (v) { 
-			unsigned long m;
-			for (m = 1; m && i < idx; m<<=1, page++, i++) { 
-				if (v & m) {
+		unsigned long m;
+		if (!v) {
+			i+=BITS_PER_LONG;
+			continue;
+		}
+		for (m = 1; m && i < idx; m<<=1, i++) {
+			if (!(v & m))
+				continue;
+			page = virt_to_page(phys_to_virt((i << PAGE_SHIFT) +
+							 bdata->node_boot_start));
 			count++;
 			ClearPageReserved(page);
 			set_page_count(page, 1);
 			__free_page(page);
 		}
 	}
-		} else {
-			i+=BITS_PER_LONG;
-			page+=BITS_PER_LONG; 
-		} 	
-	}	
 	total += count;
 
 	/*
Index: mm/page_alloc.c
===================================================================
RCS file: /home/linux-m68k/cvsroot/linux/mm/page_alloc.c,v
retrieving revision 1.1.1.14
retrieving revision 1.17
diff -u -p -r1.1.1.14 -r1.17
--- mm/page_alloc.c	6 May 2002 08:52:16 -0000	1.1.1.14
+++ mm/page_alloc.c	6 May 2002 09:11:36 -0000	1.17
@@ -796,7 +796,7 @@ static inline unsigned long wait_table_b
  *   - clear the memory bitmaps
  */
 void __init free_area_init_core(int nid, pg_data_t *pgdat, struct page **gmap,
-	unsigned long *zones_size, unsigned long zone_start_paddr, 
+	unsigned long *zones_size, unsigned long zone_start_vaddr, 
 	unsigned long *zholes_size, struct page *lmem_map)
 {
 	unsigned long i, j;
@@ -804,7 +804,7 @@ void __init free_area_init_core(int nid,
 	unsigned long totalpages, offset, realtotalpages;
 	const unsigned long zone_required_alignment = 1UL << (MAX_ORDER-1);
 
-	if (zone_start_paddr & ~PAGE_MASK)
+	if (zone_start_vaddr & ~PAGE_MASK)
 		BUG();
 
 	totalpages = 0;
@@ -837,7 +837,7 @@ void __init free_area_init_core(int nid,
 	}
 	*gmap = pgdat->node_mem_map = lmem_map;
 	pgdat->node_size = totalpages;
-	pgdat->node_start_paddr = zone_start_paddr;
+	pgdat->node_start_paddr = __pa(zone_start_vaddr);
 	pgdat->node_start_mapnr = (lmem_map - mem_map);
 	pgdat->nr_zones = 0;
 
@@ -889,9 +889,9 @@ void __init free_area_init_core(int nid,
 
 		zone->zone_mem_map = mem_map + offset;
 		zone->zone_start_mapnr = offset;
-		zone->zone_start_paddr = zone_start_paddr;
+		zone->zone_start_paddr = __pa(zone_start_vaddr);
 
-		if ((zone_start_paddr >> PAGE_SHIFT) & (zone_required_alignment-1))
+		if ((zone_start_vaddr >> PAGE_SHIFT) & (zone_required_alignment-1))
 			printk("BUG: wrong zone alignment, it will crash\n");
 
 		/*
@@ -906,8 +906,8 @@ void __init free_area_init_core(int nid,
 			SetPageReserved(page);
 			memlist_init(&page->list);
 			if (j != ZONE_HIGHMEM)
-				set_page_address(page, __va(zone_start_paddr));
-			zone_start_paddr += PAGE_SIZE;
+				set_page_address(page, zone_start_vaddr);
+			zone_start_vaddr += PAGE_SIZE;
 		}
 
 		offset += size;
@@ -954,7 +954,7 @@ void __init free_area_init_core(int nid,
 
 void __init free_area_init(unsigned long *zones_size)
 {
-	free_area_init_core(0, &contig_page_data, &mem_map, zones_size, 0, 0, 0);
+	free_area_init_core(0, &contig_page_data, &mem_map, zones_size, PAGE_OFFSET, 0, 0);
 }
 
 static int __init setup_mem_frac(char *str)



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06 17:40                                                         ` Daniel Phillips
@ 2002-05-06 19:09                                                           ` Martin J. Bligh
  0 siblings, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-06 19:09 UTC (permalink / raw)
  To: Daniel Phillips, Andrea Arcangeli; +Cc: linux-kernel

> Note that for NUMA-Q, the ->lmem_map arrays are currently off-node for
> all but node zero, so the per-node ->lmem_map is doing nothing for
> NUMA-Q at the moment.  In order for this to make sense for NUMA-Q, I
> really do have to provide a local mapping of a portion of zone_numa,
> otherwise we might as well just use config_nonlinear in its current
> form.

To split hairs, they're not currently off-node - as they have to reside in
ZONE_NORMAL, I can't make them local until I have the nonlinear stuff
(or equivalent). But they ought to be on their home node, so your point
is pretty much the same ;-) AFAIK, all other NUMA arches use a local
lmem_map already.

Is zone_numa a typo for zone_normal, or did I lose track of the conversation 
at some point? I'm not sure I grok the last sentence of yours ....

M.



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-06 19:07                                               ` Roman Zippel
@ 2002-05-08 15:57                                                 ` Daniel Phillips
  2002-05-08 23:11                                                   ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-08 15:57 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Monday 06 May 2002 21:07, Roman Zippel wrote:
> Hi,
> 
> On Mon, 6 May 2002, Daniel Phillips wrote:
> 
> > I don't, I observed that in all known instances of config_discontigmem, 
> > that linear relationship is preserved.
> 
> That's true, but m68k isn't using config_discontigmem. :)

Right.  In fact, your two-way phys_to_virt/virt_to_phys mapping makes it 
more like config_nonlinear.  You don't define the contiguous logical memory 
space though, and perhaps that's the reason you need the free_area_init 
changes in the patch below.

Your patch preserves a linear relationship between physical and virtual 
memory, because you do both the ptov and vtop lookups in the same array.  As
such, you don't provide the functionality I provide of being able to fit a
large amount of physical memory into a small amount of virtual memory, and
you can't join all your separate pgdats into one, as I do.  (The latter is 
desirable because it allows the memory manager to allocate from one 
homogeneous space, reducing the likelihood of zone balancing problems.)

We could, if we want, implement your variable sized memory chunk system with
config_nonlinear. You'd just have to replace the

    ulong psection[MAX_SECTIONS]

with:

    struct { ulong base; ulong size; } pchunk[MAX_CHUNKS];

and replace the four direct table lookups with loops.  Highmem does not need
to be a special case, by the way.  Another by the way: you've accidentally
repeated the last four lines of mm_vtop.  Finally, it looks like your 
ZTWO_VADDR hack in mm_ptov would also cease to be a special case; at least,
the special-case part would move to initialization instead of every __va
operation.  So you would end up with *zero* special cases in the page 
translation functions of page.h.

> ...you only have to take care, that you don't iterate
> with the physical address over a pgdat, this is what the patch below
> fixes, the rest can be hidden in the arch macros and no special config
> options is needed.

You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK.  You just didn't
attempt to create the contiguous logical address space as I did, so you 
didn't need to go outside your arch.

The generic part of config_nonlinear is tiny anyway - only 200 lines, and it
might grow to 400 by the time all device driver usage of __pa is reclassified 
as either virt_to_phys or virt_to_logical - the latter being a rather nice 
distinction to make even if the mapping is the same, don't you think?  I.e., 
it's like the distinction between pointer and integer: if it's a physical 
address you can pass it to dma hardware, for example, and if it's logical 
you're just using it for accounting.

Whenever it's possible to elevate a per-arch feature to the generic level
without compromising functionality, it should be done, modulo programmer 
time.  At the generic level, it's easier to document, we get
cross-pollination from improvements developed on different arches, and it's
easier to build on.  Going the other way and allowing design features to
fray across architectures takes us in the direction of unmaintainable bloat.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-08 15:57                                                 ` Daniel Phillips
@ 2002-05-08 23:11                                                   ` Roman Zippel
  2002-05-09 16:08                                                     ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-08 23:11 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

Daniel Phillips wrote:

> Your patch preserves a linear relationship between physical and virtual
> memory, because you do both the ptov and vtop lookup in the same array.  As
> such, you don't provide the functionality I provide of being able to fit a
> large amount of physical memory into a small amount of virtual memory, and
> you can't join all your separate pgdat's into one, as I do.

Read the source again. arch/m68k/mm/motorola.c:map_chunk() maps all
memory areas into a single virtual area (virtaddr is static there and only
increases, starting from PAGE_OFFSET). In paging_init() there is only a
single call to free_area_init().

> and replace the four direct table lookup with loops.

The loops are only an implementation detail and can be replaced with
another algorithm.

> you've accidently
> repeated the last four lines of mm_vtop.  Finally, it looks like your
> ZTWO_VADDR hack in mm_ptov would also cease to be a special case, at least,

That stuff has been obsolete for ages; it should be replaced with BUG().

> You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK.

That was our cheap answer to avoid the loops.

>  You just didn't
> attempt to create the contiguous logical address space as I did, so you
> didn't need to go outside your arch.

I don't need that, because I create a contiguous _virtual_ address
space.

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-08 23:11                                                   ` Roman Zippel
@ 2002-05-09 16:08                                                     ` Daniel Phillips
  2002-05-09 22:06                                                       ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-09 16:08 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Thursday 09 May 2002 01:11, Roman Zippel wrote:
> Hi,
> 
> Daniel Phillips wrote:
> 
> > Your patch preserves a linear relationship between physical and virtual
> > memory, because you do both the ptov and vtop lookup in the same array.  As
> > such, you don't provide the functionality I provide of being able to fit a
> > large amount of physical memory into a small amount of virtual memory, and
> > you can't join all your separate pgdat's into one, as I do.
> 
> Read the source again. arch/m68k/mm/motorola.c:map_chunk() maps all
> memory areas into single virtual area (virtaddr is static there and only
> increased starting from PAGE_OFFSET). In paging_init() there is only a
> single call to free_area_init().

Oops, yes, I see how it works; it relies on your O(N) search for the inverse. 
(Obligatory snipe: there are almost no comments for this opaque code; I hope 
you share my feeling that this needs fixing.)

Searching the table instead of doing a direct lookup allows you to eliminate 
one of my two tables.  This is not a property you'd want to tie yourself to 
though, since the cost for any large number of chunks will be excessive, and 
will show up in the page table manipulation overhead.

Now it seems our strategies are a lot more similar than different.  So what 
were we arguing about again?  I've just gone further with the generalization 
of this, and cast it into a more general form suitable for use across more 
than one arch.

Where you ignore the distinction between logical and physical, it costs you 
execution time, as where you wrote  page = virt_to_page(phys_to_virt((i << 
PAGE_SHIFT) + bdata->node_boot_start))  where formerly we just had page++.  
This is in generic code, too.  Unless you put an #ifdef CONFIG_SOMETHING 
there, I recommend this code *not* be merged, because it penalizes the 
common case for the sake of your arch.  And it's unnecessary even for your 
arch, as I've demonstrated.

Incidentally, the reason I came up with the virtual/logical distinction in 
the first place is that I found myself writing awkward constructs like the 
one you wrote above, and thought there must be a better way.  Indeed there is.

> > You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK.
> 
> That was our cheap answer to avoid the loops.

My cheap answer is to turn the option off.  So why don't I need a config 
option again?

> >  You just didn't
> > attempt to create the contiguous logical address space as I did, so you
> > didn't need to go outside your arch.
> 
> I don't need that, because I create a contiguous _virtual_ address
> space.

Again, we're arguing about what?  So do I.  The relationship between virtual 
and logical, for me, is just logical = virtual - PAGE_OFFSET, a meme you'll 
find in many places in the kernel source already, often obscured by the 
impression that physical addresses are really being manipulated when in fact 
nothing of the kind is going on - the simple truth is, the arithmetic gets 
easier when you work zero-based instead of PAGE_OFFSET-based.

So now that we know we're both doing the same thing, could we please stop 
doing the catholics vs the protestants thing and maybe cooperate?

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-09 16:08                                                     ` Daniel Phillips
@ 2002-05-09 22:06                                                       ` Roman Zippel
  2002-05-09 22:22                                                         ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-09 22:06 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

Daniel Phillips wrote:

> Where you ignore the distinction between logical and physical, it costs you
> execution time, as where you wrote  page = virt_to_page(phys_to_virt((i <<
> PAGE_SHIFT) + bdata->node_boot_start)) where formerly we just had page++.
> This is in generic code too.  Unless you have an #ifdef CONFIG_SOMETHING there I
> recommend this code *not* be merged because it penalizes the common case for
> the sake of your arch.  And it's unnecessary even for your arch, as I've
> demonstrated.

1. My patch only modifies init code, I don't think it's really a problem
if it's slightly slower.
2. Above can now be written as "page = pfn_to_page(i +
(bdata->node_boot_start >> PAGE_SHIFT))". Nice, isn't it? :)

> > > You do have a config option, it's CONFIG_SINGLE_MEMORY_CHUNK.
> >
> > That was our cheap answer to avoid the loops.
> 
> My cheap answer is to turn the option off.  So why don't I need a config
> option again?

Do you know what that option does?

> > I don't need that, because I create a contiguous _virtual_ address
> > space.
> 
> Again, we're arguing about what?  So do I.  The relationship between virtual
> and logical, for me, is just logical = virtual - PAGE_OFFSET, a meme you'll
> find in many places in the kernel source already, often obscured by the
> impression that physical addresses are really being manipulated when in fact
> nothing of the kind is going on - the simple truth is, the arithmetic gets
> easier when you work zero-based instead of PAGE_OFFSET-based.

Why do you want to introduce another abstraction? If the logical address
is basically the same as the virtual address, just use the virtual
address. What difference should that offset make? Could you show me
please one single example?

> So now that we know we're both doing the same thing, could we please stop
> doing the catholics vs the protestants thing and maybe cooperate?

I'm an atheist. >:-)

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-09 22:06                                                       ` Roman Zippel
@ 2002-05-09 22:22                                                         ` Daniel Phillips
  2002-05-09 23:00                                                           ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-09 22:22 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Friday 10 May 2002 00:06, Roman Zippel wrote:
> 1. My patch only modifies init code, I don't think it's really a problem
> if it's slightly slower.

But why be slower when we don't have to?  And why slow down *all* architectures?

> 2. Above can now be written as "page = pfn_to_page(i +
> (bdata->node_boot_start >> PAGE_SHIFT))". Nice, isn't it? :)

page++ is nicer yet.

> > > I don't need that, because I create a contiguous _virtual_ address
> > > space.
> > 
> > Again, we're arguing about what?  So do I.  The relationship between virtual
> > and logical, for me, is just logical = virtual - PAGE_OFFSET, a meme you'll
> > find in many places in the kernel source already, often obscured by the
> > impression that physical addresses are really being manipulated when in fact
> > nothing of the kind is going on - the simple truth is, the arithmetic gets
> > easier when you work zero-based instead of PAGE_OFFSET-based.
> 
> Why do you want to introduce another abstraction?

The abstraction is already there.  I didn't create the logical space; I identified
it.  There are places where the code is really manipulating logical addresses, not
physical addresses, and these are not explicitly identified.  Identifying them
explicitly makes the code cleaner and easier to read.

Your question is really 'why introduce any abstraction', or maybe you're asking
'is this an abstraction worth introducing'?  Clearly it is, since it makes
bootmem run faster, with nothing but name changes.

> If the logical address
> is basically the same as the virtual address, just use the virtual
> address.

But kernel coders have already done that in lots of places.  Why?  Because it's
a pain to do arithmetic where everything is at an offset, and difficult to read.
Not to mention, bulkier.

> What difference should that offset make? Could you show me
> please one single example?

Look at drivers/char/mem.c, read_mem.  Clearly, the code is not dealing with
physical addresses.  Yet it starts off with virt_to_phys, and thereafter works
in zero-offset addresses.  Why?  Because it's clearer and more efficient to do
that.  The generic part of my nonlinear patch clarifies this usage by rewriting
it as virt_to_logical, which is really what's happening.

That's really what's happening in bootmem too.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-09 22:22                                                         ` Daniel Phillips
@ 2002-05-09 23:00                                                           ` Roman Zippel
  2002-05-09 23:22                                                             ` Daniel Phillips
  0 siblings, 1 reply; 165+ messages in thread
From: Roman Zippel @ 2002-05-09 23:00 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

Daniel Phillips wrote:

> On Friday 10 May 2002 00:06, Roman Zippel wrote:
> > 1. My patch only modifies init code, I don't think it's really a problem
> > if it's slightly slower.
> 
> But why be slower when we don't have to?  And why slow down *all* architectures?
> 
> > 2. Above can now be written as "page = pfn_to_page(i +
> > (bdata->node_boot_start >> PAGE_SHIFT))". Nice, isn't it? :)
> 
> page++ is nicer yet.

Is memmap[i++] so much worse? Let me repeat, this is only executed once
at boot!

> > Why do you want to introduce another abstraction?
> 
> The abstraction is already there.  I didn't create the logical space, I identified
> it.

And it's called virtual address space.

>  There are places where the code is really manipulating logical addresses, not
> physical addresses, and these are not explicitly identified.  This makes the code
> cleaner and easier to read.

_Please_ show me an example.

> Look at drivers/char/mem.c, read_mem.  Clearly, the code is not dealing with
> physical addresses.  Yet it starts off with virt_to_phys, and thereafter works
> in zero-offset addresses.  Why?  Because it's clearer and more efficient to do
> that.  The generic part of my nonlinear patch clarifies this usage by rewriting
> it as virt_to_logical, which is really what's happening.

Are we looking at the same code??? Where is that zero-offset thingie? It
just works with virtual and physical addresses and needs to convert
between them.

> That's really what's happening in bootmem too.

That also works with just physical and virtual addresses. What are you
talking about???

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-09 23:00                                                           ` Roman Zippel
@ 2002-05-09 23:22                                                             ` Daniel Phillips
  2002-05-10  0:13                                                               ` Roman Zippel
  0 siblings, 1 reply; 165+ messages in thread
From: Daniel Phillips @ 2002-05-09 23:22 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

On Friday 10 May 2002 01:00, Roman Zippel wrote:
> > Look at drivers/char/mem.c, read_mem.  Clearly, the code is not dealing with
> > physical addresses.  Yet it starts off with virt_to_phys, and thereafter works
> > in zero-offset addresses.  Why?  Because it's clearer and more efficient to do
> > that.  The generic part of my nonlinear patch clarifies this usage by rewriting
> > it as virt_to_logical, which is really what's happening.
> 
> Are we looking at the same code??? Where is that zero-offset thingie? It
> just works with virtual and physical addresses and needs to convert
> between them.

Show me where the 'physical' address is actually treated as a physical address.
You can't, because it isn't.  The 'physical' address is merely a zero-based
logical address, and the code *relies* on it being contiguous.

Your code is going to do __pa there, and you are going to go walking into places
you don't expect.  Even you need my logical address space abstraction, or else you
want to go making global changes to the common kernel code that just add cruft.

I enjoy the feeling of removing cruft, even when it's an uphill battle.

-- 
Daniel

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
  2002-05-09 23:22                                                             ` Daniel Phillips
@ 2002-05-10  0:13                                                               ` Roman Zippel
  0 siblings, 0 replies; 165+ messages in thread
From: Roman Zippel @ 2002-05-10  0:13 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Andrea Arcangeli, Martin J. Bligh, linux-kernel

Hi,

Daniel Phillips wrote:

> Show me where the 'physical' address is actually treated as a physical address.
> You can't, because it isn't.  The 'physical' address is merely a zero-based
> logical address, and the code *relies* on it being contiguous.

Most of the code doesn't care about physical addresses, because they
either work with virtual memory or with the page structure. Physical
addresses are only interesting to pass them to the hardware or to put
them into the page table.

> Your code is going to do __pa there, and you are going to go walking into places
> you don't expect.  Even you need my logical address space abstraction, or else you
> want to go making global changes to the common kernel code that just add cruft.

So far I've only seen a virtual address with some offset. You can maybe
move that offset around, but you can't remove it. In the end it's the
same.

> I enjoy the feeling of removing cruft, even when it's an uphill battle.

I'm happy to see patches.

bye, Roman

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-05  0:49         ` Denis Vlasenko
@ 2002-05-05 17:59           ` Martin J. Bligh
  0 siblings, 0 replies; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-05 17:59 UTC (permalink / raw)
  To: vda; +Cc: linux-kernel

> On 3 May 2002 20:35, Martin J. Bligh wrote:
>> > No. It's not stupid. Unix defines a kind of operating system that
>> > has certain characteristics and/or attributes. Process/kernel shared
>> > address space is one of them. It's a name that has historical
>> > signifigance.
>> 
>> Yes it is stupid. This is a small implementation detail, and has no
>> real importance whatsoever. People have done this in the past
>> (Dynix/PTX did it) and will do so in the future. Nor does the kernel
>> address space have to be global and shared across all tasks
>> as stated earlier in this thread. What makes it Unix is the interface
>> it presents to the world, and how it behaves, not the little details
>> of how it's implemented inside.
> 
> I'm curious where it is visible to userspace?
> (I'm asking for educational purposes)

Where what is visible to userspace? If you mean the bit about 
"the interface it presents to the world", I meant Linux in 
general, not this feature. The whole point is that this is 
invisible to userspace (apart from performance and a lack of
architectural restrictions you might have been expecting) 
therefore it's irrelevant to whether it's "Unix" like or not.

M.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 22:35       ` Martin J. Bligh
@ 2002-05-05  0:49         ` Denis Vlasenko
  2002-05-05 17:59           ` Martin J. Bligh
  0 siblings, 1 reply; 165+ messages in thread
From: Denis Vlasenko @ 2002-05-05  0:49 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

On 3 May 2002 20:35, Martin J. Bligh wrote:
> > No. It's not stupid. Unix defines a kind of operating system that
> > has certain characteristics and/or attributes. Process/kernel shared
> > address space is one of them. It's a name that has historical
> significance.
>
> Yes it is stupid. This is a small implementation detail, and has no
> real importance whatsoever. People have done this in the past
> (Dynix/PTX did it) and will do so in the future. Nor does the kernel
> address space have to be global and shared across all tasks
> as stated earlier in this thread. What makes it Unix is the interface
> it presents to the world, and how it behaves, not the little details
> of how it's implemented inside.

I'm curious where it is visible to userspace?
(I'm asking for educational purposes)
--
vda

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 19:30     ` Richard B. Johnson
@ 2002-05-03 22:35       ` Martin J. Bligh
  2002-05-05  0:49         ` Denis Vlasenko
  0 siblings, 1 reply; 165+ messages in thread
From: Martin J. Bligh @ 2002-05-03 22:35 UTC (permalink / raw)
  To: root, Jeff Dike; +Cc: linux-kernel

> No. It's not stupid. Unix defines a kind of operating system that
> has certain characteristics and/or attributes. Process/kernel shared
> address space is one of them. It's a name that has historical
> significance.

Yes it is stupid. This is a small implementation detail, and has no
real importance whatsoever. People have done this in the past
(Dynix/PTX did it) and will do so in the future. Nor does the kernel 
address space have to be global and shared across all tasks
as stated earlier in this thread. What makes it Unix is the interface
it presents to the world, and how it behaves, not the little details
of how it's implemented inside.

M.

PS. I've been told Solaris x86 can do 4Gb for each of kernel
and user space, though I've no first hand experience with that
OS.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
                     ` (3 preceding siblings ...)
  2002-05-03 19:50   ` Tony Luck
@ 2002-05-03 20:22   ` Jeff Dike
  2002-05-03 19:30     ` Richard B. Johnson
  4 siblings, 1 reply; 165+ messages in thread
From: Jeff Dike @ 2002-05-03 20:22 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

root@chaos.analogic.com said:
> Would you please tell me what Unix has 32-bit address space which is
> not shared with the kernel? 

I'm planning on doing that with UML at some point.

The claim that it's not Unix if it doesn't share the process address space
is just stupid.

				Jeff


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
                     ` (2 preceding siblings ...)
  2002-05-03 19:38   ` Matti Aarnio
@ 2002-05-03 19:50   ` Tony Luck
  2002-05-03 20:22   ` Jeff Dike
  4 siblings, 0 replies; 165+ messages in thread
From: Tony Luck @ 2002-05-03 19:50 UTC (permalink / raw)
  To: root; +Cc: linux-kernel


--- "Richard B. Johnson" <root@chaos.analogic.com>
wrote:
> 
> I think that if this shared address-space doesn't
> exist
> then you don't have "Unix". You have something
> (perhaps
> better), but it's not Unix. 

Looking back a little earlier in the history of Unix,
we see that early versions ran on 16-bit
architectures. Does anyone out there remember Version
6 on the pdp11. It most certainly did not share the
address space (all 64k of it) between user and kernel.
Are you trying to say that what Dennis and Ken wrote
is not "Unix"?

-Tony

__________________________________________________
Do You Yahoo!?
Yahoo! Health - your guide to health and wellness
http://health.yahoo.com

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
  2002-04-27  1:15   ` Pavel Machek
  2002-05-03 19:09   ` Christoph Hellwig
@ 2002-05-03 19:38   ` Matti Aarnio
  2002-05-03 19:50   ` Tony Luck
  2002-05-03 20:22   ` Jeff Dike
  4 siblings, 0 replies; 165+ messages in thread
From: Matti Aarnio @ 2002-05-03 19:38 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: linux-kernel

On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
...
> > This hasn't been an absolute requirement. There have
> > been 32-bit Unix implementations that gave separate
> > 4G address spaces to the kernel and to each user
> > process.  The only real downside to this is that
> > copyin()/copyout() are more complex. Some processors
> > provided special instructions to access user-mode
> > addresses from kernel to mitigate this complexity.
> > 
> > -Tony
> 
> Really? The only 32-bit Unix's I've seen the details of
> are SCO Unix, Interactive Unix, Linux, and BSD Unix.

   An example of hardware with fully separable user/kernel spaces
   are Motorola 68020-68060 series processors.

   They have those special instructions to choose (in kernel mode)
   what address spaces to use at which data access phase of the
   special moves.  There is some speed penalty, of course...

...
> Would you please tell me what Unix has 32-bit address space
> which is not shared with the kernel?

   That could be the one called "Linux", if a bunch of conditions
   are met -- beginning with suitable hardware.

> Cheers,
> Dick Johnson
> Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

/Matti Aarnio

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 20:22   ` Jeff Dike
@ 2002-05-03 19:30     ` Richard B. Johnson
  2002-05-03 22:35       ` Martin J. Bligh
  0 siblings, 1 reply; 165+ messages in thread
From: Richard B. Johnson @ 2002-05-03 19:30 UTC (permalink / raw)
  To: Jeff Dike; +Cc: linux-kernel

On Fri, 3 May 2002, Jeff Dike wrote:

> root@chaos.analogic.com said:
> > Would you please tell me what Unix has 32-bit address space which is
> > not shared with the kernel? 
> 
> I'm planning on doing that with UML at some point.
> 
> The claim that it's not Unix if it doesn't share the process address space
> is just stupid.
> 

No. It's not stupid. Unix defines a kind of operating system that
has certain characteristics and/or attributes. Process/kernel shared
address space is one of them. It's a name that has historical
significance.

Linux does not have to be Unix. In fact, divorcing virtual address
space may make a better Operating System and it's good that somebody
is planning that. But the result will not be the 25-30-year-old
architecture we call Unix. It will be Linux. And it just might
be the thing that makes Linux shine above others, so don't call
this difference stupid.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:17     ` Richard B. Johnson
@ 2002-05-03 19:24       ` Christoph Hellwig
  0 siblings, 0 replies; 165+ messages in thread
From: Christoph Hellwig @ 2002-05-03 19:24 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Tony Luck, linux-kernel

On Fri, May 03, 2002 at 03:17:35PM -0400, Richard B. Johnson wrote:
> On Fri, 3 May 2002, Christoph Hellwig wrote:
> 
> > On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
> > > The other Unix's I've become familiar with are Sun-OS, the
> > 
> > SunOS 5 uses separate address spaces on sparcv9 (32 and 64bit).
> > The same is true for many Linux ports, e.g. sparc64 or s390.
> > 
> 
> No no! I'm not talking about the physical address spaces. Many
> CPUs have separate address spaces for separate functions. I'm
> talking about the virtual address space that the process sees.
> There are no holes in this virtual address space of SunOS, and
> no "separate stuff" (I/O space) seen by a user-mode task.

This thread was about separate user/kernel VIRTUAL address spaces.
Not about holes, I/O spaces or other crap.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:09   ` Christoph Hellwig
@ 2002-05-03 19:17     ` Richard B. Johnson
  2002-05-03 19:24       ` Christoph Hellwig
  0 siblings, 1 reply; 165+ messages in thread
From: Richard B. Johnson @ 2002-05-03 19:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Tony Luck, linux-kernel

On Fri, 3 May 2002, Christoph Hellwig wrote:

> On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
> > The other Unix's I've become familiar with are Sun-OS, the
> 
> SunOS 5 uses separate address spaces on sparcv9 (32 and 64bit).
> The same is true for many Linux ports, e.g. sparc64 or s390.
> 

No no! I'm not talking about the physical address spaces. Many
CPUs have separate address spaces for separate functions. I'm
talking about the virtual address space that the process sees.
There are no holes in this virtual address space of SunOS, and
no "separate stuff" (I/O space) seen by a user-mode task.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
  2002-04-27  1:15   ` Pavel Machek
@ 2002-05-03 19:09   ` Christoph Hellwig
  2002-05-03 19:17     ` Richard B. Johnson
  2002-05-03 19:38   ` Matti Aarnio
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 165+ messages in thread
From: Christoph Hellwig @ 2002-05-03 19:09 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Tony Luck, linux-kernel

On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
> The other Unix's I've become familiar with are Sun-OS, the

SunOS 5 uses separate address spaces on sparcv9 (32 and 64bit).
The same is true for many Linux ports, e.g. sparc64 or s390.

^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 18:37 Virtual address space exhaustion (was Discontigmem virt_to_page() ) Tony Luck
@ 2002-05-03 19:01 ` Richard B. Johnson
  2002-04-27  1:15   ` Pavel Machek
                     ` (4 more replies)
  0 siblings, 5 replies; 165+ messages in thread
From: Richard B. Johnson @ 2002-05-03 19:01 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel

On Fri, 3 May 2002, Tony Luck wrote:

> Richard B. Johnson wrote:
> > One of the Unix characteristics is that the kernel
> > address space is shared with each of the process
> > address space.
> 
> This hasn't been an absolute requirement. There have
> been 32-bit Unix implementations that gave separate
> 4G address spaces to the kernel and to each user
> process.  The only real downside to this is that
> copyin()/copyout() are more complex. Some processors
> provided special instructions to access user-mode
> addresses from kernel to mitigate this complexity.
> 
> -Tony
> 
Really? The only 32-bit Unix's I've seen the details of
are SCO Unix, Interactive Unix, Linux, and BSD Unix.
The other Unix's I've become familiar with are Sun-OS, the
original AT&T(Unix System Labs)/SYS-V and DEC Ultrix.
All these Unix's share user address-space with kernel
address-space. This is supposed to be the very thing
that makes Unix different from other VMS/timeshare
Operating Systems.

I think that if this shared address-space doesn't exist
then you don't have "Unix". You have something (perhaps
better), but it's not Unix. For instance VAX/VMS doesn't
share address space. In fact, the VAX/VMS kernel is, itself,
a process. This means it has its own context. This can
be quite useful.

Would you please tell me what Unix has 32-bit address space
which is not shared with the kernel?

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
@ 2002-05-03 18:37 Tony Luck
  2002-05-03 19:01 ` Richard B. Johnson
  0 siblings, 1 reply; 165+ messages in thread
From: Tony Luck @ 2002-05-03 18:37 UTC (permalink / raw)
  To: linux-kernel

Richard B. Johnson wrote:
> One of the Unix characteristics is that the kernel
> address space is shared with each of the process
> address space.

This hasn't been an absolute requirement. There have
been 32-bit Unix implementations that gave separate
4G address spaces to the kernel and to each user
process.  The only real downside to this is that
copyin()/copyout() are more complex. Some processors
provided special instructions to access user-mode
addresses from kernel to mitigate this complexity.

-Tony



^ permalink raw reply	[flat|nested] 165+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
@ 2002-04-27  1:15   ` Pavel Machek
  2002-05-03 19:09   ` Christoph Hellwig
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 165+ messages in thread
From: Pavel Machek @ 2002-04-27  1:15 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Tony Luck, linux-kernel

Hi!

> > > One of the Unix characteristics is that the kernel
> > > address space is shared with each of the process
> > > address space.
> > 
> > This hasn't been an absolute requirement. There have
> > been 32-bit Unix implementations that gave separate
> > 4G address spaces to the kernel and to each user
> > process.  The only real downside to this is that
> > copyin()/copyout() are more complex. Some processors
> > provided special instructions to access user-mode
> > addresses from kernel to mitigate this complexity.
> > 
> Really? The only 32-bit Unix's I've seen the details of
> are SCO Unix, Interactive Unix, Linux, and BSD Unix.
> The other Unix's I've become familiar with are Sun-OS, the
> original AT&T(Unix System Labs)/SYS-V and DEC Ultrix.
> All these Unix's share user address-space with kernel
> address-space. This is supposed to be the very thing

Remember userspace being accessed through fs: in linux-2.0 days?

That counts as separate address space to me...
								Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 165+ messages in thread

end of thread, other threads:[~2002-05-10  0:13 UTC | newest]

Thread overview: 165+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-04-26 18:27 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Russell King
2002-04-26 22:46 ` Andrea Arcangeli
2002-04-29 17:50   ` Martin J. Bligh
2002-04-29 22:00   ` Roman Zippel
2002-04-30  0:43     ` Andrea Arcangeli
2002-04-27 22:10 ` Daniel Phillips
2002-04-29 13:35   ` Andrea Arcangeli
2002-04-29 23:02     ` Daniel Phillips
2002-05-01  2:23       ` Andrea Arcangeli
2002-04-30 23:12         ` Daniel Phillips
2002-05-01  1:05           ` Daniel Phillips
2002-05-02  0:47           ` Andrea Arcangeli
2002-05-01  1:26             ` Daniel Phillips
2002-05-02  1:43               ` Andrea Arcangeli
2002-05-01  2:41                 ` Daniel Phillips
2002-05-02 13:34                   ` Andrea Arcangeli
2002-05-02 15:18                     ` Martin J. Bligh
2002-05-02 15:35                       ` Andrea Arcangeli
2002-05-01 15:42                         ` Daniel Phillips
2002-05-02 16:06                           ` Andrea Arcangeli
2002-05-02 16:10                             ` Martin J. Bligh
2002-05-02 16:40                               ` Andrea Arcangeli
2002-05-02 17:16                                 ` William Lee Irwin III
2002-05-02 18:41                                   ` Andrea Arcangeli
2002-05-02 19:19                                     ` William Lee Irwin III
2002-05-02 19:27                                       ` Daniel Phillips
2002-05-02 19:38                                         ` William Lee Irwin III
2002-05-02 19:58                                           ` Daniel Phillips
2002-05-03  6:28                                           ` Andrea Arcangeli
2002-05-03  6:10                                         ` Andrea Arcangeli
2002-05-02 22:20                                       ` Martin J. Bligh
2002-05-02 21:28                                         ` William Lee Irwin III
2002-05-02 21:52                                           ` Kurt Ferreira
2002-05-02 21:55                                             ` William Lee Irwin III
2002-05-03  6:38                                         ` Andrea Arcangeli
2002-05-03  6:58                                           ` Martin J. Bligh
2002-05-03  6:04                                       ` Andrea Arcangeli
2002-05-03  6:33                                         ` Martin J. Bligh
2002-05-03  8:38                                           ` Andrea Arcangeli
2002-05-03  9:26                                             ` William Lee Irwin III
2002-05-03 15:38                                               ` Martin J. Bligh
2002-05-03 15:17                                             ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
2002-05-03 15:58                                               ` Andrea Arcangeli
2002-05-03 16:10                                                 ` Martin J. Bligh
2002-05-03 16:25                                                   ` Andrea Arcangeli
2002-05-03 16:02                                               ` Daniel Phillips
2002-05-03 16:20                                                 ` Andrea Arcangeli
2002-05-03 16:41                                                   ` Daniel Phillips
2002-05-03 16:58                                                     ` Andrea Arcangeli
2002-05-03 18:08                                                       ` Daniel Phillips
2002-05-03  9:24                                         ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] William Lee Irwin III
2002-05-03 10:30                                           ` Andrea Arcangeli
2002-05-03 11:09                                             ` William Lee Irwin III
2002-05-03 11:27                                               ` Andrea Arcangeli
2002-05-03 15:42                                             ` Martin J. Bligh
2002-05-03 15:32                                           ` Martin J. Bligh
2002-05-02 19:22                                     ` Daniel Phillips
2002-05-03  6:06                                       ` Andrea Arcangeli
2002-05-02 18:25                                 ` Daniel Phillips
2002-05-02 18:44                                   ` Andrea Arcangeli
2002-05-02 19:31                                 ` Martin J. Bligh
2002-05-02 18:57                                   ` Andrea Arcangeli
2002-05-02 19:08                                     ` Daniel Phillips
2002-05-03  5:15                                       ` Andrea Arcangeli
2002-05-05 23:54                                         ` Daniel Phillips
2002-05-06  0:28                                           ` Andrea Arcangeli
2002-05-06  0:34                                             ` Daniel Phillips
2002-05-06  1:01                                               ` Andrea Arcangeli
2002-05-06  0:55                                           ` Russell King
2002-05-06  1:07                                             ` Daniel Phillips
2002-05-06  1:20                                               ` Andrea Arcangeli
2002-05-06  1:24                                                 ` Daniel Phillips
2002-05-06  1:42                                                   ` Andrea Arcangeli
2002-05-06  1:48                                                     ` Daniel Phillips
2002-05-06  2:06                                                       ` Andrea Arcangeli
2002-05-06 17:40                                                         ` Daniel Phillips
2002-05-06 19:09                                                           ` Martin J. Bligh
2002-05-06  1:09                                             ` Andrea Arcangeli
2002-05-06  1:13                                             ` Daniel Phillips
2002-05-06  2:03                                             ` Daniel Phillips
2002-05-06  2:31                                               ` Andrea Arcangeli
2002-05-06  8:57                                               ` Russell King
2002-05-06  8:54                                           ` Roman Zippel
2002-05-06 15:26                                             ` Daniel Phillips
2002-05-06 19:07                                               ` Roman Zippel
2002-05-08 15:57                                                 ` Daniel Phillips
2002-05-08 23:11                                                   ` Roman Zippel
2002-05-09 16:08                                                     ` Daniel Phillips
2002-05-09 22:06                                                       ` Roman Zippel
2002-05-09 22:22                                                         ` Daniel Phillips
2002-05-09 23:00                                                           ` Roman Zippel
2002-05-09 23:22                                                             ` Daniel Phillips
2002-05-10  0:13                                                               ` Roman Zippel
2002-05-02 22:39                                     ` Martin J. Bligh
2002-05-03  7:04                                       ` Andrea Arcangeli
2002-05-02 23:42                             ` Daniel Phillips
2002-05-03  7:45                               ` Andrea Arcangeli
2002-05-02 16:07                         ` Martin J. Bligh
2002-05-02 16:58                           ` Gerrit Huizenga
2002-05-02 18:10                             ` Andrea Arcangeli
2002-05-02 19:28                               ` Gerrit Huizenga
2002-05-02 22:23                                 ` Martin J. Bligh
2002-05-03  6:20                                 ` Andrea Arcangeli
2002-05-03  6:39                                   ` Martin J. Bligh
2002-05-02 16:00                     ` William Lee Irwin III
2002-05-02  2:37             ` William Lee Irwin III
2002-05-02 15:59               ` Andrea Arcangeli
2002-05-02 16:06                 ` William Lee Irwin III
2002-05-01 18:05         ` Jesse Barnes
2002-05-01 23:17           ` Andrea Arcangeli
2002-05-01 23:23             ` discontiguous memory platforms Jesse Barnes
2002-05-02  0:51               ` Ralf Baechle
2002-05-02  1:27                 ` Andrea Arcangeli
2002-05-02  1:32                   ` Ralf Baechle
2002-05-02  8:50                   ` Roman Zippel
2002-05-01 13:21                     ` Daniel Phillips
2002-05-02 14:00                       ` Roman Zippel
2002-05-01 14:08                         ` Daniel Phillips
2002-05-02 17:56                           ` Roman Zippel
2002-05-01 17:59                             ` Daniel Phillips
2002-05-02 18:26                               ` Roman Zippel
2002-05-02 18:32                                 ` Daniel Phillips
2002-05-02 19:40                                   ` Roman Zippel
2002-05-02 20:14                                     ` Daniel Phillips
2002-05-03  6:34                                       ` Andrea Arcangeli
2002-05-03  9:33                                       ` Roman Zippel
2002-05-03  6:30                                     ` Andrea Arcangeli
2002-05-02 18:35                     ` Geert Uytterhoeven
2002-05-02 18:39                       ` Daniel Phillips
2002-05-02  0:20             ` Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Anton Blanchard
2002-05-01  1:35               ` Daniel Phillips
2002-05-02  1:45                 ` William Lee Irwin III
2002-05-01  2:02                   ` Daniel Phillips
2002-05-02  2:33                     ` William Lee Irwin III
2002-05-01  2:44                       ` Daniel Phillips
2002-05-02  1:46                 ` Andrea Arcangeli
2002-05-01  1:56                   ` Daniel Phillips
2002-05-02  1:01               ` Andrea Arcangeli
2002-05-02 15:28                 ` Anton Blanchard
2002-05-01 16:10                   ` Daniel Phillips
2002-05-02 15:59                   ` Dave Engebretsen
2002-05-01 17:24                     ` Daniel Phillips
2002-05-02 16:44                       ` Dave Engebretsen
2002-05-02 16:31                   ` William Lee Irwin III
2002-05-02 16:21                     ` Dave Engebretsen
2002-05-02 17:28                       ` William Lee Irwin III
2002-05-02 23:05               ` Daniel Phillips
2002-05-03  0:05                 ` William Lee Irwin III
2002-05-03  1:19                   ` Daniel Phillips
2002-05-03 19:47                     ` Dave Engebretsen
2002-05-03 22:06                       ` Daniel Phillips
2002-05-03 23:52               ` David Mosberger
2002-05-03 18:37 Virtual address space exhaustion (was Discontigmem virt_to_page() ) Tony Luck
2002-05-03 19:01 ` Richard B. Johnson
2002-04-27  1:15   ` Pavel Machek
2002-05-03 19:09   ` Christoph Hellwig
2002-05-03 19:17     ` Richard B. Johnson
2002-05-03 19:24       ` Christoph Hellwig
2002-05-03 19:38   ` Matti Aarnio
2002-05-03 19:50   ` Tony Luck
2002-05-03 20:22   ` Jeff Dike
2002-05-03 19:30     ` Richard B. Johnson
2002-05-03 22:35       ` Martin J. Bligh
2002-05-05  0:49         ` Denis Vlasenko
2002-05-05 17:59           ` Martin J. Bligh