* Hugetlbpages in very large memory machines.......
From: Ray Bryant @ 2004-03-13  3:44 UTC
  To: lse-tech, linux-ia64, linux-kernel

We've run into a scaling problem using hugetlbpages in very large memory machines, e.g. machines 
with 1 TB or more of main memory.  The problem is that hugetlb pages are not faulted in; rather, 
they are zeroed and mapped in by hugetlb_prefault() (at least on ia64), which is called in 
response to the user's mmap() request.  The net result is that all of the hugetlb pages end up being 
allocated and zeroed by a single thread, and if most of the machine's memory is allocated to hugetlb 
pages, and there is 1 TB or more of main memory, zeroing and allocating all of those pages can take 
a long time (500 s or more).

We've looked at allocating and zeroing hugetlbpages at fault time, which would at least allow 
multiple processors to be thrown at the problem.  Question is, has anyone else been working on
this problem and might they have prototype code they could share with us?
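
For concreteness, a minimal sketch of the usage pattern in question (the
hugetlbfs mount point and mapping size below are illustrative only):

	#include <fcntl.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* Illustrative path and size; not our actual test setup. */
		int fd = open("/mnt/hugetlbfs/seg0", O_CREAT | O_RDWR, 0600);

		/* Under the current prefault scheme this single mmap() call
		 * allocates and zeroes every huge page in the mapping, in
		 * one thread, before it returns. */
		void *p = mmap(NULL, 512UL << 30, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		return p == MAP_FAILED;
	}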

Thanks,
-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------



* Re: Hugetlbpages in very large memory machines.......
From: Andi Kleen @ 2004-03-13  3:48 UTC
  To: Ray Bryant; +Cc: lse-tech, linux-ia64, linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlbpages in very large memory 
> machines, e.g. machines with 1 TB or more of main memory.  The problem is 
> that hugetlb pages are not faulted in; rather, they are zeroed and 
> mapped in by hugetlb_prefault() (at least on ia64), which is called in 
> response to the user's mmap() request.  The net result is that all of the 
> hugetlb pages end up being allocated and zeroed by a single thread, and if 
> most of the machine's memory is allocated to hugetlb pages, and there is 
> 1 TB or more of main memory, zeroing and allocating all of those pages can 
> take a long time (500 s or more).
> 
> We've looked at allocating and zeroing hugetlbpages at fault time, which 
> would at least allow multiple processors to be thrown at the problem.  
> Question is, has anyone else been working on
> this problem and might they have prototype code they could share with us?

Yes. I ran into exactly this problem with the NUMA API too:
mbind() runs after mmap(), but it can no longer work when
the pages are already allocated.

I fixed it on x86-64/i386 by allocating the pages lazily.
Doing it for IA64 has been on the todo list too.

i386/x86-64 code is attached as an example.

One drawback is that the out-of-memory handling is a lot less nice
than it was before - when you run out of hugepages you now get SIGBUS
instead of ENOMEM from mmap(). Maybe some prereservation would
make sense, but that would be somewhat harder. Alternatively,
fall back to smaller pages if possible (I was told it isn't easily
possible on IA64)

-Andi


diff -burpN -X ../KDIFX linux-2.6.2/arch/i386/mm/hugetlbpage.c linux-2.6.2-numa/arch/i386/mm/hugetlbpage.c
--- linux-2.6.2/arch/i386/mm/hugetlbpage.c	2004-02-24 20:48:10.000000000 +0100
+++ linux-2.6.2-numa/arch/i386/mm/hugetlbpage.c	2004-02-20 18:52:57.000000000 +0100
@@ -329,41 +333,43 @@ zap_hugepage_range(struct vm_area_struct
 	spin_unlock(&mm->page_table_lock);
 }
 
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+/* page_table_lock held on entry. */
+static int 
+hugetlb_alloc_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+			       unsigned long addr, int write_access)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
+	int ret;
+	pte_t *pte;
+	struct page *page = NULL;
+	struct address_space *mapping = vma->vm_file->f_mapping;
 
+	pte = huge_pte_alloc(mm, addr); 
 		if (!pte) {
-			ret = -ENOMEM;
+		ret = VM_FAULT_OOM;
 			goto out;
 		}
-		if (!pte_none(*pte))
-			continue;
 
 		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
 			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
 		page = find_get_page(mapping, idx);
 		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
+		/* Should do this at prefault time, but that gets us into
+		   trouble with freeing right now. */
+		ret = hugetlb_get_quota(mapping);
+		if (ret) {
+			ret = VM_FAULT_OOM;
 				goto out;
 			}
-			page = alloc_hugetlb_page();
+
+		page = alloc_hugetlb_page(vma);
 			if (!page) {
 				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
+			
+			/* Instead of OOMing here could just transparently use
+			   small pages. */
+			
+			ret = VM_FAULT_OOM;
 				goto out;
 			}
 			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
@@ -371,23 +377,62 @@ int hugetlb_prefault(struct address_spac
 			if (ret) {
 				hugetlb_put_quota(mapping);
 				free_huge_page(page);
+			ret = VM_FAULT_SIGBUS;
 				goto out;
 			}
-		}
+		ret = VM_FAULT_MAJOR; 
+	} else
+		ret = VM_FAULT_MINOR;
+		
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
+	/* Don't need to flush other CPUs. They will just do a page
+	   fault and flush it lazily. */
+	__flush_tlb_one(addr);
+	
+ out:
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 }
 
+int arch_hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+		       unsigned long address, int write_access)
+{ 
+	pmd_t *pmd;
+	pgd_t *pgd;
+
+	if (write_access && !(vma->vm_flags & VM_WRITE))
+		return VM_FAULT_SIGBUS;
+
+	spin_lock(&mm->page_table_lock);	
+	pgd = pgd_offset(mm, address); 
+	if (pgd_none(*pgd)) 
+		return hugetlb_alloc_fault(mm, vma, address, write_access); 
+
+	pmd = pmd_offset(pgd, address);
+	if (pmd_none(*pmd))
+		return hugetlb_alloc_fault(mm, vma, address, write_access); 
+
+	BUG_ON(!pmd_large(*pmd)); 
+
+	/* must have been a race. Flush the TLB. NX not supported yet. */ 
+
+	__flush_tlb_one(address); 
+	spin_unlock(&mm->page_table_lock);	
+	return VM_FAULT_MINOR;
+} 
+
+int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+{
+	return 0;
+}
+
 static void update_and_free_page(struct page *page)
 {
 	int j;
 	struct page *map;
 
 	map = page;
-	htlbzone_pages--;
+	htlbzone_pages--;
 	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
 		map->flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
diff -burpN -X ../KDIFX linux-2.6.2/mm/memory.c linux-2.6.2-numa/mm/memory.c
--- linux-2.6.2/mm/memory.c	2004-02-20 18:31:32.000000000 +0100
+++ linux-2.6.2-numa/mm/memory.c	2004-02-18 20:08:40.000000000 +0100
@@ -1576,6 +1593,15 @@ static inline int handle_pte_fault(struc
 	return VM_FAULT_MINOR;
 }
 
+
+/* Can be overwritten by the architecture */
+int __attribute__((weak)) arch_hugetlb_fault(struct mm_struct *mm, 
+					     struct vm_area_struct *vma, 
+					     unsigned long address, int write_access)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  */
@@ -1591,7 +1617,7 @@ int handle_mm_fault(struct mm_struct *mm
 	inc_page_state(pgfault);
 
 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return arch_hugetlb_fault(mm, vma, address, write_access);
 
 	/*
 	 * We need the page table lock to synchronize with kswapd


* Re: Hugetlbpages in very large memory machines.......
From: William Lee Irwin III @ 2004-03-13  3:55 UTC
  To: Ray Bryant; +Cc: lse-tech, linux-ia64, linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlbpages in very large memory 
> machines, e.g. machines with 1 TB or more of main memory.  The problem is 
> that hugetlb pages are not faulted in; rather, they are zeroed and 
> mapped in by hugetlb_prefault() (at least on ia64), which is called in 
> response to the user's mmap() request.  The net result is that all of the 
> hugetlb pages end up being allocated and zeroed by a single thread, and if 
> most of the machine's memory is allocated to hugetlb pages, and there is 
> 1 TB or more of main memory, zeroing and allocating all of those pages can 
> take a long time (500 s or more).
> We've looked at allocating and zeroing hugetlbpages at fault time, which 
> would at least allow multiple processors to be thrown at the problem.  
> Question is, has anyone else been working on
> this problem and might they have prototype code they could share with us?

This actually is largely a question of architecture-dependent code, so
the answer will depend on whether your architecture matches those of the
others who have had a need to arrange this.

Basically, all you really need to do is to check the vma and call either
a hugetlb-specific fault handler or handle_mm_fault() depending on whether
hugetlb is configured. Once you've gotten that far, it's only a question
of implementing the methods to work together properly when driven by
upper layers.
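
In outline, that dispatch looks something like the following sketch
(arch_hugetlb_fault() follows the naming of Andi's patch earlier in this
thread; it is not a mainline hook):

	/* In the fault path: route faults on hugetlb vmas to their own
	 * handler instead of the normal pte path. */
	if (is_vm_hugetlb_page(vma))
		return arch_hugetlb_fault(mm, vma, address, write_access);
	return handle_mm_fault(mm, vma, address, write_access);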

The reason why this wasn't done up-front was that there wasn't a
demonstrable need to do so. The issue you're citing is exactly the kind
of demonstration needed to motivate its inclusion.


-- wli


* Re: Hugetlbpages in very large memory machines.......
From: Hirokazu Takahashi @ 2004-03-13  4:56 UTC
  To: raybry; +Cc: lse-tech, linux-ia64, linux-kernel, n-yoshida

Hello,

The following patch might help you. It includes a page-fault routine
for hugetlbpages. If you want to use it for your purpose, you need to
remove some code from hugetlb_prefault() that calls hugetlb_fault().
http://people.valinux.co.jp/~taka/patches/va01-hugepagefault.patch

But it's just for IA32.

I heard that n-yoshida@pst.fujitsu.com was porting this patch
to IA64.

> We've run into a scaling problem using hugetlbpages in very large memory machines, e.g. machines 
> with 1 TB or more of main memory.  The problem is that hugetlb pages are not faulted in; rather, 
> they are zeroed and mapped in by hugetlb_prefault() (at least on ia64), which is called in 
> response to the user's mmap() request.  The net result is that all of the hugetlb pages end up being 
> allocated and zeroed by a single thread, and if most of the machine's memory is allocated to hugetlb 
> pages, and there is 1 TB or more of main memory, zeroing and allocating all of those pages can take 
> a long time (500 s or more).
> 
> We've looked at allocating and zeroing hugetlbpages at fault time, which would at least allow 
> multiple processors to be thrown at the problem.  Question is, has anyone else been working on
> this problem and might they have prototype code they could share with us?
> 
> Thanks,
> -- 
> Best Regards,
> Ray


Thank you,
Hirokazu Takahashi.


* Re: Hugetlbpages in very large memory machines.......
From: William Lee Irwin III @ 2004-03-13  5:49 UTC
  To: Andi Kleen; +Cc: Ray Bryant, lse-tech, linux-ia64, linux-kernel

On Sat, Mar 13, 2004 at 04:48:40AM +0100, Andi Kleen wrote:
> One drawback is that the out-of-memory handling is a lot less nice
> than it was before - when you run out of hugepages you now get SIGBUS
> instead of ENOMEM from mmap(). Maybe some prereservation would
> make sense, but that would be somewhat harder. Alternatively,
> fall back to smaller pages if possible (I was told it isn't easily
> possible on IA64)

That's not entirely true. Whether it's feasible depends on how the
MMU is used. The HPW (Hardware Pagetable Walker) and short mode of the
VHPT insist upon pagesize being a per-region attribute, where regions
are something like 60-bit areas of virtual space, which is likely what
they're referring to. The VHPT in long mode should be capable of
arbitrary virtual placement (modulo alignment of course).


-- wli


* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Andi Kleen @ 2004-03-13 16:10 UTC
  To: William Lee Irwin III, Andi Kleen, Ray Bryant, lse-tech,
	linux-ia64, linux-kernel

> > fall back to smaller pages if possible (I was told it isn't easily
> > possible on IA64)
> 
> That's not entirely true. Whether it's feasible depends on how the
> MMU is used. The HPW (Hardware Pagetable Walker) and short mode of the
> VHPT insist upon pagesize being a per-region attribute, where regions
> are something like 60-bit areas of virtual space, which is likely what
> they're referring to. The VHPT in long mode should be capable of
> arbitrary virtual placement (modulo alignment of course).

Redesigning the low level TLB fault handling for this would not count as
"easily" in my book.

-Andi


* Re[2]: Hugetlbpages in very large memory machines.......
From: Luis Mirabal @ 2004-03-13 16:32 UTC
  To: William Lee Irwin III; +Cc: linux-kernel

oops... wrong list, wrong message.. :S sorry

luis



* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: William Lee Irwin III @ 2004-03-14  0:05 UTC
  To: Andi Kleen; +Cc: Ray Bryant, lse-tech, linux-ia64, linux-kernel

At some point in the past, I wrote:
>> That's not entirely true. Whether it's feasible depends on how the
>> MMU is used. The HPW (Hardware Pagetable Walker) and short mode of the
>> VHPT insist upon pagesize being a per-region attribute, where regions
>> are something like 60-bit areas of virtual space, which is likely what
>> they're referring to. The VHPT in long mode should be capable of
>> arbitrary virtual placement (modulo alignment of course).

On Sat, Mar 13, 2004 at 05:10:10PM +0100, Andi Kleen wrote:
> Redesigning the low level TLB fault handling for this would not count as
> "easily" in my book.

I make no estimate of ease of implementation of long mode VHPT support.
The point of the above is that the virtual placement constraint is an
artifact of the implementation and not inherent in hardware.


-- wli


* Re: Hugetlbpages in very large memory machines.......
From: Andrew Morton @ 2004-03-14  2:45 UTC
  To: Andi Kleen; +Cc: raybry, lse-tech, linux-ia64, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> > We've looked at allocating and zeroing hugetlbpages at fault time, which 
>  > would at least allow multiple processors to be thrown at the problem.  
>  > Question is, has anyone else been working on
>  > this problem and might they have prototype code they could share with us?
> 
>  Yes. I ran into exactly this problem with the NUMA API too:
>  mbind() runs after mmap(), but it can no longer work when
>  the pages are already allocated.
> 
>  I fixed it on x86-64/i386 by allocating the pages lazily.
>  Doing it for IA64 has been on the todo list too.
> 
>  i386/x86-64 code is attached as an example.
> 
>  One drawback is that the out-of-memory handling is a lot less nice
>  than it was before - when you run out of hugepages you now get SIGBUS
>  instead of ENOMEM from mmap(). Maybe some prereservation would
>  make sense, but that would be somewhat harder. Alternatively,
>  fall back to smaller pages if possible (I was told it isn't easily
>  possible on IA64)

Demand-paging the hugepages is a decent feature to have, and ISTR resisting
it before for this reason.

Even though it's early in the 2.6 series I'd be a bit worried about
breaking existing hugetlb users in this way.  Yes, the pages are
preallocated so it is unlikely that a working setup is suddenly going to
break.  Unless someone is using the return value from mmap to find out how
many pages they can get.

So ho-hum.  I think it needs to be back-compatible.  Could we add
MAP_NO_PREFAULT?
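
For illustration, the opt-out could sit in hugetlbfs's mmap handler,
something like the sketch below - MAP_NO_PREFAULT and VM_NO_PREFAULT are
hypothetical names, and the flag would still need plumbing from
do_mmap_pgoff() into vma->vm_flags:

	/* Hypothetical opt-out: keep today's prefault behaviour unless
	 * the caller explicitly asked for demand faulting. */
	if (vma->vm_flags & VM_NO_PREFAULT)
		ret = 0;		/* pages will be faulted in lazily */
	else
		ret = hugetlb_prefault(mapping, vma);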



* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Anton Blanchard @ 2004-03-14  4:06 UTC
  To: Andrew Morton; +Cc: Andi Kleen, raybry, lse-tech, linux-ia64, linux-kernel


> Demand-paging the hugepages is a decent feature to have, and ISTR resisting
> it before for this reason.
> 
> Even though it's early in the 2.6 series I'd be a bit worried about
> breaking existing hugetlb users in this way.  Yes, the pages are
> preallocated so it is unlikely that a working setup is suddenly going to
> break.  Unless someone is using the return value from mmap to find out how
> many pages they can get.

Hmm what a coincidence, I was chasing a problem where large page
allocations would fail even though I clearly had enough large page memory
free.

It turns out we were tripping the overcommit logic in do_mmap. I had
30GB of large pages and 2GB of small pages, and of course cap_vm_enough_memory
was looking at the small page pool. Setting overcommit to 1 fixed it.

It seems we can solve both problems by having a separate hugetlb overcommit
policy. Make it strict and you won't have OOM problems on large pages
and I won't hit my 30GB / 2GB problem.

Anton


* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Peter Chubb @ 2004-03-14  5:22 UTC
  To: William Lee Irwin III
  Cc: Andi Kleen, Ray Bryant, lse-tech, linux-ia64, linux-kernel

>>>>> "William" == William Lee Irwin, <William> writes:

William> At some point in the past, I wrote:

William> On Sat, Mar 13, 2004 at 05:10:10PM +0100, Andi Kleen wrote:
>> Redesigning the low level TLB fault handling for this would not
>> count as "easily" in my book.

William> I make no estimate of ease of implementation of long mode
William> VHPT support.  The point of the above is that the virtual
William> placement constraint is an artifact of the implementation and
William> not inherent in hardware.

There's a patch available to enable long-format VHPT at
www.gelato.unsw.edu.au

We're waiting for 2.7 to open before pushing it in. The long-format
VHPT is a prerequisite for other work we're doing on superpages and
TLB sharing.

--
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*


* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Ray Bryant @ 2004-03-14  8:38 UTC
  To: Andrew Morton; +Cc: Andi Kleen, lse-tech, linux-ia64, linux-kernel



Andrew Morton wrote:

>>
>> One drawback is that the out-of-memory handling is a lot less nice
>> than it was before - when you run out of hugepages you now get SIGBUS
>> instead of ENOMEM from mmap(). Maybe some prereservation would
>> make sense, but that would be somewhat harder. Alternatively,
>> fall back to smaller pages if possible (I was told it isn't easily
>> possible on IA64)
> 
> 
> Demand-paging the hugepages is a decent feature to have, and ISTR resisting
> it before for this reason.
> 
> Even though it's early in the 2.6 series I'd be a bit worried about
> breaking existing hugetlb users in this way.  Yes, the pages are
> preallocated so it is unlikely that a working setup is suddenly going to
> break.  Unless someone is using the return value from mmap to find out how
> many pages they can get.
> 
> So ho-hum.  I think it needs to be back-compatible.  Could we add
> MAP_NO_PREFAULT?
> 
> 
> 

I agree with the compatibility concern, but the other part of the problem
is that while hugetlb_prefault() is running, it holds both the mm->mmap_sem in
write mode and the mm->page_table_lock.  So not only does it take 500 s for
the mmap() to return on our test system, but ps, top, etc. all freeze for the
duration.  Very irritating, especially on a 64 or 128 P system.

My preference would be to do away with hugetlb_prefault() altogether.
(If there were a MAP_NO_PREFAULT, we would have to make it the default on
Altix to avoid the freeze problem mentioned above.  Can't have an arbitrary
user locking up the system.)  As Andi pointed out, perhaps we can do some
prereservation of huge pages so that we can return ENOMEM from the mmap()
if there are not enough huge pages to (lazily) be allocated to satisfy the
request, but then still allocate the pages at fault time.  A simple count
would suffice.
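
Something along these lines, say (a sketch only: htlbpage_lock and
htlbpagemem follow the i386 naming, while the reserved counter and both
helpers are invented for illustration):

	static long htlbpage_reserved;	/* pages promised but not yet faulted */

	/* Called at mmap() time: fail with ENOMEM up front if the free
	 * huge page pool cannot cover all outstanding promises. */
	int hugetlb_reserve(unsigned long npages)
	{
		int ret = -ENOMEM;

		spin_lock(&htlbpage_lock);
		if (htlbpage_reserved + npages <= htlbpagemem) {
			htlbpage_reserved += npages;	/* promise only */
			ret = 0;
		}
		spin_unlock(&htlbpage_lock);
		return ret;
	}

	/* Called at munmap() time (or as faults fill pages) to balance
	 * the books. */
	void hugetlb_unreserve(unsigned long npages)
	{
		spin_lock(&htlbpage_lock);
		htlbpage_reserved -= npages;
		spin_unlock(&htlbpage_lock);
	}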

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------



* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: William Lee Irwin III @ 2004-03-14  8:48 UTC
  To: Ray Bryant; +Cc: Andrew Morton, Andi Kleen, lse-tech, linux-ia64, linux-kernel

On Sun, Mar 14, 2004 at 02:38:33AM -0600, Ray Bryant wrote:
> write mode and the mm->page_table_lock.  So not only does it take 500 s for
> the mmap() to return on our test system, but ps, top, etc. all freeze for the
> duration.  Very irritating, especially on a 64 or 128 P system.
> My preference would be to do away with hugetlb_prefault() altogether.
> (If there were a MAP_NO_PREFAULT, we would have to make it the default on
> Altix to avoid the freeze problem mentioned above.  Can't have an arbitrary
> user locking up the system.)  As Andi pointed out, perhaps we can do some
> prereservation of huge pages so that we can return ENOMEM from the mmap()
> if there are not enough huge pages to (lazily) be allocated to satisfy the
> request, but then still allocate the pages at fault time.  A simple count
> would suffice.

There is a patch, originally by Ben LaHaise, which I forward-ported to
2.6.0-test*, that keeps statistics ready in the mm so that the mmap_sem
need not be taken for /proc/ and renders proc_pid_statm() nothing more
than copying integers out of the mm.  It may be of interest to those
concerned about tripping over other processes' mmap_sems in /proc/.
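
Roughly, the idea is (the field names here are assumptions for
illustration, not the actual patch):

	/* Keep the statm numbers as plain counters in mm_struct, updated
	 * wherever mappings change, so a /proc/<pid>/statm read is just
	 * copying integers - no vma walk, no mmap_sem. */
	int proc_pid_statm(struct mm_struct *mm, char *buffer)
	{
		return sprintf(buffer, "%lu %lu %lu %lu %lu %lu %lu\n",
			       mm->stat_total, mm->stat_resident,
			       mm->stat_shared, mm->stat_text,
			       mm->stat_lib, mm->stat_data, mm->stat_dirty);
	}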


-- wli


* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Andrew Morton @ 2004-03-14  8:57 UTC
  To: Ray Bryant; +Cc: ak, lse-tech, linux-ia64, linux-kernel

Ray Bryant <raybry@sgi.com> wrote:
>
> 
> I agree with the compatibility concern, but the other part of the problem
> is that while hugetlb_prefault() is running, it holds both the mm->mmap_sem in
> write mode and the mm->page_table_lock.  So not only does it take 500 s for
> the mmap() to return on our test system, but ps, top, etc. all freeze for the
> duration.  Very irritating, especially on a 64 or 128 P system.

Well that's just a dumb implementation.  hugetlb_prefault() doesn't need
page_table_lock while it is zeroing the page: just drop it, test for
-EEXIST returned from add_to_page_cache().

In fact we need to do that anyway: the current code is buggy if some other
process with a different mm gets in there and instantiates the page in the
pagecache before this process does: hugetlb_prefault() will return -EEXIST
instead of simply accepting the race and using the page which someone else
put there.

After we have the page in pagecache we need to retake page_table_lock and
check that the target pte is still pte_none().  If it is not, you know that
some other thread has already instantiated a pte there so the new ref to
the pagecache page can simply be dropped.  See how do_no_page() handles it.
Of course, this only applies if mmap_sem is no longer held in there.
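
In outline, the fixed path would look something like this sketch (modeled
on do_no_page(); helper names follow the i386 code elsewhere in this
thread, and reference handling is schematic):

	/* Allocate and zero without page_table_lock held. */
	spin_unlock(&mm->page_table_lock);
	page = find_get_page(mapping, idx);
	if (!page) {
		page = alloc_hugetlb_page();
		if (!page)
			return VM_FAULT_OOM;
		ret = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
		if (ret == -EEXIST) {
			/* Lost the pagecache race: use the winner's page. */
			free_huge_page(page);
			page = find_get_page(mapping, idx);
		} else if (ret)
			return VM_FAULT_OOM;
	}
	/* Retake the lock and recheck, as do_no_page() does. */
	spin_lock(&mm->page_table_lock);
	if (!pte_none(*pte))
		put_page(page);	/* another thread instantiated the pte */
	else
		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
	spin_unlock(&mm->page_table_lock);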

As for holding mmap_sem for too long, well, that can presumably be worked
around by not mmapping the whole lot in one hit?



* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Andrew Morton @ 2004-03-14  9:02 UTC
  To: raybry, ak, lse-tech, linux-ia64, linux-kernel

Andrew Morton <akpm@osdl.org> wrote:
>
> Well that's just a dumb implementation.  hugetlb_prefault() doesn't need
>  page_table_lock while it is zeroing the page: just drop it, test for
>  -EEXIST returned from add_to_page_cache().
> 
>  In fact we need to do that anyway: the current code is buggy if some other
>  process with a different mm gets in there and instantiates the page in the
>  pagecache before this process does: hugetlb_prefault() will return -EEXIST
>  instead of simply accepting the race and using the page which someone else
>  put there.
> 
>  After we have the page in pagecache we need to retake page_table_lock and
>  check that the target pte is still pte_none().  If it is not, you know that
>  some other thread has already instantiated a pte there so the new ref to
>  the pagecache page can simply be dropped.  See how do_no_page() handles it.
>  Of course, this only applies if mmap_sem is no longer held in there.

But before implementing any of this we should move hugetlb_prefault() and
any other generic-looking functions into mm/hugetlbpage.c.  We're getting
too much duplication in there.



* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: William Lee Irwin III @ 2004-03-14  9:07 UTC
  To: Andrew Morton; +Cc: Ray Bryant, ak, lse-tech, linux-ia64, linux-kernel

On Sun, Mar 14, 2004 at 12:57:37AM -0800, Andrew Morton wrote:
> Well that's just a dumb implementation.  hugetlb_prefault() doesn't need
> page_table_lock while it is zeroing the page: just drop it, test for
> -EEXIST returned from add_to_page_cache().
> In fact we need to do that anyway: the current code is buggy if some other
> process with a different mm gets in there and instantiates the page in the
> pagecache before this process does: hugetlb_prefault() will return -EEXIST
> instead of simply accepting the race and using the page which someone else
> put there.

Don't blame me. I didn't write the expand-on-mmap() code.


-- wli


* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Ray Bryant @ 2004-03-15  6:45 UTC
  To: Andrew Morton; +Cc: ak, lse-tech, linux-ia64, linux-kernel



Andrew Morton wrote:
<unrelated text snipped>
> 
> As for holding mmap_sem for too long, well, that can presumably be worked
> around by not mmapping the whole lot in one hit?
> 

There are a number of places where one could do this (explicitly in user code,
hidden at library level, or in do_mmap2() where the mm->mmap_sem is taken).
I'm not happy with requiring the user to make a modification to solve this
kernel problem.  Hiding the split has the problem of making sure that if any
of the sub mmap() operations fail then the rest of the mmap() operations have
to be undone, and this all has to happen in a way that makes the mmap() look
like a single system call.

An alternative would be put some info in the mm_struct indicating that a
hugetlb_prefault() is in progress, then drop the mm->mmap_sem while
hugetlb_prefault() is running.  Once it is done, regrab the mm->mmap_sem,
clear the "in progress flag" and finish up processing.  Any other mmap()
that got the mmap_sem and found the "in progress flag" set would have to
fail, perhaps with -EAGAIN (again, an mmap() extension).  One can also
implement more elaborate schemes where there is a list of pending hugetlb
mmaps() with the associated address space ranges being listed; one could
check this list in get_unmapped_area() and return -EAGAIN if there is
a conflict.
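
Roughly (the hugetlb_prefaulting field is a hypothetical addition to
mm_struct, not existing code):

	down_write(&mm->mmap_sem);
	if (mm->hugetlb_prefaulting) {
		up_write(&mm->mmap_sem);
		return -EAGAIN;		/* another hugetlb mmap() pending */
	}
	mm->hugetlb_prefaulting = 1;
	up_write(&mm->mmap_sem);

	error = hugetlb_prefault(mapping, vma);	/* runs without mmap_sem */

	down_write(&mm->mmap_sem);
	mm->hugetlb_prefaulting = 0;
	/* ... finish the mmap() processing as if nothing were dropped ... */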

I'd still rather see us do the "allocate on fault" approach with prereservation
to maintain the current ENOMEM return code from mmap() for hugepages.  Let me
work on that and get back to y'all with a patch and see where we can go from
there.  I'll start by taking a look at all of the arch dependent hugetlbpage.c's
and see how common they all are and move the common code up to mm/hugetlbpage.c.
(or did WLI's note imply that this is impossible?)

However, is this set of changes something that would still be accepted in 2.6,
or is this now a 2.7 discussion?

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------



* Re: Hugetlbpages in very large memory machines.......
From: jlnance @ 2004-03-15 15:28 UTC
  To: linux-kernel

On Fri, Mar 12, 2004 at 09:44:03PM -0600, Ray Bryant wrote:
> We've run into a scaling problem using hugetlbpages in very large memory 
> machines, e.g. machines with 1 TB or more of main memory.

You know, when I started using Linux it wouldn't support more than 16M
of RAM.  No one complained because no one using Linux had a machine with
more than 16M of RAM.  It looks like things have progressed a bit since
then :-)

Jim


* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: William Lee Irwin III @ 2004-03-15 23:54 UTC
  To: Ray Bryant; +Cc: Andrew Morton, ak, lse-tech, linux-ia64, linux-kernel

On Mon, Mar 15, 2004 at 12:45:10AM -0600, Ray Bryant wrote:
> I'd still rather see us do the "allocate on fault" approach with 
> prereservation to maintain the current ENOMEM return code from mmap()
> for hugepages. Let me work on that and get back to y'all with a patch
> and see where we can go from there.  I'll start by taking a look at
> all of the arch dependent hugetlbpage.c's and see how common they all
> are and move the common code up to mm/hugetlbpage.c.
> (or did WLI's note imply that this is impossible?)

It would be a mistake to put any pagetable handling functions in the
core. Things above that level, e.g. callers that don't examine the
pagetables directly in favor of calling lower-level API's, are fine.


-- wli


* Re: Hugetlbpages in very large memory machines.......
From: Nobuhiko Yoshida @ 2004-03-16  0:30 UTC
  To: raybry, linux-kernel; +Cc: lse-tech, linux-ia64, Hirokazu Takahashi

Hello,

Hirokazu Takahashi <taka@valinux.co.jp> :
> Hello,
> 
> The following patch might help you. It includes a page-fault routine
> for hugetlbpages. If you want to use it for your purpose, you need to
> remove some code from hugetlb_prefault() that calls hugetlb_fault().
> http://people.valinux.co.jp/~taka/patches/va01-hugepagefault.patch
> 
> But it's just for IA32.
> 
> I heard that n-yoshida@pst.fujitsu.com was porting this patch
> to IA64.

Below is my port of Takahashi-san's patch to IA64.
However, my patch is for kernel 2.6.0 and cannot be
applied to 2.6.1 or later.

Thank you,
Nobuhiko Yoshida

diff -dupr linux-2.6.0.org/arch/ia64/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.0.org/arch/ia64/mm/hugetlbpage.c  2003-12-18 11:58:56.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c  2004-01-06 14:26:53.000000000 +0900
@@ -170,8 +170,10 @@ int copy_hugetlb_page_range(struct mm_st
            goto nomem;
        src_pte = huge_pte_offset(src, addr);
        entry = *src_pte;
-       ptepage = pte_page(entry);
-       get_page(ptepage);
+       if (!pte_none(entry)) {
+           ptepage = pte_page(entry);
+           get_page(ptepage);
+       }   
        set_pte(dst_pte, entry);
        dst->rss += (HPAGE_SIZE / PAGE_SIZE);
        addr += HPAGE_SIZE;
@@ -195,6 +197,12 @@ follow_hugetlb_page(struct mm_struct *mm
    do {
        pstart = start & HPAGE_MASK;
        ptep = huge_pte_offset(mm, start);
+
+       if (!ptep || pte_none(*ptep)) {
+           hugetlb_fault(mm, vma, 0, start);
+           ptep = huge_pte_offset(mm, start);
+       }
+
        pte = *ptep;
 
 back1:
@@ -236,6 +244,12 @@ struct page *follow_huge_addr(struct mm_
    pte_t *ptep;
 
    ptep = huge_pte_offset(mm, addr);
+
+   if (!ptep || pte_none(*ptep)) {
+       hugetlb_fault(mm, vma, 0, addr);
+       ptep = huge_pte_offset(mm, addr);
+   }
+
    page = pte_page(*ptep);
    page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
    get_page(page);
@@ -246,7 +260,8 @@ int pmd_huge(pmd_t pmd)
    return 0;
 }
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+       unsigned long address, pmd_t *pmd, int write)
 {
    return NULL;
 }
@@ -518,6 +533,48 @@ int is_hugepage_mem_enough(size_t size)
    return 1;
 }
 
+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+   struct file *file = vma->vm_file;
+   struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+   struct page *page;
+   unsigned long idx;
+   pte_t *pte;
+   int ret = VM_FAULT_MINOR;
+
+   BUG_ON(vma->vm_start & ~HPAGE_MASK);
+   BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+   spin_lock(&mm->page_table_lock);
+
+   idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+       + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+   page = find_get_page(mapping, idx);
+
+   if (!page) {
+       page = alloc_hugetlb_page();
+       if (!page) {
+           ret = VM_FAULT_SIGBUS;
+           goto out;
+       }
+       ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+       unlock_page(page);
+       if (ret) {
+           free_huge_page(page);
+           ret = VM_FAULT_SIGBUS;
+           goto out;
+       }
+   }
+   pte = huge_pte_alloc(mm, address);
+   set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+/*      update_mmu_cache(vma, address, *pte); */
+out:
+   spin_unlock(&mm->page_table_lock);
+   return ret;
+}
+
+
 static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
 {
    BUG();


* Re: Hugetlbpages in very large memory machines.......
From: Andi Kleen @ 2004-03-16  1:54 UTC
  To: Nobuhiko Yoshida
  Cc: raybry, linux-kernel, lse-tech, linux-ia64, Hirokazu Takahashi

> +   pte = huge_pte_alloc(mm, address);
> +   set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);

This looks broken. Another CPU could have raced to the same fault
and already added a PTE here. You have to handle that.

(my i386 version originally had the same problem)


> +/*      update_mmu_cache(vma, address, *pte); */

I have not studied low level IA64 VM in detail, but don't you need
some kind of TLB flush here?

-Andi


* Re: Hugetlbpages in very large memory machines.......
From: Hirokazu Takahashi @ 2004-03-16  2:32 UTC
  To: ak; +Cc: n-yoshida, raybry, linux-kernel, lse-tech, linux-ia64

Hello,

> > +   pte = huge_pte_alloc(mm, address);
> > +   set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
> 
> This looks broken. Another CPU could have raced to the same fault
> and already added a PTE here. You have to handle that.
> 
> (my i386 version originally had the same problem)

Yes, you are right.
In the fault handler, we should use find_lock_page() instead of
find_get_page() to find a hugepage associated with the fault address.
After that, pte_none(*pte) should be called again to check whether
a race has happened.

> > +/*      update_mmu_cache(vma, address, *pte); */
> 
> I have not studied low level IA64 VM in detail, but don't you need
> some kind of TLB flush here?
> 
> -Andi


Thank you,
Hirokazu Takahashi.


* Re: Hugetlbpages in very large memory machines.......
From: Nobuhiko Yoshida @ 2004-03-16  3:15 UTC
  To: Andi Kleen, linux-kernel; +Cc: raybry, lse-tech, linux-ia64, Hirokazu Takahashi

Hello,

> > +/*      update_mmu_cache(vma, address, *pte); */
> 
> I have not studied low level IA64 VM in detail, but don't you need
> some kind of TLB flush here?

Oh! Yes.
Perhaps a TLB flush is needed here.

Thank you,
Nobuhiko Yoshida


* Re: Hugetlbpages in very large memory machines.......
From: Hirokazu Takahashi @ 2004-03-16  3:20 UTC
  To: ak; +Cc: n-yoshida, raybry, linux-kernel, lse-tech, linux-ia64

Hello,

> Yes, you are right.
> In the fault handler, we should use find_lock_page() instead of
> find_get_page() to find a hugepage associated with the fault address.

Sorry, locking the page is not needed.

> After that, pte_none(*pte) should be called again to check whether
> a race has happened.

While checking, mm->page_table_lock has to be held.
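
Putting the two corrections together, the recheck would look roughly like
this sketch (helper names follow the patches in this thread):

	spin_lock(&mm->page_table_lock);
	pte = huge_pte_alloc(mm, address);
	if (pte && pte_none(*pte)) {
		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
	} else if (page) {
		/* Another CPU won the race: keep its pte, drop the
		 * reference we took on the pagecache page. */
		put_page(page);
	}
	spin_unlock(&mm->page_table_lock);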



* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Andy Whitcroft @ 2004-03-17 19:05 UTC
  To: Anton Blanchard, Andrew Morton
  Cc: Andi Kleen, raybry, lse-tech, linux-ia64, linux-kernel, Martin J. Bligh

--On 14 March 2004 15:06 +1100 Anton Blanchard <anton@samba.org> wrote:

> Hmm what a coincidence, I was chasing a problem where large page
> allocations would fail even though I clearly had enough large page memory
> free.
>
> It turns out we were tripping the overcommit logic in do_mmap. I had
> 30GB of large pages and 2GB of small pages, and of course
> cap_vm_enough_memory was looking at the small page pool. Setting
> overcommit to 1 fixed it.
>
> It seems we can solve both problems by having a separate hugetlb
> overcommit policy. Make it strict and you won't have OOM problems on large
> pages and I won't hit my 30GB / 2GB problem.

Been following this thread and it seems that fixing this overcommit
mishandling problem would logically be the first step.  From my reading
it seems that once we have initialised hugetlb we have two independent and
non-overlapping 'page' pools from which we can allocate pages and against
which we wish to handle commitments.  Looking at the current code base we
effectively have only a single 'accounting domain' and so when we attempt to
allocate hugetlb pages we incorrectly account them against the small page
pool.

I believe we need to add support for more than one page 'accounting domain'
each with its own policy and with its own commitments.  The attached patch
is my attempt at this first step.  I have created the concept of an
accounting domain, against which pages are to be accounted.  In this
implementation there are two domains VM_AD_DEFAULT which is used to account
normal small pages in the normal way and VM_AD_HUGETLB which is used to
select and identify VM_HUGETLB pages.  I have not attempted to add any
actual accounting for VM_HUGETLB pages, as currently they are prefaulted
and thus there is always 0 outstanding commitment to track.  Obviously, if
hugetlb was also changed to support demand paging that would need to be
implemented.

The patch below implements the basic domain split and provides a default
overcommit policy only for VM_AD_HUGETLB.  Anton, with it installed I
believe that you should not need to change the global overcommit policy to
1 to allow 30GB of hugetlb pages to work.  It was made against 2.6.4.  It
contains a couple of comment changes which I intend to split off and submit
separatly (so ignore them).

I have compiled and booted with security on and off, but have not had a 
chance to test the hugetlb side as yet.  What do people think?  The right 
direction?

Cheers.

-apw

diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/mm.h current/include/linux/mm.h
--- reference/include/linux/mm.h	2004-03-11 20:47:28.000000000 +0000
+++ current/include/linux/mm.h	2004-03-17 19:10:23.000000000 +0000
@@ -112,6 +112,11 @@ struct vm_area_struct {
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */

+/* Memory accounting domains.  These may not be consecutive bits. */
+#define VM_ACCTDOM(vma) ((vma)->vm_flags & VM_HUGETLB)
+#define VM_AD_DEFAULT	0x00000000
+#define VM_AD_HUGETLB	VM_HUGETLB
+
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/security.h current/include/linux/security.h
--- reference/include/linux/security.h	2004-03-11 20:47:28.000000000 +0000
+++ current/include/linux/security.h	2004-03-17 19:10:23.000000000 +0000
@@ -51,7 +51,7 @@ extern int cap_inode_removexattr(struct
 extern int cap_task_post_setuid (uid_t old_ruid, uid_t old_euid, uid_t old_suid, int flags);
 extern void cap_task_reparent_to_init (struct task_struct *p);
 extern int cap_syslog (int type);
-extern int cap_vm_enough_memory (long pages);
+extern int cap_vm_enough_acctdom (int domain, long pages);

 static inline int cap_netlink_send (struct sk_buff *skb)
 {
@@ -987,8 +987,9 @@ struct swap_info_struct;
  *	See the syslog(2) manual page for an explanation of the @type values.
  *	@type contains the type of action.
  *	Return 0 if permission is granted.
- * @vm_enough_memory:
- *	Check permissions for allocating a new virtual mapping.
+ * @vm_enough_acctdom:
+ *      Check permissions for allocating a new virtual mapping.
+ *      @domain contains the accounting domain.
  *      @pages contains the number of pages.
  *	Return 0 if permission is granted.
  *
@@ -1022,7 +1023,7 @@ struct security_operations {
 	int (*quotactl) (int cmds, int type, int id, struct super_block * sb);
 	int (*quota_on) (struct file * f);
 	int (*syslog) (int type);
-	int (*vm_enough_memory) (long pages);
+	int (*vm_enough_acctdom) (int domain, long pages);

 	int (*bprm_alloc_security) (struct linux_binprm * bprm);
 	void (*bprm_free_security) (struct linux_binprm * bprm);
@@ -1276,9 +1277,9 @@ static inline int security_syslog(int ty
 	return security_ops->syslog(type);
 }

-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_acctdom(int domain, long pages)
 {
-	return security_ops->vm_enough_memory(pages);
+	return security_ops->vm_enough_acctdom(domain, pages);
 }

 static inline int security_bprm_alloc (struct linux_binprm *bprm)
@@ -1947,9 +1948,9 @@ static inline int security_syslog(int ty
 	return cap_syslog(type);
 }

-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_acctdom(int domain, long pages)
 {
-	return cap_vm_enough_memory(pages);
+	return cap_vm_enough_acctdom(domain, pages);
 }

 static inline int security_bprm_alloc (struct linux_binprm *bprm)
@@ -2738,5 +2739,10 @@ static inline void security_sk_free(stru
 }
 #endif	/* CONFIG_SECURITY_NETWORK */

+static inline int security_vm_enough_memory(long pages)
+{
+	return security_vm_enough_acctdom(VM_AD_DEFAULT, pages);
+}
+
 #endif /* ! __LINUX_SECURITY_H */

diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c	2004-03-11 20:47:29.000000000 +0000
+++ current/mm/mmap.c	2004-03-17 19:10:23.000000000 +0000
@@ -473,6 +473,7 @@ unsigned long do_mmap_pgoff(struct file
 	int error;
 	struct rb_node ** rb_link, * rb_parent;
 	unsigned long charged = 0;
+	int acctdom = VM_AD_DEFAULT;

 	if (file) {
 		if (!file->f_op || !file->f_op->mmap)
@@ -591,7 +592,10 @@ munmap_back:
 	    > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;

-	if (!(flags & MAP_NORESERVE) || sysctl_overcommit_memory > 1) {
+	if (is_file_hugepages(file))
+		acctdom = VM_AD_HUGETLB;
+	if (!(flags & MAP_NORESERVE) ||
+	    (acctdom == VM_AD_DEFAULT && sysctl_overcommit_memory > 1)) {
 		if (vm_flags & VM_SHARED) {
 			/* Check memory availability in shmem_file_setup? */
 			vm_flags |= VM_ACCOUNT;
@@ -600,7 +604,7 @@ munmap_back:
 			 * Private writable mapping: check memory availability
 			 */
 			charged = len >> PAGE_SHIFT;
-			if (security_vm_enough_memory(charged))
+			if (security_vm_enough_acctdom(acctdom, charged))
 				return -ENOMEM;
 			vm_flags |= VM_ACCOUNT;
 		}
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/capability.c current/security/capability.c
--- reference/security/capability.c	2004-02-04 15:09:21.000000000 +0000
+++ current/security/capability.c	2004-03-17 19:10:23.000000000 +0000
@@ -47,7 +47,7 @@ static struct security_operations capabi

 	.syslog =                       cap_syslog,

-	.vm_enough_memory =             cap_vm_enough_memory,
+	.vm_enough_acctdom =            cap_vm_enough_acctdom,
 };

 #if defined(CONFIG_SECURITY_CAPABILITIES_MODULE)
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-02-23 18:15:19.000000000 +0000
+++ current/security/commoncap.c	2004-03-17 19:10:23.000000000 +0000
@@ -303,15 +303,21 @@ int cap_syslog (int type)
  * succeed and -ENOMEM implies there is not.
  *
  * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-acounting
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
  */
-int cap_vm_enough_memory(long pages)
+int cap_vm_enough_acctdom(int domain, long pages)
 {
 	unsigned long free, allowed;

+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);

         /*
@@ -382,7 +388,7 @@ EXPORT_SYMBOL(cap_inode_removexattr);
 EXPORT_SYMBOL(cap_task_post_setuid);
 EXPORT_SYMBOL(cap_task_reparent_to_init);
 EXPORT_SYMBOL(cap_syslog);
-EXPORT_SYMBOL(cap_vm_enough_memory);
+EXPORT_SYMBOL(cap_vm_enough_acctdom);

 MODULE_DESCRIPTION("Standard Linux Common Capabilities Security Module");
 MODULE_LICENSE("GPL");
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-11 20:47:31.000000000 +0000
+++ current/security/dummy.c	2004-03-17 19:10:23.000000000 +0000
@@ -101,10 +101,24 @@ static int dummy_syslog (int type)
 	return 0;
 }

-static int dummy_vm_enough_memory(long pages)
+/*
+ * Check that a process has enough memory to allocate a new virtual
+ * mapping. 0 means there is enough memory for the allocation to
+ * succeed and -ENOMEM implies there is not.
+ *
+ * We currently support three overcommit policies, which are set via the
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
+ */
+static int dummy_vm_enough_acctdom(int domain, long pages)
 {
 	unsigned long free, allowed;

+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);

         /*
@@ -873,7 +887,7 @@ void security_fixup_ops (struct security
 	set_to_dummy_if_null(ops, quota_on);
 	set_to_dummy_if_null(ops, sysctl);
 	set_to_dummy_if_null(ops, syslog);
-	set_to_dummy_if_null(ops, vm_enough_memory);
+	set_to_dummy_if_null(ops, vm_enough_acctdom);
 	set_to_dummy_if_null(ops, bprm_alloc_security);
 	set_to_dummy_if_null(ops, bprm_free_security);
 	set_to_dummy_if_null(ops, bprm_compute_creds);
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-11 20:47:31.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-17 19:10:23.000000000 +0000
@@ -1492,17 +1492,23 @@ static int selinux_syslog(int type)
  * succeed and -ENOMEM implies there is not.
  *
  * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-acounting
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
  */
-static int selinux_vm_enough_memory(long pages)
+static int selinux_vm_enough_acctdom(int domain, long pages)
 {
 	unsigned long free, allowed;
 	int rc;
 	struct task_security_struct *tsec = current->security;

+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);

         /*
@@ -3817,7 +3823,7 @@ struct security_operations selinux_ops =
 	.quotactl =			selinux_quotactl,
 	.quota_on =			selinux_quota_on,
 	.syslog =			selinux_syslog,
-	.vm_enough_memory =		selinux_vm_enough_memory,
+	.vm_enough_acctdom =		selinux_vm_enough_acctdom,

 	.netlink_send =			selinux_netlink_send,
         .netlink_recv =			selinux_netlink_recv,




* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
From: Andrew Morton @ 2004-03-18 20:25 UTC
  To: Andy Whitcroft
  Cc: anton, ak, raybry, lse-tech, linux-ia64, linux-kernel, mbligh,
	Stephen Smalley

Andy Whitcroft <apw@shadowen.org> wrote:
>
> --On 14 March 2004 15:06 +1100 Anton Blanchard <anton@samba.org> wrote:
> 
> > Hmm what a coincidence, I was chasing a problem where large page
> > allocations would fail even though I clearly had enough large page memory
> > free.
> >
> > It turns out we were tripping the overcommit logic in do_mmap. I had
> > 30GB of large pages and 2GB of small pages, and of course
> > cap_vm_enough_memory was looking at the small page pool. Setting
> > overcommit to 1 fixed it.
> >
> > It seems we can solve both problems by having a separate hugetlb
> > overcommit policy. Make it strict and you won't have OOM problems on large
> > pages and I won't hit my 30GB / 2GB problem.
> 
> Been following this thread and it seems that fixing this overcommit
> mishandling problem would logically be the first step.  From my reading
> it seems that once we have initialised hugetlb we have two independent and
> non-overlapping 'page' pools from which we can allocate pages and against
> which we wish to handle commitments.  Looking at the current code base we
> effectively have only a single 'accounting domain' and so when we attempt to
> allocate hugetlb pages we incorrectly account them against the small page
> pool.
> 
> I believe we need to add support for more than one page 'accounting domain'
> each with its own policy and with its own commitments.  The attached patch
> is my attempt at this first step.  I have created the concept of an
> accounting domain, against which pages are to be accounted.  In this
> implementation there are two domains VM_AD_DEFAULT which is used to account
> normal small pages in the normal way and VM_AD_HUGETLB which is used to
> select and identify VM_HUGETLB pages.  I have not attempted to add any
> actual accounting for VM_HUGETLB pages, as currently they are prefaulted
> and thus there is always 0 outstanding commitment to track.  Obviously, if
> hugetlb was also changed to support demand paging that would need to be
> implemented.

Seems reasonable, although "vm_enough_acctdom" makes my eyes pop.  Why not
keep the "vm_enough_memory" identifier?

I've asked Stephen for comment - assuming he's OK with it I'd ask you to
finish this off please.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
  2004-03-18 20:25         ` Andrew Morton
@ 2004-03-18 21:22           ` Stephen Smalley
  2004-03-18 22:21             ` Andy Whitcroft
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen Smalley @ 2004-03-18 21:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Whitcroft, anton, ak, raybry, lse-tech, linux-ia64, lkml, mbligh

On Thu, 2004-03-18 at 15:25, Andrew Morton wrote:
> Seems reasonable, although "vm_enough_acctdom" makes my eyes pop.  Why not
> keep the "vm_enough_memory" identifier?
> 
> I've asked Stephen for comment - assuming he's OK with it I'd ask you to
> finish this off please.

To keep the name, he needs to update all callers, right?  Current patch
appears to add a static inline for security_vm_enough_memory that
retains the old interface to avoid having to update most callers.
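
Presumably something of this shape (a sketch, not the literal hunk):

	static inline int security_vm_enough_memory(long pages)
	{
		/* Old one-argument interface; charge the default domain. */
		return security_vm_enough_acctdom(VM_AD_DEFAULT, pages);
	}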

I don't have any fundamental problem with the nature of the change.  As
a side note, patch was malformed (at least as I received it), not sure
if that was just a problem on my end.

-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
  2004-03-18 21:22           ` Stephen Smalley
@ 2004-03-18 22:21             ` Andy Whitcroft
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Whitcroft @ 2004-03-18 22:21 UTC (permalink / raw)
  To: Stephen Smalley, Andrew Morton
  Cc: anton, ak, raybry, lse-tech, linux-ia64, lkml, mbligh

--On 18 March 2004 16:22 -0500 Stephen Smalley <sds@epoch.ncsc.mil> wrote:

> On Thu, 2004-03-18 at 15:25, Andrew Morton wrote:
>> Seems reasonable, although "vm_enough_acctdom" makes my eyes pop.  Why
>> not keep the "vm_enough_memory" identifier?
>>
>> I've asked Stephen for comment - assuming he's OK with it I'd ask you to
>> finish this off please.

I have no emotional attachment to any of the names.  If we can come up with
a more sensible name then all for the best.  I was trying to find something
which implied the 'measurement' aspect without overlapping any of the
other memory grouping concepts, as the domains overlap nodes and zones.

> To keep the name, he needs to update all callers, right?  Current patch
> appears to add a static inline for security_vm_enough_memory that
> retains the old interface to avoid having to update most callers.

Yes, this is the main reason for the name change.  This is at the dirty hack
stage in that sense: minimal changes to prove the concept.  I think that we
should be changing all the callers if this is going mainline in the longer
term, although they do cross 4 architectures, and with this being in the
security interface it touches selinux as well (sigh).

I'll put together a more complete change over of the interface, keep the 
name the same and see how intrusive that seems.  Then we'll get some 
testing on it.

> I don't have any fundamental problem with the nature of the change.  As
> a side note, patch was malformed (at least as I received it), not sure
> if that was just a problem on my end.

Stephen, I'll send you a copy of the patch under separate cover.

-apw

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
  2004-03-17 19:05       ` Andy Whitcroft
  2004-03-18 20:25         ` Andrew Morton
@ 2004-03-23 17:30         ` Andy Whitcroft
  2004-03-24 17:38           ` Andy Whitcroft
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Whitcroft @ 2004-03-23 17:30 UTC (permalink / raw)
  To: Anton Blanchard, Andrew Morton, Stephen Smalley
  Cc: Andi Kleen, raybry, lse-tech, linux-ia64, linux-kernel, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 1659 bytes --]

Been working on the hugetlb page commitment overcommit issues.  I have
attached a bunch of patches for review purposes; there are a number of them,
so I've not inlined them, but I can send them inline on request.

The first two patches are cosmetic fixes, either in documentation or to
remove a warning later in the game.

010-overcommit_docs:	documentation changes.
015-do_mremap_warning:	changes mremap to be more correct and prevents a
warning when later patches are applied.

The next two patches set the scene.  These are the most tested and it is
these that I hope Anton can test for us with his "real world" failure mode.
These two patches introduce the concept of a split between the default and
hugetlb memory pools and stop the hugetlb pool being accounted at all (a
minimal sketch of the resulting call pattern follows the patch list below).
This is not as clean as I would like, particularly the need to check
against VM_AD_DEFAULT in a few places.

050-mem_acctdom_core: core changes to create two accounting domains
055-mem_acctdom_arch: architecture specific changes for above.
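
Roughly, each accounting call now names the pool it charges; assuming the
constants from 050-mem_acctdom_core, the call pattern becomes:

	/* Sketch only: how a mapping picks its accounting domain and
	 * charges the matching pool (cf. the do_mmap_pgoff hunk);
	 * charge_mapping is an illustrative helper, not in the patch. */
	static int charge_mapping(struct file *file, unsigned long len)
	{
		int acctdom = VM_AD_DEFAULT;

		if (file && is_file_hugepages(file))
			acctdom = VM_AD_HUGETLB;	/* hugetlb pool */

		return security_vm_enough_memory(acctdom, len >> PAGE_SHIFT);
	}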

The next two patches are work in progress and I present them more for
review of the direction.  This was prompted by the need to check
VM_AD_DEFAULT explicitly to handle vm_committed.  The first splits the
current vm_committed into a per domain count.  The final patch is the
beginnings of making hugetlbfs account for its pages correctly; currently
it only exposes the HUGETLB accounting domain.

060-mem_acctdom_commitments: splits vm_committed into a per domain count
070-mem_acctdom_hugetlb: starts the process of using above for hugetlb.

Testing of the first four patches and comments on the direction of the
remaining patches would be appreciated.

-apw

[-- Attachment #2: 010-overcommit_docs.txt --]
[-- Type: text/plain, Size: 2200 bytes --]

---
 commoncap.c     |    2 +-
 dummy.c         |    8 ++++++++
 selinux/hooks.c |    2 +-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff -upN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-02-23 18:15:19.000000000 +0000
+++ current/security/commoncap.c	2004-03-23 15:29:41.000000000 +0000
@@ -303,7 +303,7 @@ int cap_syslog (int type)
  * succeed and -ENOMEM implies there is not.
  *
  * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-acounting
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
diff -upN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-11 20:47:31.000000000 +0000
+++ current/security/dummy.c	2004-03-23 15:29:41.000000000 +0000
@@ -101,6 +101,14 @@ static int dummy_syslog (int type)
 	return 0;
 }
 
+/*
+ * Check that a process has enough memory to allocate a new virtual
+ * mapping. 0 means there is enough memory for the allocation to
+ * succeed and -ENOMEM implies there is not.
+ *
+ * We currently support three overcommit policies, which are set via the
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
+ */
 static int dummy_vm_enough_memory(long pages)
 {
 	unsigned long free, allowed;
diff -upN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-11 20:47:31.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-23 15:29:41.000000000 +0000
@@ -1492,7 +1492,7 @@ static int selinux_syslog(int type)
  * succeed and -ENOMEM implies there is not.
  *
  * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-acounting
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.

[-- Attachment #3: 015-do_mremap_warning.txt --]
[-- Type: text/plain, Size: 1286 bytes --]

do_mremap takes a memory commitment about half way through.  Error exits
prior to this point currently check unnecessarily whether we need to release
this memory commitment.  This patch clarifies the exit requirements.

---
 mremap.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff -upN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c	2004-02-23 18:15:13.000000000 +0000
+++ current/mm/mremap.c	2004-03-23 15:29:42.000000000 +0000
@@ -401,7 +401,7 @@ unsigned long do_mremap(unsigned long ad
 	if (vma->vm_flags & VM_ACCOUNT) {
 		charged = (new_len - old_len) >> PAGE_SHIFT;
 		if (security_vm_enough_memory(charged))
-			goto out_nc;
+			goto out;
 	}
 
 	/* old_len exactly to the end of the area..
@@ -426,7 +426,7 @@ unsigned long do_mremap(unsigned long ad
 						   addr + new_len);
 			}
 			ret = addr;
-			goto out;
+			goto out_rc;
 		}
 	}
 
@@ -445,14 +445,14 @@ unsigned long do_mremap(unsigned long ad
 						vma->vm_pgoff, map_flags);
 			ret = new_addr;
 			if (new_addr & ~PAGE_MASK)
-				goto out;
+				goto out_rc;
 		}
 		ret = move_vma(vma, addr, old_len, new_len, new_addr);
 	}
-out:
+out_rc:
 	if (ret & ~PAGE_MASK)
 		vm_unacct_memory(charged);
-out_nc:
+out:
 	return ret;
 }
 

[-- Attachment #4: 050-mem_acctdom_core.txt --]
[-- Type: text/plain, Size: 14853 bytes --]

When hugetlb memory is in use we effectively split memory into
two independent and non-overlapping 'page' pools from which we can
allocate pages and against which we wish to handle commitments.
Currently all allocations are accounted against the normal page pool,
which can lead to false allocation failures.

This patch provides the framework to allow these pools to be treated
separately, preventing allocation in the hugetlb pool from being accounted
against the small page pool.  The hugetlb page pool is not accounted at all
and effectively is treated as being in overcommit mode.

The patch creates the concept of an accounting domain, against which
pages are to be accounted.  In this implementation there are two
domains: VM_AD_DEFAULT, which is used to account normal small pages
in the normal way, and VM_AD_HUGETLB, which is used to select and
identify VM_HUGETLB pages.  I have not attempted to add any actual
accounting for VM_HUGETLB pages, as currently they are prefaulted and
thus there is always 0 outstanding commitment to track.  Obviously,
if hugetlb was also changed to support demand paging that would
need to be implemented.

---
 fs/exec.c                |    2 +-
 include/linux/mm.h       |    6 ++++++
 include/linux/security.h |   15 ++++++++-------
 kernel/fork.c            |    8 +++++---
 mm/memory.c              |    1 +
 mm/mmap.c                |   18 +++++++++++-------
 mm/mprotect.c            |    5 +++--
 mm/mremap.c              |    3 ++-
 mm/shmem.c               |   10 ++++++----
 mm/swapfile.c            |    2 +-
 security/commoncap.c     |    8 +++++++-
 security/dummy.c         |    8 +++++++-
 security/selinux/hooks.c |    8 +++++++-
 13 files changed, 65 insertions(+), 29 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/exec.c current/fs/exec.c
--- reference/fs/exec.c	2004-03-11 20:47:24.000000000 +0000
+++ current/fs/exec.c	2004-03-23 15:29:40.000000000 +0000
@@ -409,7 +409,7 @@ int setup_arg_pages(struct linux_binprm 
 	if (!mpnt)
 		return -ENOMEM;
 
-	if (security_vm_enough_memory(arg_size >> PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, arg_size >> PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/mm.h current/include/linux/mm.h
--- reference/include/linux/mm.h	2004-03-11 20:47:28.000000000 +0000
+++ current/include/linux/mm.h	2004-03-23 15:29:40.000000000 +0000
@@ -112,6 +112,12 @@ struct vm_area_struct {
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 
+/* Memory accounting domains. */
+#define VM_ACCTDOM_NR	2
+#define VM_ACCTDOM(vma) (!!((vma)->vm_flags & VM_HUGETLB))
+#define VM_AD_DEFAULT	0
+#define VM_AD_HUGETLB	1
+
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/security.h current/include/linux/security.h
--- reference/include/linux/security.h	2004-03-11 20:47:28.000000000 +0000
+++ current/include/linux/security.h	2004-03-23 15:29:40.000000000 +0000
@@ -51,7 +51,7 @@ extern int cap_inode_removexattr(struct 
 extern int cap_task_post_setuid (uid_t old_ruid, uid_t old_euid, uid_t old_suid, int flags);
 extern void cap_task_reparent_to_init (struct task_struct *p);
 extern int cap_syslog (int type);
-extern int cap_vm_enough_memory (long pages);
+extern int cap_vm_enough_memory (int domain, long pages);
 
 static inline int cap_netlink_send (struct sk_buff *skb)
 {
@@ -988,7 +988,8 @@ struct swap_info_struct;
  *	@type contains the type of action.
  *	Return 0 if permission is granted.
  * @vm_enough_memory:
- *	Check permissions for allocating a new virtual mapping.
+ *      Check permissions for allocating a new virtual mapping.
+ *      @domain contains the accounting domain.
  *      @pages contains the number of pages.
  *	Return 0 if permission is granted.
  *
@@ -1022,7 +1023,7 @@ struct security_operations {
 	int (*quotactl) (int cmds, int type, int id, struct super_block * sb);
 	int (*quota_on) (struct file * f);
 	int (*syslog) (int type);
-	int (*vm_enough_memory) (long pages);
+	int (*vm_enough_memory) (int domain, long pages);
 
 	int (*bprm_alloc_security) (struct linux_binprm * bprm);
 	void (*bprm_free_security) (struct linux_binprm * bprm);
@@ -1276,9 +1277,9 @@ static inline int security_syslog(int ty
 	return security_ops->syslog(type);
 }
 
-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_memory(int domain, long pages)
 {
-	return security_ops->vm_enough_memory(pages);
+	return security_ops->vm_enough_memory(domain, pages);
 }
 
 static inline int security_bprm_alloc (struct linux_binprm *bprm)
@@ -1947,9 +1948,9 @@ static inline int security_syslog(int ty
 	return cap_syslog(type);
 }
 
-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_memory(int domain, long pages)
 {
-	return cap_vm_enough_memory(pages);
+	return cap_vm_enough_memory(domain, pages);
 }
 
 static inline int security_bprm_alloc (struct linux_binprm *bprm)
diff -X /home/apw/lib/vdiff.excl -rupN reference/kernel/fork.c current/kernel/fork.c
--- reference/kernel/fork.c	2004-03-11 20:47:29.000000000 +0000
+++ current/kernel/fork.c	2004-03-23 16:29:48.000000000 +0000
@@ -301,9 +301,10 @@ static inline int dup_mmap(struct mm_str
 			continue;
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
-			if (security_vm_enough_memory(len))
+			if (security_vm_enough_memory(VM_ACCTDOM(mpnt), len))
 				goto fail_nomem;
-			charge += len;
+			if (VM_ACCTDOM(mpnt) == VM_AD_DEFAULT)
+				charge += len;
 		}
 		tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 		if (!tmp)
@@ -358,7 +359,8 @@ out:
 fail_nomem:
 	retval = -ENOMEM;
 fail:
-	vm_unacct_memory(charge);
+	if (charge)
+		vm_unacct_memory(charge);
 	goto out;
 }
 static inline int mm_alloc_pgd(struct mm_struct * mm)
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c	2004-03-11 20:47:29.000000000 +0000
+++ current/mm/memory.c	2004-03-23 16:29:48.000000000 +0000
@@ -551,6 +551,7 @@ int unmap_vmas(struct mmu_gather **tlbp,
 		if (end <= vma->vm_start)
 			continue;
 
+		/* We assume that only accountable VMAs are VM_ACCOUNT. */
 		if (vma->vm_flags & VM_ACCOUNT)
 			*nr_accounted += (end - start) >> PAGE_SHIFT;
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c	2004-03-11 20:47:29.000000000 +0000
+++ current/mm/mmap.c	2004-03-23 16:29:48.000000000 +0000
@@ -473,8 +473,11 @@ unsigned long do_mmap_pgoff(struct file 
 	int error;
 	struct rb_node ** rb_link, * rb_parent;
 	unsigned long charged = 0;
+	long acctdom = VM_AD_DEFAULT;
 
 	if (file) {
+		if (is_file_hugepages(file))
+			acctdom = VM_AD_HUGETLB;
 		if (!file->f_op || !file->f_op->mmap)
 			return -ENODEV;
 
@@ -591,7 +594,8 @@ munmap_back:
 	    > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
 
-	if (!(flags & MAP_NORESERVE) || sysctl_overcommit_memory > 1) {
+	if (acctdom == VM_AD_DEFAULT && (!(flags & MAP_NORESERVE) || 
+	    sysctl_overcommit_memory > 1)) {
 		if (vm_flags & VM_SHARED) {
 			/* Check memory availability in shmem_file_setup? */
 			vm_flags |= VM_ACCOUNT;
@@ -600,7 +604,7 @@ munmap_back:
 			 * Private writable mapping: check memory availability
 			 */
 			charged = len >> PAGE_SHIFT;
-			if (security_vm_enough_memory(charged))
+			if (security_vm_enough_memory(acctdom, charged))
 				return -ENOMEM;
 			vm_flags |= VM_ACCOUNT;
 		}
@@ -909,8 +913,8 @@ int expand_stack(struct vm_area_struct *
  	spin_lock(&vma->vm_mm->page_table_lock);
 	grow = (address - vma->vm_end) >> PAGE_SHIFT;
 
-	/* Overcommit.. */
-	if (security_vm_enough_memory(grow)) {
+	/* Overcommit ... assume stack is in normal memory */
+	if (security_vm_enough_memory(VM_AD_DEFAULT, grow)) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		return -ENOMEM;
 	}
@@ -963,8 +967,8 @@ int expand_stack(struct vm_area_struct *
  	spin_lock(&vma->vm_mm->page_table_lock);
 	grow = (vma->vm_start - address) >> PAGE_SHIFT;
 
-	/* Overcommit.. */
-	if (security_vm_enough_memory(grow)) {
+	/* Overcommit ... assume stack is in normal memory */
+	if (security_vm_enough_memory(VM_AD_DEFAULT, grow)) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		return -ENOMEM;
 	}
@@ -1361,7 +1365,7 @@ unsigned long do_brk(unsigned long addr,
 	if (mm->map_count > MAX_MAP_COUNT)
 		return -ENOMEM;
 
-	if (security_vm_enough_memory(len >> PAGE_SHIFT))
+	if (security_vm_enough_memory(VM_AD_DEFAULT, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
 	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mprotect.c current/mm/mprotect.c
--- reference/mm/mprotect.c	2004-01-09 06:59:26.000000000 +0000
+++ current/mm/mprotect.c	2004-03-23 16:29:48.000000000 +0000
@@ -173,9 +173,10 @@ mprotect_fixup(struct vm_area_struct *vm
 	 * a MAP_NORESERVE private mapping to writable will now reserve.
 	 */
 	if (newflags & VM_WRITE) {
-		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
+		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED)) &&
+				VM_ACCTDOM(vma) == VM_AD_DEFAULT) {
 			charged = (end - start) >> PAGE_SHIFT;
-			if (security_vm_enough_memory(charged))
+			if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
 				return -ENOMEM;
 			newflags |= VM_ACCOUNT;
 		}
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/mremap.c	2004-03-23 16:29:48.000000000 +0000
@@ -398,9 +398,10 @@ unsigned long do_mremap(unsigned long ad
 	    > current->rlim[RLIMIT_AS].rlim_cur)
 		goto out;
 
+	/* We assume that only accountable VMAs are VM_ACCOUNT. */
 	if (vma->vm_flags & VM_ACCOUNT) {
 		charged = (new_len - old_len) >> PAGE_SHIFT;
-		if (security_vm_enough_memory(charged))
+		if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
 			goto out;
 	}
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/shmem.c current/mm/shmem.c
--- reference/mm/shmem.c	2004-02-04 15:09:17.000000000 +0000
+++ current/mm/shmem.c	2004-03-23 15:29:40.000000000 +0000
@@ -526,7 +526,7 @@ static int shmem_notify_change(struct de
 	 	 */
 		change = VM_ACCT(attr->ia_size) - VM_ACCT(inode->i_size);
 		if (change > 0) {
-			if (security_vm_enough_memory(change))
+			if (security_vm_enough_memory(VM_AD_DEFAULT, change))
 				return -ENOMEM;
 		} else if (attr->ia_size < inode->i_size) {
 			vm_unacct_memory(-change);
@@ -1193,7 +1193,8 @@ shmem_file_write(struct file *file, cons
 	maxpos = inode->i_size;
 	if (maxpos < pos + count) {
 		maxpos = pos + count;
-		if (security_vm_enough_memory(VM_ACCT(maxpos) - VM_ACCT(inode->i_size))) {
+		if (security_vm_enough_memory(VM_AD_DEFAULT,
+				VM_ACCT(maxpos) - VM_ACCT(inode->i_size))) {
 			err = -ENOMEM;
 			goto out;
 		}
@@ -1554,7 +1555,7 @@ static int shmem_symlink(struct inode *d
 		memcpy(info, symname, len);
 		inode->i_op = &shmem_symlink_inline_operations;
 	} else {
-		if (security_vm_enough_memory(VM_ACCT(1))) {
+		if (security_vm_enough_memory(VM_AD_DEFAULT, VM_ACCT(1))) {
 			iput(inode);
 			return -ENOMEM;
 		}
@@ -1950,7 +1951,8 @@ struct file *shmem_file_setup(char *name
 	if (size > SHMEM_MAX_BYTES)
 		return ERR_PTR(-EINVAL);
 
-	if ((flags & VM_ACCOUNT) && security_vm_enough_memory(VM_ACCT(size)))
+	if ((flags & VM_ACCOUNT) && security_vm_enough_memory(VM_AD_DEFAULT,
+			VM_ACCT(size)))
 		return ERR_PTR(-ENOMEM);
 
 	error = -ENOMEM;
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/swapfile.c current/mm/swapfile.c
--- reference/mm/swapfile.c	2004-02-23 18:15:13.000000000 +0000
+++ current/mm/swapfile.c	2004-03-23 15:29:40.000000000 +0000
@@ -1048,7 +1048,7 @@ asmlinkage long sys_swapoff(const char _
 		swap_list_unlock();
 		goto out_dput;
 	}
-	if (!security_vm_enough_memory(p->pages))
+	if (!security_vm_enough_memory(VM_AD_DEFAULT, p->pages))
 		vm_unacct_memory(p->pages);
 	else {
 		err = -ENOMEM;
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-03-23 15:29:41.000000000 +0000
+++ current/security/commoncap.c	2004-03-23 15:29:40.000000000 +0000
@@ -308,10 +308,16 @@ int cap_syslog (int type)
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
  */
-int cap_vm_enough_memory(long pages)
+int cap_vm_enough_memory(int domain, long pages)
 {
 	unsigned long free, allowed;
 
+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);
 
         /*
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-23 15:29:41.000000000 +0000
+++ current/security/dummy.c	2004-03-23 15:29:40.000000000 +0000
@@ -109,10 +109,16 @@ static int dummy_syslog (int type)
  * We currently support three overcommit policies, which are set via the
  * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  */
-static int dummy_vm_enough_memory(long pages)
+static int dummy_vm_enough_memory(int domain, long pages)
 {
 	unsigned long free, allowed;
 
+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);
 
         /*
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-23 15:29:41.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-23 15:29:40.000000000 +0000
@@ -1497,12 +1497,18 @@ static int selinux_syslog(int type)
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
  */
-static int selinux_vm_enough_memory(long pages)
+static int selinux_vm_enough_memory(int domain, long pages)
 {
 	unsigned long free, allowed;
 	int rc;
 	struct task_security_struct *tsec = current->security;
 
+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);
 
         /*

[-- Attachment #5: 055-mem_acctdom_arch.txt --]
[-- Type: text/plain, Size: 2699 bytes --]

---
 ia64/ia32/binfmt_elf32.c  |    3 ++-
 mips/kernel/sysirix.c     |    3 ++-
 s390/kernel/compat_exec.c |    3 ++-
 x86_64/ia32/ia32_binfmt.c |    3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff -upN reference/arch/ia64/ia32/binfmt_elf32.c current/arch/ia64/ia32/binfmt_elf32.c
--- reference/arch/ia64/ia32/binfmt_elf32.c	2004-03-11 20:47:12.000000000 +0000
+++ current/arch/ia64/ia32/binfmt_elf32.c	2004-03-23 15:29:42.000000000 +0000
@@ -168,7 +168,8 @@ ia32_setup_arg_pages (struct linux_binpr
 	if (!mpnt)
 		return -ENOMEM;
 
-	if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, (IA32_STACK_TOP -
+			(PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}
diff -upN reference/arch/mips/kernel/sysirix.c current/arch/mips/kernel/sysirix.c
--- reference/arch/mips/kernel/sysirix.c	2004-03-11 20:47:13.000000000 +0000
+++ current/arch/mips/kernel/sysirix.c	2004-03-23 15:29:42.000000000 +0000
@@ -578,7 +578,8 @@ asmlinkage int irix_brk(unsigned long br
 	/*
 	 * Check if we have enough memory..
 	 */
-	if (security_vm_enough_memory((newbrk-oldbrk) >> PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT,
+			(newbrk-oldbrk) >> PAGE_SHIFT)) {
 		ret = -ENOMEM;
 		goto out;
 	}
diff -upN reference/arch/s390/kernel/compat_exec.c current/arch/s390/kernel/compat_exec.c
--- reference/arch/s390/kernel/compat_exec.c	2004-01-09 06:59:57.000000000 +0000
+++ current/arch/s390/kernel/compat_exec.c	2004-03-23 15:29:42.000000000 +0000
@@ -56,7 +56,8 @@ int setup_arg_pages32(struct linux_binpr
 	if (!mpnt) 
 		return -ENOMEM; 

-	if (security_vm_enough_memory((STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, (STACK_TOP -
+			(PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}
diff -upN reference/arch/x86_64/ia32/ia32_binfmt.c current/arch/x86_64/ia32/ia32_binfmt.c
--- reference/arch/x86_64/ia32/ia32_binfmt.c	2004-03-11 20:47:15.000000000 +0000
+++ current/arch/x86_64/ia32/ia32_binfmt.c	2004-03-23 15:29:42.000000000 +0000
@@ -345,7 +345,8 @@ int setup_arg_pages(struct linux_binprm 
 	if (!mpnt) 
 		return -ENOMEM; 

-	if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, (IA32_STACK_TOP -
+			(PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}

[-- Attachment #6: 060-mem_acctdom_commitments.txt --]
[-- Type: text/plain, Size: 18519 bytes --]

Currently only normal page commitments are tracked.  This patch
provides a framework for tracking page commitments in multiple
independent domains.  With this patch vm_committed_space becomes a
per-domain counter.
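
In outline, using the names introduced below, the unmap paths accumulate
charges per domain in a madv_t and release them in a single call:

	/* Sketch only: per-domain accumulation replacing the old single
	 * nr_accounted counter in the unmap paths. */
	madv_t nr_accounted = MADV_NONE;

	for (; vma; vma = vma->vm_next) {
		if (vma->vm_flags & VM_ACCOUNT)
			madv_add(&nr_accounted, VM_ACCTDOM(vma),
				 (vma->vm_end - vma->vm_start) >> PAGE_SHIFT);
	}
	vm_unacct_memory_domains(&nr_accounted);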

---
 fs/proc/proc_misc.c      |    2 +-
 include/linux/mm.h       |   13 +++++++++++--
 include/linux/mman.h     |   12 ++++++------
 kernel/fork.c            |    8 +++-----
 mm/memory.c              |   12 +++++++++---
 mm/mmap.c                |   23 ++++++++++++-----------
 mm/mprotect.c            |    5 ++---
 mm/mremap.c              |    2 +-
 mm/nommu.c               |    3 ++-
 mm/shmem.c               |   13 +++++++------
 mm/swap.c                |   17 +++++++++++++----
 mm/swapfile.c            |    4 +++-
 security/commoncap.c     |   10 +++++-----
 security/dummy.c         |   10 +++++-----
 security/selinux/hooks.c |   10 +++++-----
 15 files changed, 85 insertions(+), 59 deletions(-)

diff -upN reference/fs/proc/proc_misc.c current/fs/proc/proc_misc.c
--- reference/fs/proc/proc_misc.c	2004-03-11 20:47:27.000000000 +0000
+++ current/fs/proc/proc_misc.c	2004-03-23 15:29:43.000000000 +0000
@@ -174,7 +174,7 @@ static int meminfo_read_proc(char *page,
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = atomic_read(&vm_committed_space);
+	committed = atomic_read(&vm_committed_space[VM_AD_DEFAULT]);
 
 	vmtot = (VMALLOC_END-VMALLOC_START)>>10;
 	vmi = get_vmalloc_info();
diff -upN reference/include/linux/mman.h current/include/linux/mman.h
--- reference/include/linux/mman.h	2004-01-09 06:59:09.000000000 +0000
+++ current/include/linux/mman.h	2004-03-23 15:29:43.000000000 +0000
@@ -12,20 +12,20 @@
 
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
-extern atomic_t vm_committed_space;
+extern atomic_t vm_committed_space[];
 
 #ifdef CONFIG_SMP
-extern void vm_acct_memory(long pages);
+extern void vm_acct_memory(int domain, long pages);
 #else
-static inline void vm_acct_memory(long pages)
+static inline void vm_acct_memory(int domain, long pages)
 {
-	atomic_add(pages, &vm_committed_space);
+	atomic_add(pages, &vm_committed_space[domain]);
 }
 #endif
 
-static inline void vm_unacct_memory(long pages)
+static inline void vm_unacct_memory(int domain, long pages)
 {
-	vm_acct_memory(-pages);
+	vm_acct_memory(domain, -pages);
 }
 
 /*
diff -upN reference/include/linux/mm.h current/include/linux/mm.h
--- reference/include/linux/mm.h	2004-03-23 15:29:42.000000000 +0000
+++ current/include/linux/mm.h	2004-03-23 15:29:43.000000000 +0000
@@ -117,7 +117,16 @@ struct vm_area_struct {
 #define VM_ACCTDOM(vma) (!!((vma)->vm_flags & VM_HUGETLB))
 #define VM_AD_DEFAULT	0
 #define VM_AD_HUGETLB	1
-
+typedef struct {
+	long vec[VM_ACCTDOM_NR];
+} madv_t;
+#define MADV_NONE { {[0 ... VM_ACCTDOM_NR-1] =  0UL} }
+static inline void madv_add(madv_t *madv, int domain, long size)
+{
+	madv->vec[domain] += size;
+}
+void vm_unacct_memory_domains(madv_t *madv);
+  
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
@@ -440,7 +449,7 @@ void zap_page_range(struct vm_area_struc
 			unsigned long size);
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted);
+		unsigned long end_addr, madv_t *nr_accounted);
 void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			unsigned long address, unsigned long size);
 void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
diff -upN reference/kernel/fork.c current/kernel/fork.c
--- reference/kernel/fork.c	2004-03-23 15:29:42.000000000 +0000
+++ current/kernel/fork.c	2004-03-23 15:29:43.000000000 +0000
@@ -267,7 +267,7 @@ static inline int dup_mmap(struct mm_str
 	struct vm_area_struct * mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
-	unsigned long charge = 0;
+	madv_t charge = MADV_NONE;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_mm(current->mm);
@@ -303,8 +303,7 @@ static inline int dup_mmap(struct mm_str
 			unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 			if (security_vm_enough_memory(VM_ACCTDOM(mpnt), len))
 				goto fail_nomem;
-			if (VM_ACCTDOM(mpnt) == VM_AD_DEFAULT)
-				charge += len;
+			madv_add(&charge, VM_ACCTDOM(mpnt), len);
 		}
 		tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 		if (!tmp)
@@ -359,8 +358,7 @@ out:
 fail_nomem:
 	retval = -ENOMEM;
 fail:
-	if (charge)
-		vm_unacct_memory(charge);
+	vm_unacct_memory_domains(&charge);
 	goto out;
 }
 static inline int mm_alloc_pgd(struct mm_struct * mm)
diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/memory.c	2004-03-23 15:29:43.000000000 +0000
@@ -524,7 +524,7 @@ void unmap_page_range(struct mmu_gather 
  */
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted)
+		unsigned long end_addr, madv_t *nr_accounted)
 {
 	unsigned long zap_bytes = ZAP_BLOCK_SIZE;
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
@@ -553,7 +553,8 @@ int unmap_vmas(struct mmu_gather **tlbp,
 
 		/* We assume that only accountable VMAs are VM_ACCOUNT. */
 		if (vma->vm_flags & VM_ACCOUNT)
-			*nr_accounted += (end - start) >> PAGE_SHIFT;
+			madv_add(nr_accounted,
+				VM_ACCTDOM(vma), (end - start) >> PAGE_SHIFT);
 
 		ret++;
 		while (start != end) {
@@ -602,7 +603,12 @@ void zap_page_range(struct vm_area_struc
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather *tlb;
 	unsigned long end = address + size;
-	unsigned long nr_accounted = 0;
+	madv_t nr_accounted = MADV_NONE;
+
+	/* XXX: we seem to avoid thinking about the memory accounting
+	 * for both the hugepages case, where we don't bother even tracking it,
+	 * in the normal path where we figure it out and do nothing with it??
+	 */
 
 	might_sleep();
 
diff -upN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/mmap.c	2004-03-23 15:29:43.000000000 +0000
@@ -54,7 +54,8 @@ pgprot_t protection_map[16] = {
 
 int sysctl_overcommit_memory = 0;	/* default is heuristic overcommit */
 int sysctl_overcommit_ratio = 50;	/* default is 50% */
-atomic_t vm_committed_space = ATOMIC_INIT(0);
+atomic_t vm_committed_space[VM_ACCTDOM_NR] = 
+     { [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };
 
 EXPORT_SYMBOL(sysctl_overcommit_memory);
 EXPORT_SYMBOL(sysctl_overcommit_ratio);
@@ -594,8 +595,8 @@ munmap_back:
 	    > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
 
-	if (acctdom == VM_AD_DEFAULT && (!(flags & MAP_NORESERVE) || 
-	    sysctl_overcommit_memory > 1)) {
+	if (!(flags & MAP_NORESERVE) || 
+	    (acctdom == VM_AD_DEFAULT && sysctl_overcommit_memory > 1)) {
 		if (vm_flags & VM_SHARED) {
 			/* Check memory availability in shmem_file_setup? */
 			vm_flags |= VM_ACCOUNT;
@@ -713,7 +714,7 @@ free_vma:
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
-		vm_unacct_memory(charged);
+		vm_unacct_memory(acctdom, charged);
 	return error;
 }
 
@@ -923,7 +924,7 @@ int expand_stack(struct vm_area_struct *
 			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
 			current->rlim[RLIMIT_AS].rlim_cur) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
-		vm_unacct_memory(grow);
+		vm_unacct_memory(VM_AD_DEFAULT, grow);
 		return -ENOMEM;
 	}
 	vma->vm_end = address;
@@ -977,7 +978,7 @@ int expand_stack(struct vm_area_struct *
 			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
 			current->rlim[RLIMIT_AS].rlim_cur) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
-		vm_unacct_memory(grow);
+		vm_unacct_memory(VM_AD_DEFAULT, grow);
 		return -ENOMEM;
 	}
 	vma->vm_start = address;
@@ -1135,12 +1136,12 @@ static void unmap_region(struct mm_struc
 	unsigned long end)
 {
 	struct mmu_gather *tlb;
-	unsigned long nr_accounted = 0;
+	madv_t nr_accounted = MADV_NONE;
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted);
-	vm_unacct_memory(nr_accounted);
+	vm_unacct_memory_domains(&nr_accounted);
 
 	if (is_hugepage_only_range(start, end - start))
 		hugetlb_free_pgtables(tlb, prev, start, end);
@@ -1380,7 +1381,7 @@ unsigned long do_brk(unsigned long addr,
 	 */
 	vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 	if (!vma) {
-		vm_unacct_memory(len >> PAGE_SHIFT);
+		vm_unacct_memory(VM_AD_DEFAULT, len >> PAGE_SHIFT);
 		return -ENOMEM;
 	}
 
@@ -1413,7 +1414,7 @@ void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather *tlb;
 	struct vm_area_struct *vma;
-	unsigned long nr_accounted = 0;
+	madv_t nr_accounted = MADV_NONE;
 
 	profile_exit_mmap(mm);
  
@@ -1426,7 +1427,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use ~0UL here to ensure all VMAs in the mm are unmapped */
 	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
 					~0UL, &nr_accounted);
-	vm_unacct_memory(nr_accounted);
+	vm_unacct_memory_domains(&nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
diff -upN reference/mm/mprotect.c current/mm/mprotect.c
--- reference/mm/mprotect.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/mprotect.c	2004-03-23 15:29:43.000000000 +0000
@@ -173,8 +173,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	 * a MAP_NORESERVE private mapping to writable will now reserve.
 	 */
 	if (newflags & VM_WRITE) {
-		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED)) &&
-				VM_ACCTDOM(vma) == VM_AD_DEFAULT) {
+		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
 			charged = (end - start) >> PAGE_SHIFT;
 			if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
 				return -ENOMEM;
@@ -218,7 +217,7 @@ success:
 	return 0;
 
 fail:
-	vm_unacct_memory(charged);
+	vm_unacct_memory(VM_ACCTDOM(vma), charged);
 	return error;
 }
 
diff -upN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/mremap.c	2004-03-23 15:29:43.000000000 +0000
@@ -452,7 +452,7 @@ unsigned long do_mremap(unsigned long ad
 	}
 out_rc:
 	if (ret & ~PAGE_MASK)
-		vm_unacct_memory(charged);
+		vm_unacct_memory(VM_ACCTDOM(vma), charged);
 out:
 	return ret;
 }
diff -upN reference/mm/nommu.c current/mm/nommu.c
--- reference/mm/nommu.c	2004-02-04 15:09:16.000000000 +0000
+++ current/mm/nommu.c	2004-03-23 15:29:43.000000000 +0000
@@ -29,7 +29,8 @@ struct page *mem_map;
 unsigned long max_mapnr;
 unsigned long num_physpages;
 unsigned long askedalloc, realalloc;
-atomic_t vm_committed_space = ATOMIC_INIT(0);
+atomic_t vm_committed_space[VM_ACCTDOM_NR] = 
+     { [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };
 int sysctl_overcommit_memory; /* default is heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 
diff -upN reference/mm/shmem.c current/mm/shmem.c
--- reference/mm/shmem.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/shmem.c	2004-03-23 15:29:43.000000000 +0000
@@ -529,7 +529,7 @@ static int shmem_notify_change(struct de
 			if (security_vm_enough_memory(VM_AD_DEFAULT, change))
 				return -ENOMEM;
 		} else if (attr->ia_size < inode->i_size) {
-			vm_unacct_memory(-change);
+			vm_unacct_memory(VM_AD_DEFAULT, -change);
 			/*
 			 * If truncating down to a partial page, then
 			 * if that page is already allocated, hold it
@@ -564,7 +564,7 @@ static int shmem_notify_change(struct de
 	if (page)
 		page_cache_release(page);
 	if (error)
-		vm_unacct_memory(change);
+		vm_unacct_memory(VM_AD_DEFAULT, change);
 	return error;
 }
 
@@ -578,7 +578,7 @@ static void shmem_delete_inode(struct in
 		list_del(&info->list);
 		spin_unlock(&shmem_ilock);
 		if (info->flags & VM_ACCOUNT)
-			vm_unacct_memory(VM_ACCT(inode->i_size));
+			vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(inode->i_size));
 		inode->i_size = 0;
 		shmem_truncate(inode);
 	}
@@ -1274,7 +1274,8 @@ shmem_file_write(struct file *file, cons
 
 	/* Short writes give back address space */
 	if (inode->i_size != maxpos)
-		vm_unacct_memory(VM_ACCT(maxpos) - VM_ACCT(inode->i_size));
+		vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(maxpos) -
+			VM_ACCT(inode->i_size));
 out:
 	up(&inode->i_sem);
 	return err;
@@ -1561,7 +1562,7 @@ static int shmem_symlink(struct inode *d
 		}
 		error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL);
 		if (error) {
-			vm_unacct_memory(VM_ACCT(1));
+			vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(1));
 			iput(inode);
 			return error;
 		}
@@ -1991,7 +1992,7 @@ put_dentry:
 	dput(dentry);
 put_memory:
 	if (flags & VM_ACCOUNT)
-		vm_unacct_memory(VM_ACCT(size));
+		vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(size));
 	return ERR_PTR(error);
 }
 
diff -upN reference/mm/swap.c current/mm/swap.c
--- reference/mm/swap.c	2004-03-11 20:47:29.000000000 +0000
+++ current/mm/swap.c	2004-03-23 15:29:43.000000000 +0000
@@ -365,17 +365,18 @@ unsigned int pagevec_lookup(struct pagev
  */
 #define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
 
-static DEFINE_PER_CPU(long, committed_space) = 0;
+/* XXX: zero this????? */
+static DEFINE_PER_CPU(long, committed_space[VM_ACCTDOM_NR]);
 
-void vm_acct_memory(long pages)
+void vm_acct_memory(int domain, long pages)
 {
 	long *local;
 
 	preempt_disable();
-	local = &__get_cpu_var(committed_space);
+	local = &__get_cpu_var(committed_space[domain]);
 	*local += pages;
 	if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
-		atomic_add(*local, &vm_committed_space);
+		atomic_add(*local, &vm_committed_space[domain]);
 		*local = 0;
 	}
 	preempt_enable();
@@ -383,6 +384,14 @@ void vm_acct_memory(long pages)
 EXPORT_SYMBOL(vm_acct_memory);
 #endif
 
+void vm_unacct_memory_domains(madv_t *adv)
+{
+	if (adv->vec[VM_AD_DEFAULT])
+		vm_unacct_memory(VM_AD_DEFAULT, adv->vec[VM_AD_DEFAULT]);
+	if (adv->vec[VM_AD_HUGETLB])
+		vm_unacct_memory(VM_AD_HUGETLB, adv->vec[VM_AD_HUGETLB]);
+}
+
 #ifdef CONFIG_SMP
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
diff -upN reference/mm/swapfile.c current/mm/swapfile.c
--- reference/mm/swapfile.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/swapfile.c	2004-03-23 15:29:43.000000000 +0000
@@ -1048,8 +1048,10 @@ asmlinkage long sys_swapoff(const char _
 		swap_list_unlock();
 		goto out_dput;
 	}
+	/* There is an assumption here that we may only have swapped things
+	 * from the default memory accounting domain to this device. */
 	if (!security_vm_enough_memory(VM_AD_DEFAULT, p->pages))
-		vm_unacct_memory(p->pages);
+		vm_unacct_memory(VM_AD_DEFAULT, p->pages);
 	else {
 		err = -ENOMEM;
 		swap_list_unlock();
diff -upN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-03-23 15:29:42.000000000 +0000
+++ current/security/commoncap.c	2004-03-23 15:29:43.000000000 +0000
@@ -312,14 +312,14 @@ int cap_vm_enough_memory(int domain, lon
 {
 	unsigned long free, allowed;
 
+	vm_acct_memory(domain, pages);
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
 	if (domain != VM_AD_DEFAULT)
 		return 0;
 
-	vm_acct_memory(pages);
-
         /*
 	 * Sometimes we want to use more memory than we have
 	 */
@@ -360,17 +360,17 @@ int cap_vm_enough_memory(int domain, lon
 
 		if (free > pages)
 			return 0;
-		vm_unacct_memory(pages);
+		vm_unacct_memory(domain, pages);
 		return -ENOMEM;
 	}
 
 	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
-	if (atomic_read(&vm_committed_space) < allowed)
+	if (atomic_read(&vm_committed_space[domain]) < allowed)
 		return 0;
 
-	vm_unacct_memory(pages);
+	vm_unacct_memory(domain, pages);
 
 	return -ENOMEM;
 }
diff -upN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-23 15:29:42.000000000 +0000
+++ current/security/dummy.c	2004-03-23 15:29:43.000000000 +0000
@@ -113,14 +113,14 @@ static int dummy_vm_enough_memory(int do
 {
 	unsigned long free, allowed;
 
+	vm_acct_memory(domain, pages);
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
 	if (domain != VM_AD_DEFAULT)
 		return 0;
 
-	vm_acct_memory(pages);
-
         /*
 	 * Sometimes we want to use more memory than we have
 	 */
@@ -148,17 +148,17 @@ static int dummy_vm_enough_memory(int do
 
 		if (free > pages)
 			return 0;
-		vm_unacct_memory(pages);
+		vm_unacct_memory(domain, pages);
 		return -ENOMEM;
 	}
 
 	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
-	if (atomic_read(&vm_committed_space) < allowed)
+	if (atomic_read(&vm_committed_space[VM_AD_DEFAULT]) < allowed)
 		return 0;
 
-	vm_unacct_memory(pages);
+	vm_unacct_memory(domain, pages);
 
 	return -ENOMEM;
 }
diff -upN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-23 15:29:42.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-23 15:29:43.000000000 +0000
@@ -1503,14 +1503,14 @@ static int selinux_vm_enough_memory(int 
 	int rc;
 	struct task_security_struct *tsec = current->security;
 
+	vm_acct_memory(domain, pages);
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
 	if (domain != VM_AD_DEFAULT)
 		return 0;
 
-	vm_acct_memory(pages);
-
         /*
 	 * Sometimes we want to use more memory than we have
 	 */
@@ -1547,17 +1547,17 @@ static int selinux_vm_enough_memory(int 
 
 		if (free > pages)
 			return 0;
-		vm_unacct_memory(pages);
+		vm_unacct_memory(domain, pages);
 		return -ENOMEM;
 	}
 
 	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
-	if (atomic_read(&vm_committed_space) < allowed)
+	if (atomic_read(&vm_committed_space[VM_AD_DEFAULT]) < allowed)
 		return 0;
 
-	vm_unacct_memory(pages);
+	vm_unacct_memory(domain, pages);
 
 	return -ENOMEM;
 }

[-- Attachment #7: 070-mem_acctdom_hugetlb.txt --]
[-- Type: text/plain, Size: 1105 bytes --]

---
 hugetlbpage.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff -upN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c	2004-01-09 07:00:02.000000000 +0000
+++ current/arch/i386/mm/hugetlbpage.c	2004-03-23 15:29:41.000000000 +0000
@@ -15,7 +15,7 @@
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <linux/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -513,13 +513,17 @@ module_init(hugetlb_init);
 
 int hugetlb_report_meminfo(char *buf)
 {
+	int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
-			"Hugepagesize:    %5lu kB\n",
+			"Hugepagesize:    %5lu kB\n"
+			"HugeCommited_AS: %8u kB\n",
 			htlbzone_pages,
 			htlbpagemem,
-			HPAGE_SIZE/1024);
+			HPAGE_SIZE/1024,
+			K(committed));
 }
 
 int is_hugepage_mem_enough(size_t size)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
  2004-03-23 17:30         ` Andy Whitcroft
@ 2004-03-24 17:38           ` Andy Whitcroft
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Whitcroft @ 2004-03-24 17:38 UTC (permalink / raw)
  To: Anton Blanchard, Andrew Morton, Stephen Smalley
  Cc: Andi Kleen, raybry, lse-tech, linux-ia64, linux-kernel, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 1867 bytes --]

Here is the next installment of HUGETLB memory accounting.  With the stack
applied (to 2.6.4) HUGETLB allocations are handled separately from those
for normal pages.  The set has been tested lightly on i386.  Other
architectures have not yet been compiled (testers please).  Currently there
are no tunables for overcommit.  Again patches attached, ask if you need
them inline.

This patch has an interesting and I believe correct side effect.  Memory is
now committed when a hugetlb segment is initially requested, even before it
is attached.  Thus it is no longer possible to shmget more large segments
than can be honoured, only to have them fail later at attach time.
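
From userspace the change in failure mode looks like this (a sketch; the
size is illustrative and SHM_HUGETLB is taken from the kernel headers in
case libc does not define it):

	#include <stdio.h>
	#include <sys/ipc.h>
	#include <sys/shm.h>

	#ifndef SHM_HUGETLB
	#define SHM_HUGETLB 04000	/* from <linux/shm.h> */
	#endif

	int main(void)
	{
		/* With strict hugetlb commitment an over-large request is
		 * denied here, rather than succeeding and then failing at
		 * shmat() time. */
		int id = shmget(IPC_PRIVATE, 1UL << 30,
				IPC_CREAT | SHM_HUGETLB | 0600);
		if (id < 0)
			perror("shmget");
		return 0;
	}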

The patch list below ...  Comments??

-apw

010-overcommit_docs:		documentation changes
015-do_mremap_warning:		cleanup exit handling to prevent warning
050-mem_acctdom_core:		core changes to create two accounting domains
055-mem_acctdom_arch:		architecture specific changes for above
060-mem_acctdom_commitments:	splits vm_committed into a per domain count
070-mem_acctdom_hugetlb: 	use vm_committed to track HUGETLB usage
075-mem_acctdom_hugetlb_arch:	architecture specific changes for above

The first two patches are cosmetic fixes, either in documentation or to
remove a warning later in the game.

The third and fourth patches set the scene.  These are the most
tested and it is these that I hope Anton can test for us with his "real
world" failure mode. These two patches introduce the concept of a split
between the default and hugetlb memory pools and stop the hugetlb pool being
accounted at all.  This is not as clean as I would like, particularly the
need to check against VM_AD_DEFAULT in a few places.

The fifth patch splits the vm_committed count into a per domain count and
exposes the domain in the interface.

The sixth and seventh patches convert hugetlb to use the vm_commitment
interfaces exposed above.
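
The assumed shape of the hugetlbfs side (a sketch, not the actual hunks,
which live in the 070 patch) is to charge the whole segment against the
hugetlb domain when the file is set up, and to release the charge when it
is freed:

	if (security_vm_enough_memory(VM_AD_HUGETLB, size >> PAGE_SHIFT))
		return ERR_PTR(-ENOMEM);
	...
	vm_unacct_memory(VM_AD_HUGETLB, size >> PAGE_SHIFT);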

[-- Attachment #2: 075-mem_acctdom_hugetlb_arch.txt --]
[-- Type: text/plain, Size: 6232 bytes --]

---
 i386/mm/hugetlbpage.c    |   16 +++++++++++++---
 ia64/mm/hugetlbpage.c    |   16 +++++++++++++---
 ppc64/mm/hugetlbpage.c   |   16 +++++++++++++---
 sparc64/mm/hugetlbpage.c |   16 +++++++++++++---
 4 files changed, 52 insertions(+), 12 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c	2004-01-09 07:00:02.000000000 +0000
+++ current/arch/i386/mm/hugetlbpage.c	2004-03-24 18:03:05.000000000 +0000
@@ -15,7 +15,7 @@
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <linux/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -513,13 +513,17 @@ module_init(hugetlb_init);
 
 int hugetlb_report_meminfo(char *buf)
 {
+	int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
-			"Hugepagesize:    %5lu kB\n",
+			"Hugepagesize:    %5lu kB\n"
+			"HugeCommited_AS: %8u kB\n",
 			htlbzone_pages,
 			htlbpagemem,
-			HPAGE_SIZE/1024);
+			HPAGE_SIZE/1024,
+			K(committed));
 }
 
 int is_hugepage_mem_enough(size_t size)
@@ -527,6 +531,12 @@ int is_hugepage_mem_enough(size_t size)
 	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
 }
 
+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+int hugetlb_total_pages(void)
+{
+	return htlbzone_pages * (HPAGE_SIZE / PAGE_SIZE);
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/ia64/mm/hugetlbpage.c current/arch/ia64/mm/hugetlbpage.c
--- reference/arch/ia64/mm/hugetlbpage.c	2004-03-11 20:47:12.000000000 +0000
+++ current/arch/ia64/mm/hugetlbpage.c	2004-03-24 18:07:31.000000000 +0000
@@ -17,7 +17,7 @@
 #include <linux/smp_lock.h>
 #include <linux/slab.h>
 #include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <linux/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -576,13 +576,17 @@ __initcall(hugetlb_init);
 
 int hugetlb_report_meminfo(char *buf)
 {
+	int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
-			"Hugepagesize:    %5lu kB\n",
+			"Hugepagesize:    %5lu kB\n"
+			"HugeCommited_AS: %8u kB\n",
 			htlbzone_pages,
 			htlbpagemem,
-			HPAGE_SIZE/1024);
+			HPAGE_SIZE/1024,
+			K(committed));
 }
 
 int is_hugepage_mem_enough(size_t size)
@@ -592,6 +596,12 @@ int is_hugepage_mem_enough(size_t size)
 	return 1;
 }
 
+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+unsigned long hugetlb_total_pages(void)
+{
+	return htlbzone_pages * (HPAGE_SIZE / PAGE_SIZE);
+}
+
 static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int *unused)
 {
 	BUG();
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/ppc64/mm/hugetlbpage.c current/arch/ppc64/mm/hugetlbpage.c
--- reference/arch/ppc64/mm/hugetlbpage.c	2004-03-11 20:47:14.000000000 +0000
+++ current/arch/ppc64/mm/hugetlbpage.c	2004-03-24 18:11:09.000000000 +0000
@@ -17,7 +17,7 @@
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <linux/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -896,13 +896,17 @@ module_init(hugetlb_init);
 
 int hugetlb_report_meminfo(char *buf)
 {
+	int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
 	return sprintf(buf,
 			"HugePages_Total: %5d\n"
 			"HugePages_Free:  %5d\n"
-			"Hugepagesize:    %5lu kB\n",
+			"Hugepagesize:    %5lu kB\n"
+			"HugeCommited_AS: %8u kB",
 			htlbpage_total,
 			htlbpage_free,
-			HPAGE_SIZE/1024);
+			HPAGE_SIZE/1024,
+			K(committed));
 }
 
 /* This is advisory only, so we can get away with accesing
@@ -912,6 +916,12 @@ int is_hugepage_mem_enough(size_t size)
 	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpage_free;
 }
 
+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+int hugetlb_total_pages(void)
+{
+	return htlbpage_total * (HPAGE_SIZE / PAGE_SIZE);
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/sparc64/mm/hugetlbpage.c current/arch/sparc64/mm/hugetlbpage.c
--- reference/arch/sparc64/mm/hugetlbpage.c	2004-01-09 06:59:45.000000000 +0000
+++ current/arch/sparc64/mm/hugetlbpage.c	2004-03-24 18:12:11.000000000 +0000
@@ -13,8 +13,8 @@
 #include <linux/smp_lock.h>
 #include <linux/slab.h>
 #include <linux/sysctl.h>
+#include <linux/mman.h>
 
-#include <asm/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
@@ -483,13 +483,17 @@ module_init(hugetlb_init);
 
 int hugetlb_report_meminfo(char *buf)
 {
+	int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
-			"Hugepagesize:    %5lu kB\n",
+			"Hugepagesize:    %5lu kB\n"
+			"HugeCommited_AS: %8u kB\n",
 			htlbzone_pages,
 			htlbpagemem,
-			HPAGE_SIZE/1024);
+			HPAGE_SIZE/1024,
+			K(committed));
 }
 
 int is_hugepage_mem_enough(size_t size)
@@ -497,6 +501,12 @@ int is_hugepage_mem_enough(size_t size)
 	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
 }
 
+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+int hugetlb_total_pages(void)
+{
+	return htlbzone_pages * (HPAGE_SIZE / PAGE_SIZE);
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the

[-- Attachment #3: 015-do_mremap_warning.txt --]
[-- Type: text/plain, Size: 1286 bytes --]

do_mremap takes a memory commitment about half way through.  Error exits
prior to this point currently check unnecessarily whether we need to release
this memory commitment.  This patch clarifies the exit requirements.

---
 mremap.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff -upN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c	2004-02-23 18:15:13.000000000 +0000
+++ current/mm/mremap.c	2004-03-23 15:29:42.000000000 +0000
@@ -401,7 +401,7 @@ unsigned long do_mremap(unsigned long ad
 	if (vma->vm_flags & VM_ACCOUNT) {
 		charged = (new_len - old_len) >> PAGE_SHIFT;
 		if (security_vm_enough_memory(charged))
-			goto out_nc;
+			goto out;
 	}
 
 	/* old_len exactly to the end of the area..
@@ -426,7 +426,7 @@ unsigned long do_mremap(unsigned long ad
 						   addr + new_len);
 			}
 			ret = addr;
-			goto out;
+			goto out_rc;
 		}
 	}
 
@@ -445,14 +445,14 @@ unsigned long do_mremap(unsigned long ad
 						vma->vm_pgoff, map_flags);
 			ret = new_addr;
 			if (new_addr & ~PAGE_MASK)
-				goto out;
+				goto out_rc;
 		}
 		ret = move_vma(vma, addr, old_len, new_len, new_addr);
 	}
-out:
+out_rc:
 	if (ret & ~PAGE_MASK)
 		vm_unacct_memory(charged);
-out_nc:
+out:
 	return ret;
 }
 

[-- Attachment #4: 050-mem_acctdom_core.txt --]
[-- Type: text/plain, Size: 14853 bytes --]

When hugetlb memory is in use we effectively split memory into
two independent and non-overlapping 'page' pools from which we can
allocate pages and against which we wish to handle commitments.
Currently all allocations are accounted against the normal page pool,
which can lead to false allocation failures.

This patch provides the framework to allow these pools to be treated
separatly, preventing allocation in the hugetlb pool from being accounted
against the small page pool.  The hugetlb page pool is not accounted at all
and effectibly is treated as in overcommit mode.

The patch creates the concept of an accounting domain, against which
pages are to be accounted.  In this implementation there are two
domains: VM_AD_DEFAULT, which is used to account normal small pages
in the normal way, and VM_AD_HUGETLB, which is used to select and
identify VM_HUGETLB pages.  I have not attempted to add any actual
accounting for VM_HUGETLB pages, as currently they are prefaulted and
thus there is always 0 outstanding commitment to track.  Obviously,
if hugetlb were also changed to support demand paging that would
need to be implemented.
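
To make the domain selection concrete, here is a minimal sketch of my
own (not part of the patch) showing how a caller charges a VMA under
this scheme; do_something() is a hypothetical stand-in for whatever
subsequent work might fail:

/* Illustrative only: charge a mapping against its accounting domain.
 * The hugetlb domain always succeeds and carries no real charge. */
static int charge_vma(struct vm_area_struct *vma, long pages)
{
	if (security_vm_enough_memory(VM_ACCTDOM(vma), pages))
		return -ENOMEM;			/* commitment refused */

	if (do_something() < 0) {
		/* only the default domain holds a real charge */
		if (VM_ACCTDOM(vma) == VM_AD_DEFAULT)
			vm_unacct_memory(pages);
		return -EAGAIN;
	}
	return 0;
}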

---
 fs/exec.c                |    2 +-
 include/linux/mm.h       |    6 ++++++
 include/linux/security.h |   15 ++++++++-------
 kernel/fork.c            |    8 +++++---
 mm/memory.c              |    1 +
 mm/mmap.c                |   18 +++++++++++-------
 mm/mprotect.c            |    5 +++--
 mm/mremap.c              |    3 ++-
 mm/shmem.c               |   10 ++++++----
 mm/swapfile.c            |    2 +-
 security/commoncap.c     |    8 +++++++-
 security/dummy.c         |    8 +++++++-
 security/selinux/hooks.c |    8 +++++++-
 13 files changed, 65 insertions(+), 29 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/exec.c current/fs/exec.c
--- reference/fs/exec.c	2004-03-11 20:47:24.000000000 +0000
+++ current/fs/exec.c	2004-03-23 15:29:40.000000000 +0000
@@ -409,7 +409,7 @@ int setup_arg_pages(struct linux_binprm 
 	if (!mpnt)
 		return -ENOMEM;
 
-	if (security_vm_enough_memory(arg_size >> PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, arg_size >> PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/mm.h current/include/linux/mm.h
--- reference/include/linux/mm.h	2004-03-11 20:47:28.000000000 +0000
+++ current/include/linux/mm.h	2004-03-23 15:29:40.000000000 +0000
@@ -112,6 +112,12 @@ struct vm_area_struct {
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 
+/* Memory accounting domains. */
+#define VM_ACCTDOM_NR	2
+#define VM_ACCTDOM(vma) (!!((vma)->vm_flags & VM_HUGETLB))
+#define VM_AD_DEFAULT	0
+#define VM_AD_HUGETLB	1
+
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/security.h current/include/linux/security.h
--- reference/include/linux/security.h	2004-03-11 20:47:28.000000000 +0000
+++ current/include/linux/security.h	2004-03-23 15:29:40.000000000 +0000
@@ -51,7 +51,7 @@ extern int cap_inode_removexattr(struct 
 extern int cap_task_post_setuid (uid_t old_ruid, uid_t old_euid, uid_t old_suid, int flags);
 extern void cap_task_reparent_to_init (struct task_struct *p);
 extern int cap_syslog (int type);
-extern int cap_vm_enough_memory (long pages);
+extern int cap_vm_enough_memory (int domain, long pages);
 
 static inline int cap_netlink_send (struct sk_buff *skb)
 {
@@ -988,7 +988,8 @@ struct swap_info_struct;
  *	@type contains the type of action.
  *	Return 0 if permission is granted.
  * @vm_enough_memory:
- *	Check permissions for allocating a new virtual mapping.
+ *      Check permissions for allocating a new virtual mapping.
+ *      @domain contains the accounting domain.
  *      @pages contains the number of pages.
  *	Return 0 if permission is granted.
  *
@@ -1022,7 +1023,7 @@ struct security_operations {
 	int (*quotactl) (int cmds, int type, int id, struct super_block * sb);
 	int (*quota_on) (struct file * f);
 	int (*syslog) (int type);
-	int (*vm_enough_memory) (long pages);
+	int (*vm_enough_memory) (int domain, long pages);
 
 	int (*bprm_alloc_security) (struct linux_binprm * bprm);
 	void (*bprm_free_security) (struct linux_binprm * bprm);
@@ -1276,9 +1277,9 @@ static inline int security_syslog(int ty
 	return security_ops->syslog(type);
 }
 
-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_memory(int domain, long pages)
 {
-	return security_ops->vm_enough_memory(pages);
+	return security_ops->vm_enough_memory(domain, pages);
 }
 
 static inline int security_bprm_alloc (struct linux_binprm *bprm)
@@ -1947,9 +1948,9 @@ static inline int security_syslog(int ty
 	return cap_syslog(type);
 }
 
-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_memory(int domain, long pages)
 {
-	return cap_vm_enough_memory(pages);
+	return cap_vm_enough_memory(domain, pages);
 }
 
 static inline int security_bprm_alloc (struct linux_binprm *bprm)
diff -X /home/apw/lib/vdiff.excl -rupN reference/kernel/fork.c current/kernel/fork.c
--- reference/kernel/fork.c	2004-03-11 20:47:29.000000000 +0000
+++ current/kernel/fork.c	2004-03-23 16:29:48.000000000 +0000
@@ -301,9 +301,10 @@ static inline int dup_mmap(struct mm_str
 			continue;
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
-			if (security_vm_enough_memory(len))
+			if (security_vm_enough_memory(VM_ACCTDOM(mpnt), len))
 				goto fail_nomem;
-			charge += len;
+			if (VM_ACCTDOM(mpnt) == VM_AD_DEFAULT)
+				charge += len;
 		}
 		tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 		if (!tmp)
@@ -358,7 +359,8 @@ out:
 fail_nomem:
 	retval = -ENOMEM;
 fail:
-	vm_unacct_memory(charge);
+	if (charge)
+		vm_unacct_memory(charge);
 	goto out;
 }
 static inline int mm_alloc_pgd(struct mm_struct * mm)
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c	2004-03-11 20:47:29.000000000 +0000
+++ current/mm/memory.c	2004-03-23 16:29:48.000000000 +0000
@@ -551,6 +551,7 @@ int unmap_vmas(struct mmu_gather **tlbp,
 		if (end <= vma->vm_start)
 			continue;
 
+		/* We assume that only accountable VMAs are VM_ACCOUNT. */
 		if (vma->vm_flags & VM_ACCOUNT)
 			*nr_accounted += (end - start) >> PAGE_SHIFT;
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c	2004-03-11 20:47:29.000000000 +0000
+++ current/mm/mmap.c	2004-03-23 16:29:48.000000000 +0000
@@ -473,8 +473,11 @@ unsigned long do_mmap_pgoff(struct file 
 	int error;
 	struct rb_node ** rb_link, * rb_parent;
 	unsigned long charged = 0;
+	long acctdom = VM_AD_DEFAULT;
 
 	if (file) {
+		if (is_file_hugepages(file))
+			acctdom = VM_AD_HUGETLB;
 		if (!file->f_op || !file->f_op->mmap)
 			return -ENODEV;
 
@@ -591,7 +594,8 @@ munmap_back:
 	    > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
 
-	if (!(flags & MAP_NORESERVE) || sysctl_overcommit_memory > 1) {
+	if (acctdom == VM_AD_DEFAULT && (!(flags & MAP_NORESERVE) || 
+	    sysctl_overcommit_memory > 1)) {
 		if (vm_flags & VM_SHARED) {
 			/* Check memory availability in shmem_file_setup? */
 			vm_flags |= VM_ACCOUNT;
@@ -600,7 +604,7 @@ munmap_back:
 			 * Private writable mapping: check memory availability
 			 */
 			charged = len >> PAGE_SHIFT;
-			if (security_vm_enough_memory(charged))
+			if (security_vm_enough_memory(acctdom, charged))
 				return -ENOMEM;
 			vm_flags |= VM_ACCOUNT;
 		}
@@ -909,8 +913,8 @@ int expand_stack(struct vm_area_struct *
  	spin_lock(&vma->vm_mm->page_table_lock);
 	grow = (address - vma->vm_end) >> PAGE_SHIFT;
 
-	/* Overcommit.. */
-	if (security_vm_enough_memory(grow)) {
+	/* Overcommit ... assume stack is in normal memory */
+	if (security_vm_enough_memory(VM_AD_DEFAULT, grow)) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		return -ENOMEM;
 	}
@@ -963,8 +967,8 @@ int expand_stack(struct vm_area_struct *
  	spin_lock(&vma->vm_mm->page_table_lock);
 	grow = (vma->vm_start - address) >> PAGE_SHIFT;
 
-	/* Overcommit.. */
-	if (security_vm_enough_memory(grow)) {
+	/* Overcommit ... assume stack is in normal memory */
+	if (security_vm_enough_memory(VM_AD_DEFAULT, grow)) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		return -ENOMEM;
 	}
@@ -1361,7 +1365,7 @@ unsigned long do_brk(unsigned long addr,
 	if (mm->map_count > MAX_MAP_COUNT)
 		return -ENOMEM;
 
-	if (security_vm_enough_memory(len >> PAGE_SHIFT))
+	if (security_vm_enough_memory(VM_AD_DEFAULT, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
 	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mprotect.c current/mm/mprotect.c
--- reference/mm/mprotect.c	2004-01-09 06:59:26.000000000 +0000
+++ current/mm/mprotect.c	2004-03-23 16:29:48.000000000 +0000
@@ -173,9 +173,10 @@ mprotect_fixup(struct vm_area_struct *vm
 	 * a MAP_NORESERVE private mapping to writable will now reserve.
 	 */
 	if (newflags & VM_WRITE) {
-		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
+		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED)) &&
+				VM_ACCTDOM(vma) == VM_AD_DEFAULT) {
 			charged = (end - start) >> PAGE_SHIFT;
-			if (security_vm_enough_memory(charged))
+			if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
 				return -ENOMEM;
 			newflags |= VM_ACCOUNT;
 		}
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c	2004-03-23 15:29:42.000000000 +0000
+++ current/mm/mremap.c	2004-03-23 16:29:48.000000000 +0000
@@ -398,9 +398,10 @@ unsigned long do_mremap(unsigned long ad
 	    > current->rlim[RLIMIT_AS].rlim_cur)
 		goto out;
 
+	/* We assume that only accountable VMAs are VM_ACCOUNT. */
 	if (vma->vm_flags & VM_ACCOUNT) {
 		charged = (new_len - old_len) >> PAGE_SHIFT;
-		if (security_vm_enough_memory(charged))
+ 		if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
 			goto out;
 	}
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/shmem.c current/mm/shmem.c
--- reference/mm/shmem.c	2004-02-04 15:09:17.000000000 +0000
+++ current/mm/shmem.c	2004-03-23 15:29:40.000000000 +0000
@@ -526,7 +526,7 @@ static int shmem_notify_change(struct de
 	 	 */
 		change = VM_ACCT(attr->ia_size) - VM_ACCT(inode->i_size);
 		if (change > 0) {
-			if (security_vm_enough_memory(change))
+			if (security_vm_enough_memory(VM_AD_DEFAULT, change))
 				return -ENOMEM;
 		} else if (attr->ia_size < inode->i_size) {
 			vm_unacct_memory(-change);
@@ -1193,7 +1193,8 @@ shmem_file_write(struct file *file, cons
 	maxpos = inode->i_size;
 	if (maxpos < pos + count) {
 		maxpos = pos + count;
-		if (security_vm_enough_memory(VM_ACCT(maxpos) - VM_ACCT(inode->i_size))) {
+		if (security_vm_enough_memory(VM_AD_DEFAULT,
+				VM_ACCT(maxpos) - VM_ACCT(inode->i_size))) {
 			err = -ENOMEM;
 			goto out;
 		}
@@ -1554,7 +1555,7 @@ static int shmem_symlink(struct inode *d
 		memcpy(info, symname, len);
 		inode->i_op = &shmem_symlink_inline_operations;
 	} else {
-		if (security_vm_enough_memory(VM_ACCT(1))) {
+		if (security_vm_enough_memory(VM_AD_DEFAULT, VM_ACCT(1))) {
 			iput(inode);
 			return -ENOMEM;
 		}
@@ -1950,7 +1951,8 @@ struct file *shmem_file_setup(char *name
 	if (size > SHMEM_MAX_BYTES)
 		return ERR_PTR(-EINVAL);
 
-	if ((flags & VM_ACCOUNT) && security_vm_enough_memory(VM_ACCT(size)))
+	if ((flags & VM_ACCOUNT) && security_vm_enough_memory(VM_AD_DEFAULT,
+			VM_ACCT(size)))
 		return ERR_PTR(-ENOMEM);
 
 	error = -ENOMEM;
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/swapfile.c current/mm/swapfile.c
--- reference/mm/swapfile.c	2004-02-23 18:15:13.000000000 +0000
+++ current/mm/swapfile.c	2004-03-23 15:29:40.000000000 +0000
@@ -1048,7 +1048,7 @@ asmlinkage long sys_swapoff(const char _
 		swap_list_unlock();
 		goto out_dput;
 	}
-	if (!security_vm_enough_memory(p->pages))
+	if (!security_vm_enough_memory(VM_AD_DEFAULT, p->pages))
 		vm_unacct_memory(p->pages);
 	else {
 		err = -ENOMEM;
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-03-23 15:29:41.000000000 +0000
+++ current/security/commoncap.c	2004-03-23 15:29:40.000000000 +0000
@@ -308,10 +308,16 @@ int cap_syslog (int type)
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
  */
-int cap_vm_enough_memory(long pages)
+int cap_vm_enough_memory(int domain, long pages)
 {
 	unsigned long free, allowed;
 
+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);
 
         /*
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-23 15:29:41.000000000 +0000
+++ current/security/dummy.c	2004-03-23 15:29:40.000000000 +0000
@@ -109,10 +109,16 @@ static int dummy_syslog (int type)
  * We currently support three overcommit policies, which are set via the
  * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  */
-static int dummy_vm_enough_memory(long pages)
+static int dummy_vm_enough_memory(int domain, long pages)
 {
 	unsigned long free, allowed;
 
+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);
 
         /*
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-23 15:29:41.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-23 15:29:40.000000000 +0000
@@ -1497,12 +1497,18 @@ static int selinux_syslog(int type)
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
  */
-static int selinux_vm_enough_memory(long pages)
+static int selinux_vm_enough_memory(int domain, long pages)
 {
 	unsigned long free, allowed;
 	int rc;
 	struct task_security_struct *tsec = current->security;
 
+	/* We only account for the default memory domain, assume overcommit
+	 * for all others.
+	 */
+	if (domain != VM_AD_DEFAULT)
+		return 0;
+
 	vm_acct_memory(pages);
 
         /*

[-- Attachment #5: 055-mem_acctdom_arch.txt --]
[-- Type: text/plain, Size: 2699 bytes --]

---
 ia64/ia32/binfmt_elf32.c  |    3 ++-
 mips/kernel/sysirix.c     |    3 ++-
 s390/kernel/compat_exec.c |    3 ++-
 x86_64/ia32/ia32_binfmt.c |    3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff -upN reference/arch/ia64/ia32/binfmt_elf32.c current/arch/ia64/ia32/binfmt_elf32.c
--- reference/arch/ia64/ia32/binfmt_elf32.c	2004-03-11 20:47:12.000000000 +0000
+++ current/arch/ia64/ia32/binfmt_elf32.c	2004-03-23 15:29:42.000000000 +0000
@@ -168,7 +168,8 @@ ia32_setup_arg_pages (struct linux_binpr
 	if (!mpnt)
 		return -ENOMEM;
 
-	if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, (IA32_STACK_TOP -
+			(PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}
diff -upN reference/arch/mips/kernel/sysirix.c current/arch/mips/kernel/sysirix.c
--- reference/arch/mips/kernel/sysirix.c	2004-03-11 20:47:13.000000000 +0000
+++ current/arch/mips/kernel/sysirix.c	2004-03-23 15:29:42.000000000 +0000
@@ -578,7 +578,8 @@ asmlinkage int irix_brk(unsigned long br
 	/*
 	 * Check if we have enough memory..
 	 */
-	if (security_vm_enough_memory((newbrk-oldbrk) >> PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT,
+			(newbrk-oldbrk) >> PAGE_SHIFT)) {
 		ret = -ENOMEM;
 		goto out;
 	}
diff -upN reference/arch/s390/kernel/compat_exec.c current/arch/s390/kernel/compat_exec.c
--- reference/arch/s390/kernel/compat_exec.c	2004-01-09 06:59:57.000000000 +0000
+++ current/arch/s390/kernel/compat_exec.c	2004-03-23 15:29:42.000000000 +0000
@@ -56,7 +56,8 @@ int setup_arg_pages32(struct linux_binpr
 	if (!mpnt) 
 		return -ENOMEM; 

-	if (security_vm_enough_memory((STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, (STACK_TOP -
+			(PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}
diff -upN reference/arch/x86_64/ia32/ia32_binfmt.c current/arch/x86_64/ia32/ia32_binfmt.c
--- reference/arch/x86_64/ia32/ia32_binfmt.c	2004-03-11 20:47:15.000000000 +0000
+++ current/arch/x86_64/ia32/ia32_binfmt.c	2004-03-23 15:29:42.000000000 +0000
@@ -345,7 +345,8 @@ int setup_arg_pages(struct linux_binprm 
 	if (!mpnt) 
 		return -ENOMEM; 

-	if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+	if (security_vm_enough_memory(VM_AD_DEFAULT, (IA32_STACK_TOP -
+			(PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
 		kmem_cache_free(vm_area_cachep, mpnt);
 		return -ENOMEM;
 	}

[-- Attachment #6: 060-mem_acctdom_commitments.txt --]
[-- Type: text/plain, Size: 18940 bytes --]

Currently only normal page commitments are tracked.  This patch
provides a framework for tracking page commitments in multiple
independent domains.  With this patch vm_committed_space becomes a
per-domain trackable quantity.
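
As an illustration of my own (not from the patch), a caller tearing
down an address space would accumulate per-domain counts in a madv_t
vector and release both domains in a single call:

/* Illustrative only: batch per-domain unaccounting with the madv_t
 * helpers introduced below. */
static void unacct_mm(struct mm_struct *mm)
{
	madv_t nr_accounted = MADV_NONE;
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next)
		if (vma->vm_flags & VM_ACCOUNT)
			madv_add(&nr_accounted, VM_ACCTDOM(vma),
				 (vma->vm_end - vma->vm_start) >> PAGE_SHIFT);

	vm_unacct_memory_domains(&nr_accounted);
}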

---
 fs/proc/proc_misc.c      |    2 +-
 include/linux/mm.h       |   13 +++++++++++--
 include/linux/mman.h     |   12 ++++++------
 kernel/fork.c            |    8 +++-----
 mm/memory.c              |   12 +++++++++---
 mm/mmap.c                |   23 ++++++++++++-----------
 mm/mprotect.c            |    5 ++---
 mm/mremap.c              |    2 +-
 mm/nommu.c               |    3 ++-
 mm/shmem.c               |   13 +++++++------
 mm/swap.c                |   17 +++++++++++++----
 mm/swapfile.c            |    4 +++-
 security/commoncap.c     |   10 +++++-----
 security/dummy.c         |   10 +++++-----
 security/selinux/hooks.c |   10 +++++-----
 15 files changed, 85 insertions(+), 59 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/proc/proc_misc.c current/fs/proc/proc_misc.c
--- reference/fs/proc/proc_misc.c	2004-03-11 20:47:27.000000000 +0000
+++ current/fs/proc/proc_misc.c	2004-03-24 16:09:07.000000000 +0000
@@ -174,7 +174,7 @@ static int meminfo_read_proc(char *page,
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = atomic_read(&vm_committed_space);
+	committed = atomic_read(&vm_committed_space[VM_AD_DEFAULT]);
 
 	vmtot = (VMALLOC_END-VMALLOC_START)>>10;
 	vmi = get_vmalloc_info();
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/mman.h current/include/linux/mman.h
--- reference/include/linux/mman.h	2004-01-09 06:59:09.000000000 +0000
+++ current/include/linux/mman.h	2004-03-24 16:09:07.000000000 +0000
@@ -12,20 +12,20 @@
 
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
-extern atomic_t vm_committed_space;
+extern atomic_t vm_committed_space[];
 
 #ifdef CONFIG_SMP
-extern void vm_acct_memory(long pages);
+extern void vm_acct_memory(int domain, long pages);
 #else
-static inline void vm_acct_memory(long pages)
+static inline void vm_acct_memory(int domain, long pages)
 {
-	atomic_add(pages, &vm_committed_space);
+	atomic_add(pages, &vm_committed_space[domain]);
 }
 #endif
 
-static inline void vm_unacct_memory(long pages)
+static inline void vm_unacct_memory(int domain, long pages)
 {
-	vm_acct_memory(-pages);
+	vm_acct_memory(domain, -pages);
 }
 
 /*
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/mm.h current/include/linux/mm.h
--- reference/include/linux/mm.h	2004-03-23 16:30:13.000000000 +0000
+++ current/include/linux/mm.h	2004-03-24 16:09:07.000000000 +0000
@@ -117,7 +117,16 @@ struct vm_area_struct {
 #define VM_ACCTDOM(vma) (!!((vma)->vm_flags & VM_HUGETLB))
 #define VM_AD_DEFAULT	0
 #define VM_AD_HUGETLB	1
-
+typedef struct {
+	long vec[VM_ACCTDOM_NR];
+} madv_t;
+#define MADV_NONE { {[0 ... VM_ACCTDOM_NR-1] =  0UL} }
+static inline void madv_add(madv_t *madv, int domain, long size)
+{
+	madv->vec[domain] += size;
+}
+void vm_unacct_memory_domains(madv_t *madv);
+  
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
@@ -440,7 +449,7 @@ void zap_page_range(struct vm_area_struc
 			unsigned long size);
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted);
+		unsigned long end_addr, madv_t *nr_accounted);
 void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			unsigned long address, unsigned long size);
 void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
diff -X /home/apw/lib/vdiff.excl -rupN reference/kernel/fork.c current/kernel/fork.c
--- reference/kernel/fork.c	2004-03-23 16:30:13.000000000 +0000
+++ current/kernel/fork.c	2004-03-24 16:09:07.000000000 +0000
@@ -267,7 +267,7 @@ static inline int dup_mmap(struct mm_str
 	struct vm_area_struct * mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
-	unsigned long charge = 0;
+	madv_t charge = MADV_NONE;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_mm(current->mm);
@@ -303,8 +303,7 @@ static inline int dup_mmap(struct mm_str
 			unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 			if (security_vm_enough_memory(VM_ACCTDOM(mpnt), len))
 				goto fail_nomem;
-			if (VM_ACCTDOM(mpnt) == VM_AD_DEFAULT)
-				charge += len;
+ 			madv_add(&charge, VM_ACCTDOM(mpnt), len);
 		}
 		tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 		if (!tmp)
@@ -359,8 +358,7 @@ out:
 fail_nomem:
 	retval = -ENOMEM;
 fail:
-	if (charge)
-		vm_unacct_memory(charge);
+	vm_unacct_memory_domains(&charge);
 	goto out;
 }
 static inline int mm_alloc_pgd(struct mm_struct * mm)
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c	2004-03-23 16:30:13.000000000 +0000
+++ current/mm/memory.c	2004-03-24 16:09:07.000000000 +0000
@@ -524,7 +524,7 @@ void unmap_page_range(struct mmu_gather 
  */
 int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		unsigned long end_addr, unsigned long *nr_accounted)
+		unsigned long end_addr, madv_t *nr_accounted)
 {
 	unsigned long zap_bytes = ZAP_BLOCK_SIZE;
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
@@ -553,7 +553,8 @@ int unmap_vmas(struct mmu_gather **tlbp,
 
 		/* We assume that only accountable VMAs are VM_ACCOUNT. */
 		if (vma->vm_flags & VM_ACCOUNT)
-			*nr_accounted += (end - start) >> PAGE_SHIFT;
+			madv_add(nr_accounted,
+				VM_ACCTDOM(vma), (end - start) >> PAGE_SHIFT);
 
 		ret++;
 		while (start != end) {
@@ -602,7 +603,12 @@ void zap_page_range(struct vm_area_struc
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather *tlb;
 	unsigned long end = address + size;
-	unsigned long nr_accounted = 0;
+	madv_t nr_accounted = MADV_NONE;
+
+	/* XXX: we seem to avoid thinking about the memory accounting
+	 * both for hugepages, where we don't even bother tracking it, and
+	 * in the normal path, where we figure it out and do nothing with it??
+	 */
 
 	might_sleep();
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c	2004-03-23 16:30:13.000000000 +0000
+++ current/mm/mmap.c	2004-03-24 16:09:07.000000000 +0000
@@ -54,7 +54,8 @@ pgprot_t protection_map[16] = {
 
 int sysctl_overcommit_memory = 0;	/* default is heuristic overcommit */
 int sysctl_overcommit_ratio = 50;	/* default is 50% */
-atomic_t vm_committed_space = ATOMIC_INIT(0);
+atomic_t vm_committed_space[VM_ACCTDOM_NR] = 
+     { [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };
 
 EXPORT_SYMBOL(sysctl_overcommit_memory);
 EXPORT_SYMBOL(sysctl_overcommit_ratio);
@@ -594,8 +595,8 @@ munmap_back:
 	    > current->rlim[RLIMIT_AS].rlim_cur)
 		return -ENOMEM;
 
-	if (acctdom == VM_AD_DEFAULT && (!(flags & MAP_NORESERVE) || 
-	    sysctl_overcommit_memory > 1)) {
+	if (!(flags & MAP_NORESERVE) || 
+	    (acctdom == VM_AD_DEFAULT && sysctl_overcommit_memory > 1)) {
 		if (vm_flags & VM_SHARED) {
 			/* Check memory availability in shmem_file_setup? */
 			vm_flags |= VM_ACCOUNT;
@@ -713,7 +714,7 @@ free_vma:
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
-		vm_unacct_memory(charged);
+		vm_unacct_memory(acctdom, charged);
 	return error;
 }
 
@@ -923,7 +924,7 @@ int expand_stack(struct vm_area_struct *
 			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
 			current->rlim[RLIMIT_AS].rlim_cur) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
-		vm_unacct_memory(grow);
+		vm_unacct_memory(VM_AD_DEFAULT, grow);
 		return -ENOMEM;
 	}
 	vma->vm_end = address;
@@ -977,7 +978,7 @@ int expand_stack(struct vm_area_struct *
 			((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
 			current->rlim[RLIMIT_AS].rlim_cur) {
 		spin_unlock(&vma->vm_mm->page_table_lock);
-		vm_unacct_memory(grow);
+		vm_unacct_memory(VM_AD_DEFAULT, grow);
 		return -ENOMEM;
 	}
 	vma->vm_start = address;
@@ -1135,12 +1136,12 @@ static void unmap_region(struct mm_struc
 	unsigned long end)
 {
 	struct mmu_gather *tlb;
-	unsigned long nr_accounted = 0;
+	madv_t nr_accounted = MADV_NONE;
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted);
-	vm_unacct_memory(nr_accounted);
+	vm_unacct_memory_domains(&nr_accounted);
 
 	if (is_hugepage_only_range(start, end - start))
 		hugetlb_free_pgtables(tlb, prev, start, end);
@@ -1380,7 +1381,7 @@ unsigned long do_brk(unsigned long addr,
 	 */
 	vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 	if (!vma) {
-		vm_unacct_memory(len >> PAGE_SHIFT);
+		vm_unacct_memory(VM_AD_DEFAULT, len >> PAGE_SHIFT);
 		return -ENOMEM;
 	}
 
@@ -1413,7 +1414,7 @@ void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather *tlb;
 	struct vm_area_struct *vma;
-	unsigned long nr_accounted = 0;
+	madv_t nr_accounted = MADV_NONE;
 
 	profile_exit_mmap(mm);
  
@@ -1426,7 +1427,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use ~0UL here to ensure all VMAs in the mm are unmapped */
 	mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
 					~0UL, &nr_accounted);
-	vm_unacct_memory(nr_accounted);
+	vm_unacct_memory_domains(&nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
 	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mprotect.c current/mm/mprotect.c
--- reference/mm/mprotect.c	2004-03-23 16:30:13.000000000 +0000
+++ current/mm/mprotect.c	2004-03-24 16:09:07.000000000 +0000
@@ -173,8 +173,7 @@ mprotect_fixup(struct vm_area_struct *vm
 	 * a MAP_NORESERVE private mapping to writable will now reserve.
 	 */
 	if (newflags & VM_WRITE) {
-		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED)) &&
-				VM_ACCTDOM(vma) == VM_AD_DEFAULT) {
+		if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
 			charged = (end - start) >> PAGE_SHIFT;
 			if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
 				return -ENOMEM;
@@ -218,7 +217,7 @@ success:
 	return 0;
 
 fail:
-	vm_unacct_memory(charged);
+	vm_unacct_memory(VM_ACCTDOM(vma), charged);
 	return error;
 }
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c	2004-03-23 16:30:13.000000000 +0000
+++ current/mm/mremap.c	2004-03-24 16:09:07.000000000 +0000
@@ -452,7 +452,7 @@ unsigned long do_mremap(unsigned long ad
 	}
 out_rc:
 	if (ret & ~PAGE_MASK)
-		vm_unacct_memory(charged);
+		vm_unacct_memory(VM_ACCTDOM(vma), charged);
 out:
 	return ret;
 }
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/nommu.c current/mm/nommu.c
--- reference/mm/nommu.c	2004-02-04 15:09:16.000000000 +0000
+++ current/mm/nommu.c	2004-03-24 16:09:07.000000000 +0000
@@ -29,7 +29,8 @@ struct page *mem_map;
 unsigned long max_mapnr;
 unsigned long num_physpages;
 unsigned long askedalloc, realalloc;
-atomic_t vm_committed_space = ATOMIC_INIT(0);
+atomic_t vm_committed_space[VM_ACCTDOM_NR] = 
+     { [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };
 int sysctl_overcommit_memory; /* default is heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/shmem.c current/mm/shmem.c
--- reference/mm/shmem.c	2004-03-23 16:30:13.000000000 +0000
+++ current/mm/shmem.c	2004-03-24 16:09:07.000000000 +0000
@@ -529,7 +529,7 @@ static int shmem_notify_change(struct de
 			if (security_vm_enough_memory(VM_AD_DEFAULT, change))
 				return -ENOMEM;
 		} else if (attr->ia_size < inode->i_size) {
-			vm_unacct_memory(-change);
+			vm_unacct_memory(VM_AD_DEFAULT, -change);
 			/*
 			 * If truncating down to a partial page, then
 			 * if that page is already allocated, hold it
@@ -564,7 +564,7 @@ static int shmem_notify_change(struct de
 	if (page)
 		page_cache_release(page);
 	if (error)
-		vm_unacct_memory(change);
+		vm_unacct_memory(VM_AD_DEFAULT, change);
 	return error;
 }
 
@@ -578,7 +578,7 @@ static void shmem_delete_inode(struct in
 		list_del(&info->list);
 		spin_unlock(&shmem_ilock);
 		if (info->flags & VM_ACCOUNT)
-			vm_unacct_memory(VM_ACCT(inode->i_size));
+			vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(inode->i_size));
 		inode->i_size = 0;
 		shmem_truncate(inode);
 	}
@@ -1274,7 +1274,8 @@ shmem_file_write(struct file *file, cons
 
 	/* Short writes give back address space */
 	if (inode->i_size != maxpos)
-		vm_unacct_memory(VM_ACCT(maxpos) - VM_ACCT(inode->i_size));
+		vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(maxpos) -
+			VM_ACCT(inode->i_size));
 out:
 	up(&inode->i_sem);
 	return err;
@@ -1561,7 +1562,7 @@ static int shmem_symlink(struct inode *d
 		}
 		error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL);
 		if (error) {
-			vm_unacct_memory(VM_ACCT(1));
+			vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(1));
 			iput(inode);
 			return error;
 		}
@@ -1991,7 +1992,7 @@ put_dentry:
 	dput(dentry);
 put_memory:
 	if (flags & VM_ACCOUNT)
-		vm_unacct_memory(VM_ACCT(size));
+		vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(size));
 	return ERR_PTR(error);
 }
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/swap.c current/mm/swap.c
--- reference/mm/swap.c	2004-03-11 20:47:29.000000000 +0000
+++ current/mm/swap.c	2004-03-24 16:09:07.000000000 +0000
@@ -365,17 +365,18 @@ unsigned int pagevec_lookup(struct pagev
  */
 #define ACCT_THRESHOLD	max(16, NR_CPUS * 2)
 
-static DEFINE_PER_CPU(long, committed_space) = 0;
+/* XXX: zero this????? */
+static DEFINE_PER_CPU(long, committed_space[VM_ACCTDOM_NR]);
 
-void vm_acct_memory(long pages)
+void vm_acct_memory(int domain, long pages)
 {
 	long *local;
 
 	preempt_disable();
-	local = &__get_cpu_var(committed_space);
+	local = &__get_cpu_var(committed_space[domain]);
 	*local += pages;
 	if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
-		atomic_add(*local, &vm_committed_space);
+		atomic_add(*local, &vm_committed_space[domain]);
 		*local = 0;
 	}
 	preempt_enable();
@@ -383,6 +384,14 @@ void vm_acct_memory(long pages)
 EXPORT_SYMBOL(vm_acct_memory);
 #endif
 
+void vm_unacct_memory_domains(madv_t *adv)
+{
+	if (adv->vec[0])
+		vm_unacct_memory(VM_AD_DEFAULT, adv->vec[0]);
+	if (adv->vec[1])
+		vm_unacct_memory(VM_AD_HUGETLB, adv->vec[1]);
+}
+
 #ifdef CONFIG_SMP
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
 {
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/swapfile.c current/mm/swapfile.c
--- reference/mm/swapfile.c	2004-03-23 16:30:13.000000000 +0000
+++ current/mm/swapfile.c	2004-03-24 16:09:07.000000000 +0000
@@ -1048,8 +1048,10 @@ asmlinkage long sys_swapoff(const char _
 		swap_list_unlock();
 		goto out_dput;
 	}
+	/* There is an assumption here that we may only have swapped things
+	 * from the default memory accounting domain to this device. */
 	if (!security_vm_enough_memory(VM_AD_DEFAULT, p->pages))
-		vm_unacct_memory(p->pages);
+		vm_unacct_memory(VM_AD_DEFAULT, p->pages);
 	else {
 		err = -ENOMEM;
 		swap_list_unlock();
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-03-23 16:30:13.000000000 +0000
+++ current/security/commoncap.c	2004-03-24 16:09:07.000000000 +0000
@@ -312,14 +312,14 @@ int cap_vm_enough_memory(int domain, lon
 {
 	unsigned long free, allowed;
 
+	vm_acct_memory(domain, pages);
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
 	if (domain != VM_AD_DEFAULT)
 		return 0;
 
-	vm_acct_memory(pages);
-
         /*
 	 * Sometimes we want to use more memory than we have
 	 */
@@ -360,17 +360,17 @@ int cap_vm_enough_memory(int domain, lon
 
 		if (free > pages)
 			return 0;
-		vm_unacct_memory(pages);
+		vm_unacct_memory(domain, pages);
 		return -ENOMEM;
 	}
 
 	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
-	if (atomic_read(&vm_committed_space) < allowed)
+	if (atomic_read(&vm_committed_space[domain]) < allowed)
 		return 0;
 
-	vm_unacct_memory(pages);
+	vm_unacct_memory(domain, pages);
 
 	return -ENOMEM;
 }
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-23 16:30:13.000000000 +0000
+++ current/security/dummy.c	2004-03-24 17:56:16.000000000 +0000
@@ -113,14 +113,14 @@ static int dummy_vm_enough_memory(int do
 {
 	unsigned long free, allowed;
 
+	vm_acct_memory(domain, pages);
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
 	if (domain != VM_AD_DEFAULT)
 		return 0;
 
-	vm_acct_memory(pages);
-
         /*
 	 * Sometimes we want to use more memory than we have
 	 */
@@ -148,17 +148,17 @@ static int dummy_vm_enough_memory(int do
 
 		if (free > pages)
 			return 0;
-		vm_unacct_memory(pages);
+		vm_unacct_memory(domain, pages);
 		return -ENOMEM;
 	}
 
 	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
-	if (atomic_read(&vm_committed_space) < allowed)
+	if (atomic_read(&vm_committed_space[domain]) < allowed)
 		return 0;
 
-	vm_unacct_memory(pages);
+	vm_unacct_memory(domain, pages);
 
 	return -ENOMEM;
 }
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-23 16:30:13.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-24 17:56:28.000000000 +0000
@@ -1503,14 +1503,14 @@ static int selinux_vm_enough_memory(int 
 	int rc;
 	struct task_security_struct *tsec = current->security;
 
+	vm_acct_memory(domain, pages);
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
 	if (domain != VM_AD_DEFAULT)
 		return 0;
 
-	vm_acct_memory(pages);
-
         /*
 	 * Sometimes we want to use more memory than we have
 	 */
@@ -1547,17 +1547,17 @@ static int selinux_vm_enough_memory(int 
 
 		if (free > pages)
 			return 0;
-		vm_unacct_memory(pages);
+		vm_unacct_memory(domain, pages);
 		return -ENOMEM;
 	}
 
 	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
-	if (atomic_read(&vm_committed_space) < allowed)
+	if (atomic_read(&vm_committed_space[domain]) < allowed)
 		return 0;
 
-	vm_unacct_memory(pages);
+	vm_unacct_memory(domain, pages);
 
 	return -ENOMEM;
 }

[-- Attachment #7: 070-mem_acctdom_hugetlb.txt --]
[-- Type: text/plain, Size: 7437 bytes --]

---
 fs/hugetlbfs/inode.c     |   44 ++++++++++++++++++++++++++++++++++++++------
 include/linux/hugetlb.h  |    5 +++++
 security/commoncap.c     |    8 ++++++++
 security/dummy.c         |    7 +++++++
 security/selinux/hooks.c |    7 +++++++
 5 files changed, 65 insertions(+), 6 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c	2004-02-23 18:15:01.000000000 +0000
+++ current/fs/hugetlbfs/inode.c	2004-03-24 17:57:36.000000000 +0000
@@ -26,12 +26,15 @@
 #include <linux/dnotify.h>
 #include <linux/statfs.h>
 #include <linux/security.h>
+#include <linux/mman.h>
 
 #include <asm/uaccess.h>
 
 /* some random number */
 #define HUGETLBFS_MAGIC	0x958458f6
 
+#define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
+
 static struct super_operations hugetlbfs_ops;
 static struct address_space_operations hugetlbfs_aops;
 struct file_operations hugetlbfs_file_operations;
@@ -191,6 +194,7 @@ void truncate_hugepages(struct address_s
 static void hugetlbfs_delete_inode(struct inode *inode)
 {
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb);
+	long change;
 
 	hlist_del_init(&inode->i_hash);
 	list_del_init(&inode->i_list);
@@ -198,6 +202,9 @@ static void hugetlbfs_delete_inode(struc
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
 
+	change = VM_ACCT(inode->i_size) - VM_ACCT(0);
+	if (change)
+		vm_unacct_memory(VM_AD_HUGETLB, change);
 	if (inode->i_data.nrpages)
 		truncate_hugepages(&inode->i_data, 0);
 
@@ -217,6 +224,7 @@ static void hugetlbfs_forget_inode(struc
 {
 	struct super_block *super_block = inode->i_sb;
 	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(super_block);
+	long change;
 
 	if (hlist_unhashed(&inode->i_hash))
 		goto out_truncate;
@@ -239,6 +247,9 @@ out_truncate:
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	change = VM_ACCT(inode->i_size) - VM_ACCT(0);
+	if (change)
+		vm_unacct_memory(VM_AD_HUGETLB, change);
 	if (inode->i_data.nrpages)
 		truncate_hugepages(&inode->i_data, 0);
 
@@ -312,8 +323,10 @@ static int hugetlb_vmtruncate(struct ino
 	unsigned long pgoff;
 	struct address_space *mapping = inode->i_mapping;
 
+	/*
 	if (offset > inode->i_size)
 		return -EINVAL;
+	*/
 
 	BUG_ON(offset & ~HPAGE_MASK);
 	pgoff = offset >> HPAGE_SHIFT;
@@ -334,6 +347,8 @@ static int hugetlbfs_setattr(struct dent
 	struct inode *inode = dentry->d_inode;
 	int error;
 	unsigned int ia_valid = attr->ia_valid;
+	long change = 0;
+	loff_t csize;
 
 	BUG_ON(!inode);
 
@@ -345,15 +360,27 @@ static int hugetlbfs_setattr(struct dent
 	if (error)
 		goto out;
 	if (ia_valid & ATTR_SIZE) {
+		csize = i_size_read(inode);
 		error = -EINVAL;
-		if (!(attr->ia_size & ~HPAGE_MASK))
-			error = hugetlb_vmtruncate(inode, attr->ia_size);
-		if (error)
+		if (attr->ia_size & ~HPAGE_MASK)
+			goto out;
+		if (attr->ia_size > csize)
 			goto out;
+		change = VM_ACCT(csize) - VM_ACCT(attr->ia_size);
+		if (change)
+			vm_unacct_memory(VM_AD_HUGETLB, change);
+		/* XXX: here we commit to removing the mappings, should we do
+		 * this before we attmempt to write the inode or after.  What
+		 * should we do if it fails?
+		 */
+		hugetlb_vmtruncate(inode, attr->ia_size);
 		attr->ia_valid &= ~ATTR_SIZE;
 	}
 	error = inode_setattr(inode, attr);
 out:
+	if (error && change)
+		vm_acct_memory(VM_AD_HUGETLB, change);
+
 	return error;
 }
 
@@ -697,8 +724,9 @@ struct file *hugetlb_zero_setup(size_t s
 	if (!capable(CAP_IPC_LOCK))
 		return ERR_PTR(-EPERM);
 
-	if (!is_hugepage_mem_enough(size))
+	if (security_vm_enough_memory(VM_AD_HUGETLB, VM_ACCT(size)))
 		return ERR_PTR(-ENOMEM);
+
 	n = atomic_read(&hugetlbfs_counter);
 	atomic_inc(&hugetlbfs_counter);
 
@@ -708,8 +736,10 @@ struct file *hugetlb_zero_setup(size_t s
 	quick_string.len = strlen(quick_string.name);
 	quick_string.hash = 0;
 	dentry = d_alloc(root, &quick_string);
-	if (!dentry)
-		return ERR_PTR(-ENOMEM);
+	if (!dentry) {
+		error = -ENOMEM;
+		goto out_committed;
+	}
 
 	error = -ENFILE;
 	file = get_empty_filp();
@@ -736,6 +766,8 @@ out_file:
 	put_filp(file);
 out_dentry:
 	dput(dentry);
+out_committed:
+	vm_unacct_memory(VM_AD_HUGETLB, VM_ACCT(size));
 	return ERR_PTR(error);
 }
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h	2004-02-23 18:15:09.000000000 +0000
+++ current/include/linux/hugetlb.h	2004-03-24 18:01:27.000000000 +0000
@@ -19,6 +19,7 @@ int hugetlb_prefault(struct address_spac
 void huge_page_release(struct page *);
 int hugetlb_report_meminfo(char *);
 int is_hugepage_mem_enough(size_t);
+unsigned long hugetlb_total_pages(void);
 struct page *follow_huge_addr(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write);
 struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
@@ -48,6 +49,10 @@ static inline int is_vm_hugetlb_page(str
 {
 	return 0;
 }
+static inline unsigned long hugetlb_total_pages(void)
+{
+	return 0;
+}
 
 #define follow_hugetlb_page(m,v,p,vs,a,b,i)	({ BUG(); 0; })
 #define follow_huge_addr(mm, vma, addr, write)	0
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-03-24 17:56:50.000000000 +0000
+++ current/security/commoncap.c	2004-03-24 17:57:36.000000000 +0000
@@ -314,6 +314,13 @@ int cap_vm_enough_memory(int domain, lon
 
 	vm_acct_memory(domain, pages);
 
+	/* Check against the full complement of hugepages, no reserve. */
+	if (domain == VM_AD_HUGETLB) {
+		allowed = hugetlb_total_pages();
+
+		goto check;
+	}
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
@@ -367,6 +374,7 @@ int cap_vm_enough_memory(int domain, lon
 	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
+check:
 	if (atomic_read(&vm_committed_space[domain]) < allowed)
 		return 0;
 
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-24 17:56:50.000000000 +0000
+++ current/security/dummy.c	2004-03-24 17:57:36.000000000 +0000
@@ -115,6 +115,13 @@ static int dummy_vm_enough_memory(int do
 
 	vm_acct_memory(domain, pages);
 
+	/* Check against the full complement of hugepages, no reserve. */
+	if (domain == VM_AD_HUGETLB) {
+		allowed = hugetlb_total_pages();
+
+		goto check;
+	}
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-24 17:56:50.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-24 17:57:36.000000000 +0000
@@ -1505,6 +1505,13 @@ static int selinux_vm_enough_memory(int 
 
 	vm_acct_memory(domain, pages);
 
+	/* Check against the full complement of hugepages, no reserve. */
+	if (domain == VM_AD_HUGETLB) {
+		allowed = hugetlb_total_pages();
+
+		goto check;
+	}
+
 	/* We only account for the default memory domain, assume overcommit
 	 * for all others.
 	 */

[-- Attachment #8: 010-overcommit_docs.txt --]
[-- Type: text/plain, Size: 2200 bytes --]

---
 commoncap.c     |    2 +-
 dummy.c         |    8 ++++++++
 selinux/hooks.c |    2 +-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff -upN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c	2004-02-23 18:15:19.000000000 +0000
+++ current/security/commoncap.c	2004-03-23 15:29:41.000000000 +0000
@@ -303,7 +303,7 @@ int cap_syslog (int type)
  * succeed and -ENOMEM implies there is not.
  *
  * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-acounting
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
diff -upN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c	2004-03-11 20:47:31.000000000 +0000
+++ current/security/dummy.c	2004-03-23 15:29:41.000000000 +0000
@@ -101,6 +101,14 @@ static int dummy_syslog (int type)
 	return 0;
 }
 
+/*
+ * Check that a process has enough memory to allocate a new virtual
+ * mapping. 0 means there is enough memory for the allocation to
+ * succeed and -ENOMEM implies there is not.
+ *
+ * We currently support three overcommit policies, which are set via the
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
+ */
 static int dummy_vm_enough_memory(long pages)
 {
 	unsigned long free, allowed;
diff -upN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c	2004-03-11 20:47:31.000000000 +0000
+++ current/security/selinux/hooks.c	2004-03-23 15:29:41.000000000 +0000
@@ -1492,7 +1492,7 @@ static int selinux_syslog(int type)
  * succeed and -ENOMEM implies there is not.
  *
  * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-acounting
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Hugetlbpages in very large memory machines.......
  2004-03-16  3:15       ` Nobuhiko Yoshida
@ 2004-04-01  9:10         ` Nobuhiko Yoshida
  0 siblings, 0 replies; 32+ messages in thread
From: Nobuhiko Yoshida @ 2004-04-01  9:10 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel
  Cc: raybry, lse-tech, linux-ia64, Hirokazu Takahashi, lhms-devel

[-- Attachment #1: Type: text/plain; charset=us-ascii, Size: 8783 bytes --]

Nobuhiko Yoshida <n-yoshida@pst.fujitsu.com> wrote:
> Hello,
> 
> > > +/*      update_mmu_cache(vma, address, *pte); */
> > 
> > I have not studied low level IA64 VM in detail, but don't you need
> > some kind of TLB flush here?
> 
> Oh! Yes.
> Perhaps, TLB flush is needed here.

- Below is a revised version of the patch that I contributed before.
- I have added the flush of the TLB and icache.

How To Use:
   1. Download linux-2.6.0 source tree
   2. Apply the below patch for linux-2.6.0

Thank you,
Nobuhiko Yoshida

diff -dupr linux-2.6.0/arch/i386/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/i386/mm/hugetlbpage.c
--- linux-2.6.0/arch/i386/mm/hugetlbpage.c  2003-12-18 11:59:38.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/i386/mm/hugetlbpage.c  2004-04-01 11:48:56.000000000 +0900
@@ -142,8 +142,10 @@ int copy_hugetlb_page_range(struct mm_st
            goto nomem;
        src_pte = huge_pte_offset(src, addr);
        entry = *src_pte;
-       ptepage = pte_page(entry);
-       get_page(ptepage);
+       if (!pte_none(entry)) {
+           ptepage = pte_page(entry);
+           get_page(ptepage);
+       }
        set_pte(dst_pte, entry);
        dst->rss += (HPAGE_SIZE / PAGE_SIZE);
        addr += HPAGE_SIZE;
@@ -173,6 +175,11 @@ follow_hugetlb_page(struct mm_struct *mm
 
            pte = huge_pte_offset(mm, vaddr);
 
+           if (!pte || pte_none(*pte)) {
+               hugetlb_fault(mm, vma, 0, vaddr);
+               pte = huge_pte_offset(mm, vaddr);
+           }
+
            /* hugetlb should be locked, and hence, prefaulted */
            WARN_ON(!pte || pte_none(*pte));
 
@@ -261,12 +268,17 @@ int pmd_huge(pmd_t pmd)
 }
 
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-       pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+       unsigned long address, pmd_t *pmd, int write)
 {
    struct page *page;
 
    page = pte_page(*(pte_t *)pmd);
+
+   if (!page) {
+       hugetlb_fault(mm, vma, write, address);
+       page = pte_page(*(pte_t *)pmd);
+   }
    if (page) {
        page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
        get_page(page);
@@ -527,6 +539,48 @@ int is_hugepage_mem_enough(size_t size)
    return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
 }
 
+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+   struct file *file = vma->vm_file;
+   struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+   struct page *page;
+   unsigned long idx;
+   pte_t *pte;
+   int ret = VM_FAULT_MINOR;
+
+   BUG_ON(vma->vm_start & ~HPAGE_MASK);
+   BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+   spin_lock(&mm->page_table_lock);
+
+   idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+       + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+   page = find_get_page(mapping, idx);
+
+   if (!page) {
+       page = alloc_hugetlb_page();
+       if (!page) {
+           ret = VM_FAULT_SIGBUS;
+           goto out;
+       }
+       ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+       unlock_page(page);
+       if (ret) {
+           free_huge_page(page);
+           ret = VM_FAULT_SIGBUS;
+           goto out;
+       }
+   }
+   pte = huge_pte_alloc(mm, address);
+   set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+/*     update_mmu_cache(vma, address, *pte); */
+out:
+   spin_unlock(&mm->page_table_lock);
+   return ret;
+}
+
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
diff -dupr linux-2.6.0/arch/ia64/mm/hugetlbpage.c linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.0/arch/ia64/mm/hugetlbpage.c  2003-12-18 11:58:56.000000000 +0900
+++ linux-2.6.0.HugeTLB/arch/ia64/mm/hugetlbpage.c  2004-03-22 11:29:01.000000000 +0900
@@ -170,8 +170,10 @@ int copy_hugetlb_page_range(struct mm_st
            goto nomem;
        src_pte = huge_pte_offset(src, addr);
        entry = *src_pte;
-       ptepage = pte_page(entry);
-       get_page(ptepage);
+       if (!pte_none(entry)) {
+           ptepage = pte_page(entry);
+           get_page(ptepage);
+       }   
        set_pte(dst_pte, entry);
        dst->rss += (HPAGE_SIZE / PAGE_SIZE);
        addr += HPAGE_SIZE;
@@ -195,6 +197,12 @@ follow_hugetlb_page(struct mm_struct *mm
    do {
        pstart = start & HPAGE_MASK;
        ptep = huge_pte_offset(mm, start);
+
+       if (!ptep || pte_none(*ptep)) {
+           hugetlb_fault(mm, vma, 0, start);
+           ptep = huge_pte_offset(mm, start);
+       }
+
        pte = *ptep;
 
 back1:
@@ -236,6 +244,12 @@ struct page *follow_huge_addr(struct mm_
    pte_t *ptep;
 
    ptep = huge_pte_offset(mm, addr);
+
+   if (!ptep || pte_none(*ptep)) {
+       hugetlb_fault(mm, vma, 0, addr);
+       ptep = huge_pte_offset(mm, addr);
+   }
+
    page = pte_page(*ptep);
    page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
    get_page(page);
@@ -246,7 +260,8 @@ int pmd_huge(pmd_t pmd)
    return 0;
 }
 struct page *
-follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
+follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+       unsigned long address, pmd_t *pmd, int write)
 {
    return NULL;
 }
@@ -518,6 +533,49 @@ int is_hugepage_mem_enough(size_t size)
    return 1;
 }
 
+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, int write_access, unsigned long address)
+{
+   struct file *file = vma->vm_file;
+   struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+   struct page *page;
+   unsigned long idx;
+   pte_t *pte;
+   int ret = VM_FAULT_MINOR;
+
+   BUG_ON(vma->vm_start & ~HPAGE_MASK);
+   BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+   spin_lock(&mm->page_table_lock);
+
+   idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+       + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+   page = find_get_page(mapping, idx);
+
+   if (!page) {
+       page = alloc_hugetlb_page();
+       if (!page) {
+           ret = VM_FAULT_SIGBUS;
+           goto out;
+       }
+       ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+       unlock_page(page);
+       if (ret) {
+           free_huge_page(page);
+           ret = VM_FAULT_SIGBUS;
+           goto out;
+       }
+   }
+   pte = huge_pte_alloc(mm, address);
+   set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+   flush_tlb_range(vma, address, address + HPAGE_SIZE);
+   update_mmu_cache(vma, address, *pte);
+out:
+   spin_unlock(&mm->page_table_lock);
+   return ret;
+}
+
+
 static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int unused)
 {
    BUG();
diff -dupr linux-2.6.0/include/linux/hugetlb.h linux-2.6.0.HugeTLB/include/linux/hugetlb.h
--- linux-2.6.0/include/linux/hugetlb.h 2003-12-18 11:58:49.000000000 +0900
+++ linux-2.6.0.HugeTLB/include/linux/hugetlb.h 2003-12-19 09:47:25.000000000 +0900
@@ -23,10 +23,12 @@ struct page *follow_huge_addr(struct mm_
            unsigned long address, int write);
 struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
                    unsigned long address);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-               pmd_t *pmd, int write);
+struct page *follow_huge_pmd(struct mm_struct *mm, struct vm_area_struct *,
+               unsigned long address, pmd_t *pmd, int write);
 int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
 int pmd_huge(pmd_t pmd);
+extern int hugetlb_fault(struct mm_struct *, struct vm_area_struct *,
+               int, unsigned long);
 
 extern int htlbpage_max;
 
@@ -63,6 +65,7 @@ static inline int is_vm_hugetlb_page(str
 #define is_aligned_hugepage_range(addr, len)   0
 #define pmd_huge(x)    0
 #define is_hugepage_only_range(addr, len)  0
+#define hugetlb_fault(mm, vma, write, addr)    0
 
 #ifndef HPAGE_MASK
 #define HPAGE_MASK 0       /* Keep the compiler happy */
diff -dupr linux-2.6.0/mm/memory.c linux-2.6.0.HugeTLB/mm/memory.c
--- linux-2.6.0/mm/memory.c 2003-12-18 11:58:48.000000000 +0900
+++ linux-2.6.0.HugeTLB/mm/memory.c 2003-12-19 09:47:46.000000000 +0900
@@ -640,7 +640,7 @@ follow_page(struct mm_struct *mm, unsign
    if (pmd_none(*pmd))
        goto out;
    if (pmd_huge(*pmd))
-       return follow_huge_pmd(mm, address, pmd, write);
+       return follow_huge_pmd(mm, vma, address, pmd, write);
    if (pmd_bad(*pmd))
        goto out;
 
@@ -1603,7 +1603,7 @@ int handle_mm_fault(struct mm_struct *mm
    inc_page_state(pgfault);
 
    if (is_vm_hugetlb_page(vma))
-       return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+       return hugetlb_fault(mm, vma, write_access, address);
 
    /*
     * We need the page table lock to synchronize with kswapd

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [Lse-tech] Re: Hugetlbpages in very large memory machines.......
@ 2004-03-15 23:31 Seth, Rohit
  0 siblings, 0 replies; 32+ messages in thread
From: Seth, Rohit @ 2004-03-15 23:31 UTC (permalink / raw)
  To: Ray Bryant, Andrew Morton; +Cc: ak, lse-tech, linux-ia64, linux-kernel



>-----Original Message-----
>From: Ray Bryant
>Andrew Morton wrote:
><unrelated text snipped>
>>
>> As for holding mmap_sem for too long, well, that can presumably be
>> worked around by not mmapping the whole lot in one hit?
>>
>
>There are a number of places that one could do this (explicitly in
>user code, hidden at library level, or in do_mmap2() where the
>mm->mmap_sem is taken).  I'm not happy with requiring the user to
>make a modification to solve this kernel problem.  Hiding the split
>has the problem of making sure that if any of the sub-mmap()
>operations fail then the rest of the mmap() operations have to be
>undone, and this all has to happen in a way that makes the mmap()
>look like a single system call.
>
>An alternative would be to put some info in the mm_struct indicating
>that a hugetlb_prefault() is in progress, then drop the mm->mmap_sem
>while hugetlb_prefault() is running.  Once it is done, regrab the
>mm->mmap_sem, clear the "in progress" flag and finish up processing.
>Any other mmap() that got the mmap_sem and found the "in progress"
>flag set would have to fail, perhaps with -EAGAIN (again, an mmap()
>extension).  One can also implement more elaborate schemes where
>there is a list of pending hugetlb mmaps() with the associated
>address space ranges listed; one could check this list in
>get_unmapped_area() and return -EAGAIN if there is a conflict.
>

I think both of the above options are a bit of a stretch.
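
For reference, a rough sketch of the second scheme Ray describes
above.  This is purely illustrative; hugetlb_mmap_pending is a
hypothetical mm_struct field that exists in no kernel tree:

	down_write(&mm->mmap_sem);
	if (mm->hugetlb_mmap_pending) {	/* another prefault in flight */
		up_write(&mm->mmap_sem);
		return -EAGAIN;
	}
	mm->hugetlb_mmap_pending = 1;
	up_write(&mm->mmap_sem);	/* don't hold it across the zeroing */

	error = hugetlb_prefault(mapping, vma);	/* the slow part */

	down_write(&mm->mmap_sem);
	mm->hugetlb_mmap_pending = 0;	/* let other hugetlb mmaps proceed */
	/* ... finish the mmap() bookkeeping under the semaphore ... */
	up_write(&mm->mmap_sem);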

>I'd still rather see us do the "allocate on fault" approach with
>prereservation to maintain the current ENOMEM return code from mmap()
>for hugepages.  Let me work on that and get back to y'all with a patch
>and see where we can go from there.

I think this allocate-on-fault behavior will become essential when
Andi's mbind becomes part of the base kernel.  This scheme also has the
added advantage of following the normal semantics of page allocation
(if a user wants preallocation then MAP_LOCKED can be used).  As Andrew
said earlier in the thread, though, this runs the risk of changing
performance behavior for applications that currently assume
pre-faulting (even if you decrement the count up front but do lazy
allocation), as they will get penalized at fault time.  But this is the
kind of optimization that apps can make when porting to 2.6 based
distributions....
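
A prereservation helper might look roughly like the sketch below; this
is illustrative only, htlb_reserved is a hypothetical counter, and
htlbpage_lock/htlbpagemem are assumed to be the usual hugetlb pool
globals:

/* mmap() time: debit the huge page pool so the ENOMEM semantics of
 * mmap() are preserved, but defer allocating and zeroing the pages
 * until fault time. */
static int hugetlb_reserve_pages(unsigned long npages)
{
	int ret = -ENOMEM;

	spin_lock(&htlbpage_lock);
	if (htlbpagemem - htlb_reserved >= npages) {
		htlb_reserved += npages;	/* promised, not yet faulted */
		ret = 0;
	}
	spin_unlock(&htlbpage_lock);
	return ret;
}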

> I'll start by taking a look at all of the arch dependent
>hugetlbpage.c's and see how common they all are and move the common
>code up to mm/hugetlbpage.c.  (or did WLI's note imply that this is
>impossible?)
>

You should be able to move the prefault code to the common tree.

>However, is this set of changes something that would still be accepted
>in 2.6, or is this now a 2.7 discussion?

^ permalink raw reply	[flat|nested] 32+ messages in thread

Thread overview: 32+ messages
2004-03-13  3:44 Hugetlbpages in very large memory machines Ray Bryant
2004-03-13  3:48 ` Andi Kleen
2004-03-13  5:49   ` William Lee Irwin III
2004-03-13 16:10     ` [Lse-tech] " Andi Kleen
2004-03-14  0:05       ` William Lee Irwin III
2004-03-14  5:22         ` Peter Chubb
     [not found]     ` <844231526.20040313030948@adinet.com.uy>
     [not found]       ` <20040313061232.GB655@holomorphy.com>
2004-03-13 16:32         ` Re[2]: " Luis Mirabal
2004-03-14  2:45   ` Andrew Morton
2004-03-14  4:06     ` [Lse-tech] " Anton Blanchard
2004-03-17 19:05       ` Andy Whitcroft
2004-03-18 20:25         ` Andrew Morton
2004-03-18 21:22           ` Stephen Smalley
2004-03-18 22:21             ` Andy Whitcroft
2004-03-23 17:30         ` Andy Whitcroft
2004-03-24 17:38           ` Andy Whitcroft
2004-03-14  8:38     ` Ray Bryant
2004-03-14  8:48       ` William Lee Irwin III
2004-03-14  8:57       ` Andrew Morton
2004-03-14  9:02         ` Andrew Morton
2004-03-14  9:07         ` William Lee Irwin III
2004-03-15  6:45         ` Ray Bryant
2004-03-15 23:54           ` William Lee Irwin III
2004-03-13  3:55 ` William Lee Irwin III
2004-03-13  4:56 ` Hirokazu Takahashi
2004-03-16  0:30   ` Nobuhiko Yoshida
2004-03-16  1:54     ` Andi Kleen
2004-03-16  2:32       ` Hirokazu Takahashi
2004-03-16  3:20         ` Hirokazu Takahashi
2004-03-16  3:15       ` Nobuhiko Yoshida
2004-04-01  9:10         ` Nobuhiko Yoshida
2004-03-15 15:28 ` jlnance
2004-03-15 23:31 [Lse-tech] " Seth, Rohit
