From: Rafael Aquini <aquini@linux.com>
Date: Thu, 19 May 2011 10:37:13 -0300
Subject: Re: [PATCH] [BUGFIX] mm: hugepages can cause negative commitlimit
To: Russ Anderson <rja@sgi.com>
Cc: Andrea Arcangeli, linux-mm, linux-kernel, Christoph Lameter,
 Andrew Morton, rja@americas.sgi.com
In-Reply-To: <20110519045630.GA22533@sgi.com>
References: <20110518153445.GA18127@sgi.com> <20110519045630.GA22533@sgi.com>

Howdy Russ,

On Thu, May 19, 2011 at 1:56 AM, Russ Anderson <rja@sgi.com> wrote:
> On Wed, May 18, 2011 at 09:51:03PM -0300, Rafael Aquini wrote:
> > Howdy,
> >
> > On Wed, May 18, 2011 at 12:34 PM, Russ Anderson <rja@sgi.com> wrote:
> >
> > > If the total size of hugepages allocated on a system is
> > > over half of the total memory size, commitlimit becomes
> > > a negative number.
> > >
> > > What happens in fs/proc/meminfo.c is this calculation:
> > >
> > >         allowed = ((totalram_pages - hugetlb_total_pages())
> > >                 * sysctl_overcommit_ratio / 100) + total_swap_pages;
> > >
> > > The problem is that hugetlb_total_pages() is larger than
> > > totalram_pages, resulting in a negative number.  Since
> > > allowed is an unsigned long, the negative value shows up as a
> > > big number.
> > >
> > > A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
> > >
> > > A symptom of this problem is that /proc/meminfo prints a
> > > very large CommitLimit number:
> > >
> > > CommitLimit:    737869762947802600 kB
> > >
> > > To reproduce the problem, reserve over half of memory as hugepages,
> > > for example "default_hugepagesz=1G hugepagesz=1G hugepages=64".
> > > Then look at /proc/meminfo "CommitLimit:" to see if it is too big.
> > >
> > > The fix is to not subtract hugetlb_total_pages().  When hugepages
> > > are allocated, totalram_pages is decremented, so there is no need
> > > to subtract out hugetlb_total_pages() a second time.
> > >
> > > Reported-by: Russ Anderson <rja@sgi.com>
> > > Signed-off-by: Russ Anderson <rja@sgi.com>
> > >
> > > ---
> > >
> > > Example of "CommitLimit:" being too big:
> > >
> > > uv1-sys:~ # cat /proc/meminfo
> > > MemTotal:       32395508 kB
> > > MemFree:        32029276 kB
> > > Buffers:            8656 kB
> > > Cached:            89548 kB
> > > SwapCached:            0 kB
> > > Active:            55336 kB
> > > Inactive:          73916 kB
> > > Active(anon):      31220 kB
> > > Inactive(anon):       36 kB
> > > Active(file):      24116 kB
> > > Inactive(file):    73880 kB
> > > Unevictable:           0 kB
> > > Mlocked:               0 kB
> > > SwapTotal:             0 kB
> > > SwapFree:              0 kB
> > > Dirty:              1692 kB
> > > Writeback:             0 kB
> > > AnonPages:         31132 kB
> > > Mapped:            15668 kB
> > > Shmem:               152 kB
> > > Slab:              70256 kB
> > > SReclaimable:      17148 kB
> > > SUnreclaim:        53108 kB
> > > KernelStack:        6536 kB
> > > PageTables:         3704 kB
> > > NFS_Unstable:          0 kB
> > > Bounce:                0 kB
> > > WritebackTmp:          0 kB
> > > CommitLimit:    737869762947802600 kB
> > > Committed_AS:     394044 kB
> > > VmallocTotal:   34359738367 kB
> > > VmallocUsed:      713960 kB
> > > VmallocChunk:   34325764204 kB
> > > HardwareCorrupted:     0 kB
> > > HugePages_Total:      32
> > > HugePages_Free:       32
> > > HugePages_Rsvd:        0
> > > HugePages_Surp:        0
> > > Hugepagesize:    1048576 kB
> > > DirectMap4k:       16384 kB
> > > DirectMap2M:     2064384 kB
> > > DirectMap1G:    65011712 kB
> > >
> > >  fs/proc/meminfo.c |    2 +-
> > >  mm/mmap.c         |    3 +--
> > >  2 files changed, 2 insertions(+), 3 deletions(-)
> > >
> > > Index: linux/fs/proc/meminfo.c
> > > ===================================================================
> > > --- linux.orig/fs/proc/meminfo.c        2011-05-17 16:03:50.935658801 -0500
> > > +++ linux/fs/proc/meminfo.c     2011-05-18 08:53:00.568784147 -0500
> > > @@ -36,7 +36,7 @@ static int meminfo_proc_show(struct seq_
> > >         si_meminfo(&i);
> > >         si_swapinfo(&i);
> > >         committed = percpu_counter_read_positive(&vm_committed_as);
> > > -       allowed = ((totalram_pages - hugetlb_total_pages())
> > > +       allowed = (totalram_pages
> > >                 * sysctl_overcommit_ratio / 100) + total_swap_pages;
> > >
> > >         cached = global_page_state(NR_FILE_PAGES) -
> > > Index: linux/mm/mmap.c
> > > ===================================================================
> > > --- linux.orig/mm/mmap.c        2011-05-17 16:03:51.727658828 -0500
> > > +++ linux/mm/mmap.c     2011-05-18 08:54:34.912222405 -0500
> > > @@ -167,8 +167,7 @@ int __vm_enough_memory(struct mm_struct
> > >                 goto error;
> > >         }
> > >
> > > -       allowed = (totalram_pages - hugetlb_total_pages())
> > > -               * sysctl_overcommit_ratio / 100;
> > > +       allowed = totalram_pages * sysctl_overcommit_ratio / 100;
> > >         /*
> > >          * Leave the last 3% for root
> > >          */
> > > --
> > > Russ Anderson, OS RAS/Partitioning Project Lead
> > > SGI - Silicon Graphics Inc          rja@sgi.com
> >
> > I'm afraid this will introduce a bug in how accurately the kernel
> > accounts memory for overcommitment limits.
> >
> > totalram_pages is not decremented as hugepages are allocated. Since
>
> Are you running on x86?  It decrements totalram_pages on an x86_64
> test system.  Perhaps different architectures allocate hugepages
> differently.
>
> The way it was verified was putting in a printk to print totalram_pages
> and hugetlb_total_pages.  First the system was booted without any huge
> pages.  On the next boot one huge page was allocated, and on the next
> boot more hugepages were allocated.  Each time, totalram_pages was
> reduced by the number of huge pages allocated, with totalram_pages +
> hugetlb_total_pages equaling the original number of pages.
>
> That behavior is also consistent with allocating over half of memory
> resulting in CommitLimit going negative (as is shown in the above
> output).
>
> Here is some data.  Each line represents a boot using 1G hugepages.
>
>   0 hugepages : totalram_pages 16519867 hugetlb_total_pages       0
>   1 hugepages : totalram_pages 16257723 hugetlb_total_pages  262144
>   2 hugepages : totalram_pages 15995578 hugetlb_total_pages  524288
>  31 hugepages : totalram_pages  8393403 hugetlb_total_pages 8126464
>  32 hugepages : totalram_pages  8131258 hugetlb_total_pages 8388608
>
> > hugepages are reserved, hugetlb_total_pages() has to be accounted and
> > subtracted from totalram_pages in order to render an accurate number
> > of remaining pages available to the general memory workload commitment.
> >
> > I've tried to reproduce your findings on my boxes, without
> > success, unfortunately.
>
> Put a printk in meminfo_proc_show() to print totalram_pages and
> hugetlb_total_pages().  Add "default_hugepagesz=1G hugepagesz=1G hugepages=64"
> to the boot line (varying the number of hugepages).
>
> > I'll keep chasing to hit this behaviour, though.
> >
> > Cheers!
> > --aquini
>
> --
> Russ Anderson, OS RAS/Partitioning Project Lead
> SGI - Silicon Graphics Inc          rja@sgi.com

I found what I was doing differently, and you are partially right.
Checking mm/hugetlb.c (around line 1811):

static int __init hugetlb_nrpages_setup(char *s)
{
	...
	/*
	 * Global state is always initialized later in hugetlb_init.
	 * But we need to allocate >= MAX_ORDER hstates here early to still
	 * use the bootmem allocator.
	 */
	if (max_hstate && parsed_hstate->order >= MAX_ORDER)
		hugetlb_hstate_alloc_pages(parsed_hstate);

	last_mhp = mhp;

	return 1;
}
__setup("hugepages=", hugetlb_nrpages_setup);

I realized the issue you've reported only happens when oversized
(gigantic) hugepages are in use.  Since their order is always >=
MAX_ORDER, their pages are allocated early, from the bootmem allocator,
and so they are never accounted in totalram_pages.
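To see how that turns into the bogus CommitLimit, here is a minimal
userspace sketch of the meminfo_proc_show() arithmetic.  This is an
illustration only, not kernel code; it reuses the figures from your
32-hugepage boot above and assumes the default overcommit ratio of 50:

#include <stdio.h>

int main(void)
{
	/* Figures from the 32 x 1G hugepage boot data above (4 KB base
	 * pages).  totalram_pages already excludes the bootmem-allocated
	 * gigantic pages. */
	unsigned long totalram_pages = 8131258;
	unsigned long hugetlb_total_pages = 8388608;	/* 32 GB / 4 KB */
	unsigned long sysctl_overcommit_ratio = 50;	/* assumed default */
	unsigned long total_swap_pages = 0;

	/* The current calculation: the subtraction goes negative, and
	 * unsigned arithmetic wraps it to a huge value. */
	unsigned long allowed = ((totalram_pages - hugetlb_total_pages)
		* sysctl_overcommit_ratio / 100) + total_swap_pages;

	/* Prints a CommitLimit of the same astronomical magnitude as the
	 * 737869762947802600 kB in the /proc/meminfo dump above. */
	printf("CommitLimit: %lu kB\n", allowed << 2);	/* pages -> kB */
	return 0;
}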
Although your patch fixes the reported case, it only works for scenarios
where oversized hugepages are allocated at boot.  For the remaining
scenarios, where hugepage pools are built from the buddy allocator and
totalram_pages is not decremented, I think it will, unfortunately,
reintroduce an accounting bug.

Cheers!
--aquini
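P.S.: For completeness, a small sketch contrasting the two accounting
regimes discussed above.  The boot-time figures come from the printk data
in this thread; the runtime figures are hypothetical, assuming a
same-sized pool of regular hugepages grown via /proc/sys/vm/nr_hugepages:

#include <stdio.h>

int main(void)
{
	/* Regime 1: gigantic pages reserved at boot ("hugepages=32" with
	 * a 1G page size).  They come from bootmem, so totalram_pages is
	 * already reduced (the printk data above). */
	unsigned long totalram_boot = 8131258;
	unsigned long hugetlb_boot  = 8388608;

	/* Regime 2 (hypothetical figures): an equal pool of regular
	 * hugepages built at runtime.  Those pages come from the buddy
	 * allocator, and totalram_pages stays at its boot value. */
	unsigned long totalram_runtime = 16519867;
	unsigned long hugetlb_runtime  = 8388608;

	/* Dropping the hugetlb_total_pages() subtraction is correct in
	 * regime 1 but overstates the commit base by the entire hugepage
	 * pool in regime 2. */
	printf("boot 1G : base = %lu pages\n", totalram_boot);
	printf("runtime : base = %lu pages, should be %lu\n",
	       totalram_runtime, totalram_runtime - hugetlb_runtime);
	return 0;
}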