linux-kernel.vger.kernel.org archive mirror
* Huge pages and small pages. . .
@ 2006-01-17 18:52 John Richard Moser
  2006-01-17 19:06 ` William Lee Irwin III
  2006-01-17 19:18 ` linux-os (Dick Johnson)
  0 siblings, 2 replies; 9+ messages in thread
From: John Richard Moser @ 2006-01-17 18:52 UTC (permalink / raw)
  To: linux-kernel


Is there anything in the kernel that shifts the physical pages for 1024
physically allocated and contiguous virtual pages together physically
and remaps them as one huge page?  This would probably work well for the
low end of the heap, until someone figures out a way to tell the system
to free intermittent pages in a big mapping (if the heap has an
allocation up high, it can have huge, unused areas that are allocated).
 It may possibly work for disk cache as well, albeit I can't say for
sure if it's common to have a 4 meg contiguous section of program data
loaded.

Shifting odd huge allocations around would be neat too, re:

{2m}[4M  ]{2m}  ->  [4M  ][4M  ]

--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-17 18:52 Huge pages and small pages. . John Richard Moser
@ 2006-01-17 19:06 ` William Lee Irwin III
  2006-01-17 19:41   ` John Richard Moser
  2006-01-17 19:18 ` linux-os (Dick Johnson)
  1 sibling, 1 reply; 9+ messages in thread
From: William Lee Irwin III @ 2006-01-17 19:06 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel

On Tue, Jan 17, 2006 at 01:52:20PM -0500, John Richard Moser wrote:
> Is there anything in the kernel that shifts the physical pages for 1024
> physically allocated and contiguous virtual pages together physically
> and remaps them as one huge page?  This would probably work well for the
> low end of the heap, until someone figures out a way to tell the system
> to free intermittent pages in a big mapping (if the heap has an
> allocation up high, it can have huge, unused areas that are allocated).
>  It may possibly work for disk cache as well, albeit I can't say for
> sure if it's common to have a 4 meg contiguous section of program data
> loaded.
> Shifting odd huge allocations around would be neat too, re:
> {2m}[4M  ]{2m}  ->  [4M  ][4M  ]

I've got bugs and feature work written by others that has sat on hold
for ages to merge, so I won't be looking to experiment myself.

Do write things yourself and send in the resulting patches, though.


-- wli

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-17 18:52 Huge pages and small pages. . John Richard Moser
  2006-01-17 19:06 ` William Lee Irwin III
@ 2006-01-17 19:18 ` linux-os (Dick Johnson)
  2006-01-17 19:40   ` John Richard Moser
  1 sibling, 1 reply; 9+ messages in thread
From: linux-os (Dick Johnson) @ 2006-01-17 19:18 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-kernel


On Tue, 17 Jan 2006, John Richard Moser wrote:

> Is there anything in the kernel that shifts the physical pages for 1024
> physically allocated and contiguous virtual pages together physically
> and remaps them as one huge page?  This would probably work well for the

A page is something that is defined by the CPU. Perhaps you mean
"order"? When acquiring pages for DMA, they need to be contiguous if
you are going to access more than one page at a time. Therefore, one
can attempt to get two or more pages, i.e., the order of pages.

Since the CPU uses virtual memory always, there is no advantage to
having contiguous pages. You just map anything that's free into
what looks like contiguous memory and away you go.

> low end of the heap, until someone figures out a way to tell the system
> to free intermittent pages in a big mapping (if the heap has an
> allocation up high, it can have huge, unused areas that are allocated).

The actual allocation only occurs when an access happens. You can
allocate all the virtual memory in the machine and never use any
of it. When you allocate memory, the kernel just marks a promised
page 'not present'. When you attempt to access it, a page-fault
occurs and the kernel tries to find a free page to map into your
address space.
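
The behaviour described above can be observed from user space. The following is only a rough Python sketch, not anything from the kernel: it maps a large anonymous region, writes to a handful of pages, and reports how many minor faults the process took meanwhile. The function name `touch_pages` and the sizes are made up for illustration, and the fault count depends on the system.

```python
import mmap
import resource

PAGE_SIZE = mmap.PAGESIZE


def touch_pages(length_bytes, pages_to_touch):
    """Map anonymous memory, then write to only the first few pages.

    Returns (pages_touched, minor_faults_observed). The fault count is
    whatever the OS reports and will vary between systems.
    """
    region = mmap.mmap(-1, length_bytes)  # anonymous mapping, nothing touched yet
    try:
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        touched = 0
        for i in range(pages_to_touch):
            region[i * PAGE_SIZE] = 1  # the first write to a page faults it in
            touched += 1
        after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    finally:
        region.close()
    return touched, after - before


if __name__ == "__main__":
    touched, faults = touch_pages(64 * 1024 * 1024, 16)
    print(f"mapped 64 MiB, touched {touched} pages, {faults} minor faults seen")
```

Mapping 64 MiB costs almost nothing by itself; the physical pages only show up as the loop writes to them.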

> It may possibly work for disk cache as well, albeit I can't say for
> sure if it's common to have a 4 meg contiguous section of program data
> loaded.
>

But it __is__ contiguous as far as the program is concerned.
The only time you need physically contiguous pages is when a
DMA operation occurs that crosses page boundaries. Otherwise,
it's a waste of time and CPU resources trying to make
something contiguous. Also, modern DMA engines use scatter-
lists so one can DMA to pages scattered all over the address-
space in one operation. In this case, you just build a list of
pages. You don't care where they physically reside, although you
do need to tell the DMA engine their correct locations.
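
As a toy illustration of the "list of pages" idea, and not the kernel's actual scatterlist API, the Python sketch below turns a list of page frame numbers into (physical address, length) segments, merging frames that happen to be adjacent. `build_scatter_list` and the 4096-byte page size are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def build_scatter_list(frame_numbers):
    """Turn page frame numbers into (physical_address, length) segments.

    Frames that happen to be physically adjacent are merged into one
    segment; scattered frames each get their own entry.
    """
    segments = []
    for pfn in frame_numbers:
        addr = pfn * PAGE_SIZE
        if segments and segments[-1][0] + segments[-1][1] == addr:
            start, length = segments[-1]
            segments[-1] = (start, length + PAGE_SIZE)  # extend the last segment
        else:
            segments.append((addr, PAGE_SIZE))          # start a new segment
    return segments


if __name__ == "__main__":
    for addr, length in build_scatter_list([10, 11, 12, 40, 7]):
        print(hex(addr), length)
```

The device is handed the segment list, so the buffer does not need to be physically contiguous.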

Now there are some M$Garbage "high-memory" so-called enhancements
that, using page-registers, "map" more than 4 GB of memory into
the 4 GB address space. This is like the garbage that M$ created
to use "high memory" for DOS. Use of this kind of hardware-hack
is not relevant to the discussion about virtual memory.

> Shifting odd huge allocations around would be neat too, re:
>
> {2m}[4M  ]{2m}  ->  [4M  ][4M  ]
>

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.54 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged.  Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited.  If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-17 19:18 ` linux-os (Dick Johnson)
@ 2006-01-17 19:40   ` John Richard Moser
  2006-01-17 23:18     ` Paul Mundt
  2006-01-18 10:36     ` Helge Hafting
  0 siblings, 2 replies; 9+ messages in thread
From: John Richard Moser @ 2006-01-17 19:40 UTC (permalink / raw)
  To: linux-os (Dick Johnson); +Cc: linux-kernel

linux-os (Dick Johnson) wrote:
> On Tue, 17 Jan 2006, John Richard Moser wrote:
> 
> 
> Is there anything in the kernel that shifts the physical pages for 1024
> physically allocated and contiguous virtual pages together physically
> and remaps them as one huge page?  This would probably work well for the
> 
> 
>> A page is something that is defined by the CPU. Perhaps you mean
>> "order"? When acquiring pages for DMA, they need to be contiguous if
>> you are going to access more than one page at a time. Therefore, one
>> can attempt to get two or more pages, i.e., the order of pages.
> 
>> Since the CPU uses virtual memory always, there is no advantage to
>> having contiguous pages. You just map anything that's free into
>> what looks like contiguous memory and away you go.
> 

Well, pages are typically 4KiB as seen by the MMU.  If you fault across
them, you need to have them cached in the TLB; if the TLB runs out of
room, you invalidate entries; then when you hit entries not in the TLB,
the MMU has to search for the page mapping in the page tables.

There are 4MiB pages, called "huge pages": if you clump 1024 contiguous
4KiB pages together and draw up a single entry for them, they correlate
to a single TLB entry.  In this way, there's no page faulting until you
cross boundaries spaced 4MiB apart from each other, and you use 1 TLB
entry where you would normally use 1024.
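
The arithmetic, as a small Python sketch. The 256-entry TLB is just the figure used elsewhere in this message; real CPUs differ, and many keep a separate, smaller TLB for large pages, so treat the numbers as illustrative only.

```python
SMALL_PAGE = 4 * 1024          # 4 KiB
HUGE_PAGE = 4 * 1024 * 1024    # 4 MiB (32-bit x86 without PAE)


def small_pages_per_huge_page():
    """How many 4 KiB pages one 4 MiB page replaces."""
    return HUGE_PAGE // SMALL_PAGE


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_size


if __name__ == "__main__":
    entries = 256  # assumed TLB size
    print(small_pages_per_huge_page(), "small pages per huge page")
    print(tlb_reach(entries, SMALL_PAGE) // 2**20, "MiB reach with 4 KiB pages")
    print(tlb_reach(entries, HUGE_PAGE) // 2**20, "MiB reach with 4 MiB pages")
```

With these assumptions the same TLB covers 1 MiB of address space with small pages and 1 GiB with huge pages.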

> 
> low end of the heap, until someone figures out a way to tell the system
> to free intermittent pages in a big mapping (if the heap has an
> allocation up high, it can have huge, unused areas that are allocated).
> 
> 
>> The actual allocation only occurs when an access happens. You can
>> allocate all the virtual memory in the machine and never use any
>> of it. When you allocate memory, the kernel just marks a promised
>> page 'not present'. When you attempt to access it, a page-fault
>> occurs and the kernel tries to find a free page to map into your
>> address space.
> 

Yes.  The heap manager brk()s up the heap to allocate more space, all
mapped to the zero page; then the application writes to these pages,
causing them to be COW'd to real memory.  They will stay forever
allocated until the highest pages of the heap are unused by the program,
in which case the heap manager brk()s down the heap and frees them to
the system.

Currently the heap manager can't seem to tell the system that a page
somewhere in the middle of the heap isn't really needed anymore, and can
be freed and mapped back to the zero page to await COW again.  So in
effect, you'll eventually wind up with a heap that's 20, 50, 100, or
200 megs wide and probably all actually mapped to real, usable memory;
at this point, you can probably replace most of those entries with huge
pages to save on TLB entries and page faults.

When the program would try to free up "pages" in a huge page, the kernel
would have to recognize that the application is working in terms of 4KiB
small pages, and take appropriate action shattering the huge page back
into 1024 small pages first.
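
For reference, one interface that may be relevant to handing back a page in the middle of a mapping is madvise() with MADV_DONTNEED. On Linux, applied to a private anonymous mapping, it drops the pages, and the next access sees zero-filled memory. Whether a given heap manager uses it is a separate question. The Python sketch below (Python 3.8+; `release_middle_page` is a made-up name) only demonstrates that behaviour and returns None on platforms where it does not apply.

```python
import mmap
import sys

PAGE_SIZE = mmap.PAGESIZE


def release_middle_page(num_pages=8, victim=3):
    """Dirty every page of a private anonymous mapping, then hand one
    page in the middle back to the kernel.

    Returns the first byte of the victim page as read afterwards, or
    None where the Linux MADV_DONTNEED behaviour is not available.
    """
    if not sys.platform.startswith("linux"):
        return None
    if not (hasattr(mmap, "MADV_DONTNEED") and hasattr(mmap.mmap, "madvise")):
        return None
    region = mmap.mmap(-1, num_pages * PAGE_SIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        for i in range(num_pages):
            region[i * PAGE_SIZE] = 0xFF      # fault every page in
        region.madvise(mmap.MADV_DONTNEED, victim * PAGE_SIZE, PAGE_SIZE)
        return region[victim * PAGE_SIZE]     # touching it again re-faults the page
    finally:
        region.close()


if __name__ == "__main__":
    print("byte read back after release:", release_middle_page())
```

If such a range sat inside a huge page, the kernel would presumably have to split the huge page first, as described above.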

> 
> It may possibly work for disk cache as well, albeit I can't say for
> sure if it's common to have a 4 meg contiguous section of program data
> loaded.
> 
> 
> 
>> But it __is__ contiguous as far as the program is concerned.

It's a contiguous set of 1024 PTE entries the MMU has to juggle around
in a TLB that can handle 256 entries at once.

>> The only time you need physically contiguous pages is when a
>> DMA operation occurs that crosses page boundaries. Otherwise,
>> it's a waste of time and CPU resources trying to make
>> something contiguous. Also, modern DMA engines use scatter-
>> lists so one can DMA to pages scattered all over the address-
>> space in one operation. In this case, you just build a list of
>> pages. You don't care where they physically reside, although you
>> do need to tell the DMA engine their correct locations.
> 
>> Now there are some M$Garbage "high-memory" so-called enhancements
>> that, using page-registers, "map" more than 4 GB of memory into
>> the 4 GB address space. This is like the garbage that M$ created
>> to use "high memory" for DOS. Use of this kind of hardware-hack
>> is not relevant to the discussion about virtual memory.
> 
> 
> Shifting odd huge allocations around would be neat too, re:
> 
> {2m}[4M  ]{2m}  ->  [4M  ][4M  ]
> 



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-17 19:06 ` William Lee Irwin III
@ 2006-01-17 19:41   ` John Richard Moser
  0 siblings, 0 replies; 9+ messages in thread
From: John Richard Moser @ 2006-01-17 19:41 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: linux-kernel

William Lee Irwin III wrote:
> On Tue, Jan 17, 2006 at 01:52:20PM -0500, John Richard Moser wrote:
> 
>>Is there anything in the kernel that shifts the physical pages for 1024
>>physically allocated and contiguous virtual pages together physically
>>and remaps them as one huge page?  This would probably work well for the
>>low end of the heap, until someone figures out a way to tell the system
>>to free intermittent pages in a big mapping (if the heap has an
>>allocation up high, it can have huge, unused areas that are allocated).
>> It may possibly work for disk cache as well, albeit I can't say for
>>sure if it's common to have a 4 meg contiguous section of program data
>>loaded.
>>Shifting odd huge allocations around would be neat too, re:
>>{2m}[4M  ]{2m}  ->  [4M  ][4M  ]
> 
> 
> I've got bugs and feature work written by others that has sat on hold
> for ages to merge, so I won't be looking to experiment myself.
> 
> Do write things yourself and send in the resulting patches, though.
> 

A simple "no" would have sufficed; I was trying to find out if it's
there already.
> 
> -- wli
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-17 19:40   ` John Richard Moser
@ 2006-01-17 23:18     ` Paul Mundt
  2006-01-18  5:50       ` Ian Wienand
  2006-01-18 10:36     ` Helge Hafting
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Mundt @ 2006-01-17 23:18 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-os (Dick Johnson), linux-kernel


On Tue, Jan 17, 2006 at 02:40:10PM -0500, John Richard Moser wrote:
> Well, pages are typically 4KiB as seen by the MMU.  If you fault across
> them, you need to have them cached in the TLB; if the TLB runs out of
> room, you invalidate entries; then when you hit entries not in the TLB,
> the MMU has to search for the page mapping in the page tables.
> 
> There are 4MiB pages, called "huge pages": if you clump 1024 contiguous
> 4KiB pages together and draw up a single entry for them, they correlate
> to a single TLB entry.  In this way, there's no page faulting until you
> cross boundaries spaced 4MiB apart from each other, and you use 1 TLB
> entry where you would normally use 1024.
> 
Transparent superpages would certainly be nice. There's already various
superpage implementations floating around, but not without their
drawbacks. You might consider the Shimizu superpage patch for x86 if
you're not too concerned about demotion when trying to swap out the page.

There's some links on this subject on the ia64 wiki:

	http://www.gelato.unsw.edu.au/IA64wiki/SuperPages

Alternately, if you're simply interested in cutting down on the page
fault overhead and want a simpler and more naive approach, read the Rice
paper and consider something like the IRIX/HP-UX approach (though for
x86, arm, etc. this might be slightly more work, since the larger pages
are at the PMD level, as opposed to other architectures where it's just a
matter of setting some bits in the PTE).
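
To make the "one level up" point concrete for 32-bit x86 without PAE, here is a small Python sketch of how a virtual address splits under each page size. The function names are made up for illustration. With 4KiB pages the walk goes directory -> table -> page; with a 4MiB page the directory entry maps the page directly and the low 22 bits are the offset.

```python
def split_4k(vaddr):
    """32-bit x86, no PAE, 4 KiB page: 10-bit directory index,
    10-bit table index, 12-bit offset."""
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF


def split_4m(vaddr):
    """Same address through a 4 MiB mapping: 10-bit directory index,
    22-bit offset; no second-level table is consulted."""
    return (vaddr >> 22) & 0x3FF, vaddr & 0x3FFFFF


if __name__ == "__main__":
    addr = 0x08049ABC
    print([hex(x) for x in split_4k(addr)])
    print([hex(x) for x in split_4m(addr)])
```

This is why promotion on such an architecture means replacing a whole second-level table with a single upper-level entry, rather than flipping bits in an existing PTE.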

Since this topic seems to come up rather frequently, perhaps it would be
worthwhile documenting some of this on the linux-mm wiki.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-17 23:18     ` Paul Mundt
@ 2006-01-18  5:50       ` Ian Wienand
  0 siblings, 0 replies; 9+ messages in thread
From: Ian Wienand @ 2006-01-18  5:50 UTC (permalink / raw)
  To: Paul Mundt, John Richard Moser, linux-os (Dick Johnson), linux-kernel
  Cc: Paul Cameron Davies


On Wed, Jan 18, 2006 at 01:18:02AM +0200, Paul Mundt wrote:
> Transparent superpages would certainly be nice. There's already various
> superpage implementations floating around, but not without their
> drawbacks. You might consider the Shimizu superpage patch for x86 if
> you're not too concerned about demotion when trying to swap out the page.
> 
> There's some links on this subject on the ia64 wiki:
> 
> 	http://www.gelato.unsw.edu.au/IA64wiki/SuperPages

Hi,

I'm working on superpages, targeted at Itanium, and I just updated
the wiki page to explain a bit more where I'm at.

http://www.gelato.unsw.edu.au/IA64wiki/Ia64SuperPages

We're also looking at other things, like completely abstracting out
the 3 level page table to allow us to experiment with VM schemes more
appropriate for large, sparse address spaces and superpages.

This is far from realistic migrate-into-the-kernel stuff (so not
really for here), but if sufficient (say 5 or 10) people drop me a
reply email to say they're interested I can start a low-traffic list
where we can discuss long term (far-fetched?) VM implementation ideas,
share patches, etc.

> Since this topic seems to come up rather frequently, perhaps it would be
> worthwhile documenting some of this on the linux-mm wiki.

Yes, I'll look at putting something in there too.

-i
ianw@gelato.unsw.edu.au
http://www.gelato.unsw.edu.au


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-17 19:40   ` John Richard Moser
  2006-01-17 23:18     ` Paul Mundt
@ 2006-01-18 10:36     ` Helge Hafting
  2006-01-18 19:11       ` John Richard Moser
  1 sibling, 1 reply; 9+ messages in thread
From: Helge Hafting @ 2006-01-18 10:36 UTC (permalink / raw)
  To: John Richard Moser; +Cc: linux-os (Dick Johnson), linux-kernel

John Richard Moser wrote:

>linux-os (Dick Johnson) wrote:
>  
>
>>On Tue, 17 Jan 2006, John Richard Moser wrote:
>>
>>
>>Is there anything in the kernel that shifts the physical pages for 1024
>>physically allocated and contiguous virtual pages together physically
>>and remaps them as one huge page?  This would probably work well for the
>>
>>
>>    
>>
>>>A page is something that is defined by the CPU. Perhaps you mean
>>>"order"? When acquiring pages for DMA, they need to be contiguous if
>>>you are going to access more than one page at a time. Therefore, one
>>>can attempt to get two or more pages, i.e., the order of pages.
>>>      
>>>
>>>Since the CPU uses virtual memory always, there is no advantage to
>>>having contiguous pages. You just map anything that's free into
>>>what looks like contiguous memory and away you go.
>>>      
>>>
>
>Well, pages are typically 4KiB as seen by the MMU.  If you fault across
>them, you need to have them cached in the TLB; if the TLB runs out of
>room, you invalidate entries; then when you hit entries not in the TLB,
>the MMU has to search for the page mapping in the page tables.
>
>There are 4MiB pages, called "huge pages": if you clump 1024 contiguous
>4KiB pages together and draw up a single entry for them, they correlate
>to a single TLB entry.  In this way, there's no page faulting until you
>cross boundaries spaced 4MiB apart from each other, and you use 1 TLB
>entry where you would normally use 1024.
>
>  
>
>>low end of the heap, until someone figures out a way to tell the system
>>to free intermittent pages in a big mapping (if the heap has an
>>allocation up high, it can have huge, unused areas that are allocated).
>>
>>
>>    
>>
>>>The actual allocation only occurs when an access happens. You can
>>>allocate all the virtual memory in the machine and never use any
>>>of it. When you allocate memory, the kernel just marks a promised
>>>page 'not present'. When you attempt to access it, a page-fault
>>>occurs and the kernel tries to find a free page to map into your
>>>address space.
>>>      
>>>
>
>Yes.  The heap manager brk()s up the heap to allocate more space, all
>mapped to the zero page; then the application writes to these pages,
>causing them to be COW'd to real memory.  They will stay forever
>allocated until the highest pages of the heap are unused by the program,
>in which case the heap manager brk()s down the heap and frees them to
>the system.
>
>Currently the heap manager can't seem to tell the system that a page
>somewhere in the middle of the heap isn't really needed anymore, and can
>be freed and mapped back to the zero page to await COW again.  So in
>effect, you'll eventually wind up with a heap that's 20, 50, 100, or
>200 megs wide and probably all actually mapped to real, usable memory;
>at this point, you can probably replace most of those entries with huge
>pages to save on TLB entries and page faults.
>  
>
This would be a nice performance win _if_ the program is
actually using all those pages regularly. And I mean _all_.

The idea fails if we get any memory pressure, and some
of the little-used pages get swapped out.  You can't swap
out part of a huge page, so you either have to break it up,
or suffer the possibly large performance loss of having
one megabyte less of virtual memory.   (That'll be even worse
if several processes do this, of course.)

>When the program would try to free up "pages" in a huge page, the kernel
>would have to recognize that the application is working in terms of 4KiB
>small pages, and take appropriate action shattering the huge page back
>into 1024 small pages first.
>  
>
To see whether this is worthwhile, I suggest instrumenting a kernel
to detect:
(1) When a process gets the opportunity to have a huge page
      (I.e. it has a sufficiently large and aligned memory block,
       none of it is swapped, and rearranging into contiguous
       physical memory is possible.)
(2) When a huge page would need to be broken up.
      (I.e. parts of a previously identified huge page get swapped
       out, deallocated, shared, memprotected and so on.)
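
The two checks might be modelled roughly as follows. This is a Python sketch over made-up bookkeeping (`PageState`, `can_promote` and `must_demote` are hypothetical names), not kernel code. It leaves out the part of (1) about whether contiguous physical memory can actually be found.

```python
from dataclasses import dataclass

PAGES_PER_HUGE = 1024  # 4 MiB / 4 KiB, as in the thread


@dataclass
class PageState:           # hypothetical per-page bookkeeping
    present: bool = True   # mapped to real memory
    swapped: bool = False
    shared: bool = False
    prot: str = "rw"       # protection bits, simplified to a string


def can_promote(first_vpn, pages):
    """Check (1): an aligned, fully populated, uniform block of pages."""
    if first_vpn % PAGES_PER_HUGE != 0 or len(pages) != PAGES_PER_HUGE:
        return False
    prot = pages[0].prot
    return all(p.present and not p.swapped and not p.shared and p.prot == prot
               for p in pages)


def must_demote(pages):
    """Check (2): some page was swapped, freed, shared or re-protected."""
    prot = pages[0].prot
    return any((not p.present) or p.swapped or p.shared or p.prot != prot
               for p in pages)


if __name__ == "__main__":
    block = [PageState() for _ in range(PAGES_PER_HUGE)]
    print(can_promote(2048, block), must_demote(block))
    block[5].swapped = True
    print(can_promote(2048, block), must_demote(block))
```

Counting how often each check fires, and how long a block stays promotable, would give the statistics suggested here.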


Both of these detectors will be necessary anyway if automatic huge
pages get deployed.  In the meantime, you can check to see
if such occasions arise and for how long the huge pages remain
viable. I believe the swapper will kill them fast once you have
any memory pressure, and that fragmentation normally will prevent
such pages from forming.  But I could be wrong of course.

For this to be feasible, the reduction in TLB misses would have to
outweigh the initial cost of copying all that memory into a contiguous
region, the cost of shattering the mapping once the swapper gets
aggressive, and of course the ongoing cost of identifying mergeable blocks.
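
Put as a break-even test, in a Python sketch: the function name and every number in the example are invented for illustration, and measuring the real values is exactly the instrumentation work suggested above.

```python
def promotion_pays_off(misses_avoided, cycles_per_miss,
                       copy_cycles, shatter_cycles, scan_cycles):
    """True when the TLB-miss savings exceed the three costs listed above."""
    benefit = misses_avoided * cycles_per_miss
    cost = copy_cycles + shatter_cycles + scan_cycles
    return benefit > cost


if __name__ == "__main__":
    # Invented figures: a long-lived block versus one that is shattered early.
    print(promotion_pays_off(1_000_000, 30, 4_000_000, 500_000, 1_000_000))
    print(promotion_pays_off(10_000, 30, 4_000_000, 500_000, 1_000_000))
```

The shorter a huge page survives before being shattered, the fewer misses it can avoid, so the left-hand side shrinks while the costs stay the same.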

Helge Hafting

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Huge pages and small pages. . .
  2006-01-18 10:36     ` Helge Hafting
@ 2006-01-18 19:11       ` John Richard Moser
  0 siblings, 0 replies; 9+ messages in thread
From: John Richard Moser @ 2006-01-18 19:11 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-os (Dick Johnson), linux-kernel

Helge Hafting wrote:
> John Richard Moser wrote:
> 
>> linux-os (Dick Johnson) wrote:
>>  
>>
>>> On Tue, 17 Jan 2006, John Richard Moser wrote:
>>>
>>>
>>> Is there anything in the kernel that shifts the physical pages for 1024
>>> physically allocated and contiguous virtual pages together physically
>>> and remaps them as one huge page?  This would probably work well for the
>>>
>>>
>>>   
>>>
>>>> A page is something that is defined by the CPU. Perhaps you mean
>>>> "order"? When acquiring pages for DMA, they need to be contiguous if
>>>> you are going to access more than one page at a time. Therefore, one
>>>> can attempt to get two or more pages, i.e., the order of pages.
>>>>     
>>>> Since the CPU uses virtual memory always, there is no advantage to
>>>> having contiguous pages. You just map anything that's free into
>>>> what looks like contiguous memory and away you go.
>>>>     
>>
>>
>> Well, pages are typically 4KiB as seen by the MMU.  If you fault across
>> them, you need to have them cached in the TLB; if the TLB runs out of
>> room, you invalidate entries; then when you hit entries not in the TLB,
>> the MMU has to search for the page mapping in the page tables.
>>
>> There are 4MiB pages, called "huge pages": if you clump 1024 contiguous
>> 4KiB pages together and draw up a single entry for them, they correlate
>> to a single TLB entry.  In this way, there's no page faulting until you
>> cross boundaries spaced 4MiB apart from each other, and you use 1 TLB
>> entry where you would normally use 1024.
>>
>>  
>>
>>> low end of the heap, until someone figures out a way to tell the system
>>> to free intermittent pages in a big mapping (if the heap has an
>>> allocation up high, it can have huge, unused areas that are allocated).
>>>
>>>
>>>   
>>>
>>>> The actual allocation only occurs when an access happens. You can
>>>> allocate all the virtual memory in the machine and never use any
>>>> of it. When you allocate memory, the kernel just marks a promised
>>>> page 'not present'. When you attempt to access it, a page-fault
>>>> occurs and the kernel tries to find a free page to map into your
>>>> address space.
>>>>     
>>
>>
>> Yes.  The heap manager brk()s up the heap to allocate more space, all
>> mapped to the zero page; then the application writes to these pages,
>> causing them to be COW'd to real memory.  They will stay forever
>> allocated until the highest pages of the heap are unused by the program,
>> in which case the heap manager brk()s down the heap and frees them to
>> the system.
>>
>> Currently the heap manager can't seem to tell the system that a page
>> somewhere in the middle of the heap isn't really needed anymore, and can
>> be freed and mapped back to the zero page to await COW again.  So in
>> effect, you'll eventually wind up with a heap that's 20, 50, 100, or
>> 200 megs wide and probably all actually mapped to real, usable memory;
>> at this point, you can probably replace most of those entries with huge
>> pages to save on TLB entries and page faults.
>>  
>>
> This would be a nice performance win _if_ the program is
> actually using all those pages regularly. And I mean _all_.
> 
> The idea fails if we get any memory pressure, and some
> of the little-used pages get swapped out.  You can't swap
> out part of a huge page, so you either have to break it up,
> or suffer the possibly large performance loss of having
> one megabyte less of virtual memory.   (That'll be even worse
> if several processes do this, of course.)
> 
>> When the program would try to free up "pages" in a huge page, the kernel
>> would have to recognize that the application is working in terms of 4KiB
>> small pages, and take appropriate action shattering the huge page back
>> into 1024 small pages first.
>>  
>>
> To see whether this is worthwhile, I suggest instrumenting a kernel
> to detect:
> (1) When a process gets the opportunity to have a huge page
>      (I.e. it has a sufficiently large and aligned memory block,
>       none of it is swapped, and rearranging into contiguous
>       physical memory is possible.)
> (2) When a huge page would need to be broken up.
>      (I.e. parts of a previously identified huge page get swapped
>       out, deallocated, shared, memprotected and so on.)
> 
> 
> Both of these detectors will be necessary anyway if automatic huge
> pages get deployed.  In the meantime, you can check to see
> if such occasions arise and for how long the huge pages remain
> viable. I believe the swapper will kill them fast once you have
> any memory pressure, and that fragmentation normally will prevent
> such pages from forming.  But I could be wrong of course.
> 

You're probably right.  On a system with optimum memory, this would be
optimal; aside from that it could get painful.

> For this to be feasible, the reduction in TLB misses would have to
> outweigh the initial cost of copying all that memory into a contiguous
> region, the cost of shattering the mapping once the swapper gets
> aggressive, and of course the ongoing cost of identifying mergeable blocks.
> 
> Helge Hafting
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-01-18 19:12 UTC | newest]

Thread overview: 9+ messages
2006-01-17 18:52 Huge pages and small pages. . John Richard Moser
2006-01-17 19:06 ` William Lee Irwin III
2006-01-17 19:41   ` John Richard Moser
2006-01-17 19:18 ` linux-os (Dick Johnson)
2006-01-17 19:40   ` John Richard Moser
2006-01-17 23:18     ` Paul Mundt
2006-01-18  5:50       ` Ian Wienand
2006-01-18 10:36     ` Helge Hafting
2006-01-18 19:11       ` John Richard Moser
