linux-kernel.vger.kernel.org archive mirror
[parent not found: <200208041331.24895.frankeh@watson.ibm.com.suse.lists.linux.kernel>]
* RE: large page patch (fwd) (fwd)
@ 2002-08-05 23:30 Seth, Rohit
  2002-08-06  5:01 ` David Mosberger
  2002-08-06 19:11 ` Hubertus Franke
  0 siblings, 2 replies; 110+ messages in thread
From: Seth, Rohit @ 2002-08-05 23:30 UTC (permalink / raw)
  To: 'frankeh@watson.ibm.com', Linus Torvalds
  Cc: David S. Miller, davidm, davidm, gh, Martin.Bligh, wli, linux-kernel



> -----Original Message-----
> From: Hubertus Franke [mailto:frankeh@watson.ibm.com]
> Sent: Sunday, August 04, 2002 12:30 PM
> To: Linus Torvalds
> Cc: David S. Miller; davidm@hpl.hp.com; davidm@napali.hpl.hp.com;
> gh@us.ibm.com; Martin.Bligh@us.ibm.com; wli@holomorphy.com;
> linux-kernel@vger.kernel.org
> Subject: Re: large page patch (fwd) (fwd)
> 
> Well, in what you described above there is no concept of superpages
> the way it is defined for the purpose of <tracking> and <TLB overhead 
> reduction>. 
> If you don't know about super pages at the VM level, then you need to
> deal with them at TLB fault level to actually create the <large TLB> 
> entry. That is what the INTC patch will do, namely throwing all the 
> complexity over the fence for the page fault.
Our patch preallocates large pages at the time of the request.  Nothing
special, such as the PTE replication you mention in your design below,
happens there.  In any case, even on IA-64, where the TLBs are also
software controlled (we also have a hardware page walker that can walk any
third-level page table and insert the PTE into the TLB), our implementation
makes almost no changes that force the low-level TLB fault handlers to know
about the large page size when traversing the 3-level page table.  To be
precise, there is one additional asm instruction at the beginning of the
handler for shifting extra bits.  (A couple of other asm instructions are
added to this low-level routine to set a helper register with the proper
page_size when inserting bigger TLB entries.)  On IA-32, things obviously
fall into place automagically, because the page tables are set up as the
architecture requires.

> In your case not keeping track of the super pages in the VM layer and
> PT layer requires discovering the large page at soft TLB time by
> scanning PT proximity for contiguous pages, if we are talking now about
> the read_ahead ....
> In our case, we store the same physical address of the super page 
> in the PTEs spanning the superpage together with the page order.
> At software TLB time we simply extract the single PTE from the PT based
> on the faulting address and move it into the TLB. This of course works
> only for software TLBs (PowerPC, MIPS, IA64). For HW TLB (x86) the PT
> structure by definition overlaps the large page size support.
> The HW TLB case can be extended to not store the same PA in all the
> PTEs, but conceptually carry the superpage concept for the purpose
> described above.
> 
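The replicated-PTE scheme quoted above can be illustrated with a toy model.
This is only a sketch: the dictionary page table, the 16 KiB base page size
and the names map_superpage and tlb_fill are assumptions made for
illustration, not code from either patch.

```python
# Toy model of the quoted design: every base-page PTE inside a superpage
# stores the same (superpage physical address, page order) pair.
# PAGE_SHIFT and all function names are hypothetical.

PAGE_SHIFT = 14  # assumed 16 KiB base page


def map_superpage(page_table, vaddr, paddr, order):
    """Store the same (superpage PA, order) in every base-page PTE it spans."""
    size = 1 << (PAGE_SHIFT + order)
    if vaddr % size or paddr % size:
        raise ValueError("superpage must be naturally aligned")
    first_vpn = vaddr >> PAGE_SHIFT
    for vpn in range(first_vpn, first_vpn + (1 << order)):
        page_table[vpn] = (paddr, order)


def tlb_fill(page_table, fault_vaddr):
    """Software TLB miss: one PTE lookup yields the whole superpage mapping."""
    paddr, order = page_table[fault_vaddr >> PAGE_SHIFT]
    size = 1 << (PAGE_SHIFT + order)
    return {"vbase": fault_vaddr & ~(size - 1), "pbase": paddr, "size": size}


if __name__ == "__main__":
    pt = {}
    map_superpage(pt, 0x4000000, 0x8000000, order=8)  # one 4 MiB superpage
    print(len(pt), "PTEs written for a single superpage")
    print(tlb_fill(pt, 0x4000000 + 0x12345))
```

The lookup at fault time is a single table access, but the mapping step
writes one PTE per base page, which is the memory cost questioned in the
reply that follows.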
I'm afraid you may be wasting a lot of extra memory by replicating these
PTEs.  (Take the example of one 4G large TLB entry, and assume a few
hundred processes are using that same physical page.)

> We have that concept exactly the way you want it, but the dress code 
> seems to be wrong. That can be worked on.
> Our goal was in the long run 2.7 to explore the Rice approach to see
> whether it yields benefits or whether we are getting down the road of 
> fragmentation reduction overhead that will kill all the benefits we get
> from reduced TLB overhead. Time would tell.
> 
> But to go down this route we need the concept of a superpage in the VM,
> not just at TLB time or a hack that throws these things over the fence. 
> 
As others have already said, you may want to support the smaller
superpages this way, with the VM embedded with some knowledge of the
different page sizes it can support.  Demoting and promoting pages from
one size to another (efficiently) will be very critical in the design.  In
my opinion, supporting the largest TLB sizes on archs (like 256M or 4G)
will need a more direct approach, and less intrusion from the kernel VM
will be preferred.  Of course, the kernel will need to add extra checks
etc. to maintain some sanity for allowed users.

There has already been a lot of discussion on this mailing list about what
the right approach is: whether new APIs are needed or something like
madvise would do, and whether the kernel should allocate large_pages
transparently to the user or we should expose the underlying HW feature to
user land.  There are issues that favor one approach over another.  But the
bottom line is: 1) we should not break anything semantically for regular
system calls that happen to be using large TLBs, and 2) the performance
advantage of this HW feature (on most of the archs, I hope) is just too
large to let go without notice.  I hope we reach consensus on getting this
support into the kernel ASAP.  This will benefit a lot of Linux users.
(And yes, I understand that we need to do things right in the kernel so
that we don't make unforeseen errors.)
> 
> > And no, I do not want separate coloring support in the allocator.
> > I think
> > coloring without superpage support is stupid and worthless (and
> > complicates the code for no good reason).
> >
> > 		Linus
> 
> That <stupid> seems premature. You are mixing the concept of 
> superpage from a TLB miss reduction perspective 
> with the concept of superpage for page coloring. 
> 
>
I have seen a couple of HPC apps that try to fit (configure) their data
sets to the L3 cache size (like 4M on IA-64).  I think these are the apps
that get hit hardest by the lack of proper page coloring support in the
Linux kernel.  The performance variation of these workloads from run to run
could be as much as 60%.  With the page coloring patch, these apps seem to
give consistently higher throughput.  (The really bad part is that once the
throughput of these workloads drops, it stays down thereafter :( )  But it
seems DavidM has enough real-world data that prohibits the use of this
approach in the kernel for real-world scenarios.  The good part of large
TLBs is that TLB entries larger than the CPU cache size will automatically
get you perfect page coloring ......... for free.
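Why a physically contiguous large page gives even coloring can be shown with
a small model.  This is a sketch under stated assumptions: a physically
indexed cache, a 4-way associativity and a 16 KiB base page are picked only
for illustration, and the strided allocation is a deliberately bad
hypothetical case.

```python
# Toy page-coloring model for a physically indexed cache.
# Cache geometry and page size are illustrative assumptions.

MIB = 1 << 20
KIB = 1 << 10


def num_colors(cache_bytes, associativity, page_bytes):
    """Number of distinct page colors: pages per cache way."""
    return max(1, cache_bytes // (associativity * page_bytes))


def color_histogram(pfns, colors):
    """Count how many of the given page frames land on each color."""
    hist = [0] * colors
    for pfn in pfns:
        hist[pfn % colors] += 1
    return hist


if __name__ == "__main__":
    colors = num_colors(4 * MIB, 4, 16 * KIB)      # 64 colors
    pages = (4 * MIB) // (16 * KIB)                # 256-page working set

    # One contiguous large page: consecutive frames cycle through all colors.
    contiguous = color_histogram(range(1024, 1024 + pages), colors)
    # Hypothetical worst case: every frame happens to share one color.
    strided = color_histogram(range(0, pages * colors, colors), colors)

    print("contiguous max per color:", max(contiguous))
    print("strided max per color:   ", max(strided))
```

In the contiguous case every color holds exactly as many pages as the
assumed associativity, so a cache-sized working set fits without conflict
misses.  In the strided case all pages compete for the same sets.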

rohit

* RE: large page patch (fwd) (fwd)
@ 2002-08-06 20:38 Luck, Tony
  2002-08-06 21:03 ` Hubertus Franke
  0 siblings, 1 reply; 110+ messages in thread
From: Luck, Tony @ 2002-08-06 20:38 UTC (permalink / raw)
  To: 'frankeh@watson.ibm.com', Seth, Rohit, Linus Torvalds
  Cc: linux-kernel, linux-mm

> > 4GB TLB entry size ??? 
> I assume you mean 4MB TLB entry size or did I fall
> into a coma for 10 years

That wasn't a typo ... Itanium2 supports page sizes up
to 4 Gigabytes.  Databases (well, Oracle for sure) want
to use those huge TLB entries to map their multi-gigabyte
shared memory areas.
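To see why such large entries matter for a multi-gigabyte mapping, here is a
small sketch that counts the TLB entries needed to cover a region at
different page sizes.  The 8 GiB region and the 16 KiB and 4 MiB comparison
sizes are assumptions for illustration, and the region is assumed to be
aligned to the page size.

```python
# Count the TLB entries needed to map a region at a given page size.
# Region size and comparison page sizes are illustrative assumptions.

GIB = 1 << 30
MIB = 1 << 20
KIB = 1 << 10


def tlb_entries_needed(region_bytes, page_bytes):
    """Ceiling division: entries required to cover an aligned region."""
    return (region_bytes + page_bytes - 1) // page_bytes


if __name__ == "__main__":
    region = 8 * GIB  # hypothetical shared memory area
    for label, size in (("16 KiB", 16 * KIB), ("4 MiB", 4 * MIB), ("4 GiB", 4 * GIB)):
        print(f"{label:>6} pages: {tlb_entries_needed(region, size)} entries")
```

With these numbers the same region needs 524288 entries at 16 KiB, 2048 at
4 MiB and only 2 at 4 GiB.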

-Tony


* RE: large page patch (fwd) (fwd)
@ 2002-08-09 17:51 Seth, Rohit
  0 siblings, 0 replies; 110+ messages in thread
From: Seth, Rohit @ 2002-08-09 17:51 UTC (permalink / raw)
  To: 'Daniel Phillips', Linus Torvalds
  Cc: frankeh, davidm, David Mosberger, David S. Miller, gh,
	Martin.Bligh, wli, linux-kernel



> -----Original Message-----
> From: Daniel Phillips [mailto:phillips@arcor.de]
> Sent: Friday, August 09, 2002 10:12 AM
> To: Linus Torvalds
> Cc: frankeh@watson.ibm.com; davidm@hpl.hp.com; David Mosberger;
> David S. Miller; gh@us.ibm.com; Martin.Bligh@us.ibm.com;
> wli@holomorphy.com; linux-kernel@vger.kernel.org
> Subject: Re: large page patch (fwd) (fwd)
> 
> 
> On Friday 09 August 2002 18:51, Linus Torvalds wrote:
> > On Fri, 9 Aug 2002, Daniel Phillips wrote:
> > > Slab allocations would not have GFP_DEFRAG (I mistakenly wrote
> > > GFP_LARGE earlier) and so would be allocated outside ZONE_LARGE.
> > 
> > .. at which point you then get zone balancing problems.
> > 
> > Or we end up with the same kind of special zone that we have _anyway_
> > in the current large-page patch, in which case the point of doing this
> > is what?
> 
> The current large-page patch doesn't have any kind of defragmentation
> in the special zone and that memory is just not available for other
> uses.  The thing is, when demand for large pages is low the zone should
> be allowed to fragment.
> 

You are right that as long as the pages are in the large page pool, they
are not available for other regular purposes.  The current implementation,
though, basically allows on-demand moving of pages between the large_page
pool and the other regular pools through a sysctl interface.  The movement
is not forced (in the sense that large pages are freed only if they are
available, and vice versa), so it will not be an issue where demand for
large pages is low.  Theoretically, you could extend this support in the
pageout daemon so that it finds out whether it can retrieve some free large
pages (for environments where most of the memory is expected to be used for
large pages but actual usage does not match the expectations; I doubt those
environments will occur, but bad configurations are always there).  The
current approach allows the large page/regular_page movement without too
much housecleaning.  It is likely that once a large page goes back to the
general pool, it will not be easy to replenish the large_page pool, because
of fragmentation in the regular memory pool (on memory-starved machines).
For scenarios where the machine sometimes runs low on regular memory and
sometimes on large_pages, it would probably be a good idea to add more RAM.
> 


end of thread, other threads:[~2002-08-22 12:06 UTC | newest]

Thread overview: 110+ messages
     [not found] <E17ahdi-0001RC-00@w-gerrit2>
2002-08-02 19:34 ` large page patch (fwd) (fwd) Linus Torvalds
2002-08-03  3:19   ` David Mosberger
2002-08-03  3:32     ` Linus Torvalds
2002-08-03  4:17       ` David Mosberger
2002-08-03  4:26         ` Linus Torvalds
2002-08-03  4:39           ` David Mosberger
2002-08-03  5:20             ` David S. Miller
2002-08-03 17:35               ` Linus Torvalds
2002-08-03 19:30                 ` David Mosberger
2002-08-03 19:43                   ` Linus Torvalds
2002-08-03 21:18                     ` David Mosberger
2002-08-03 21:54                       ` Hubertus Franke
2002-08-04  0:35                         ` David S. Miller
2002-08-04  2:25                           ` David Mosberger
2002-08-04 17:19                             ` Hubertus Franke
2002-08-09 15:20                               ` Daniel Phillips
2002-08-09 15:56                                 ` Linus Torvalds
2002-08-09 16:15                                   ` Daniel Phillips
2002-08-09 16:31                                     ` Rik van Riel
2002-08-09 18:08                                       ` Daniel Phillips
2002-08-09 16:51                                     ` Linus Torvalds
2002-08-09 17:11                                       ` Daniel Phillips
2002-08-09 16:27                                   ` Rik van Riel
2002-08-09 16:52                                     ` Linus Torvalds
2002-08-09 17:40                                       ` yodaiken
2002-08-09 19:15                                         ` Rik van Riel
2002-08-09 21:20                                           ` Linus Torvalds
2002-08-09 21:19                                         ` Marcin Dalecki
2002-08-09 17:46                                       ` Bill Rugolsky Jr.
2002-08-12  9:23                                     ` Helge Hafting
2002-08-13  3:15                                       ` Bill Davidsen
2002-08-13  3:31                                         ` Rik van Riel
2002-08-13  7:28                                         ` Helge Hafting
2002-08-09 21:38                                   ` Andrew Morton
2002-08-10 18:20                                     ` Eric W. Biederman
2002-08-10 18:59                                       ` Daniel Phillips
2002-08-10 19:55                                       ` Rik van Riel
2002-08-10 19:54                                         ` Eric W. Biederman
2002-08-09 18:32                                 ` Hubertus Franke
2002-08-09 18:43                                   ` Daniel Phillips
2002-08-09 19:17                                     ` Hubertus Franke
2002-08-11 20:30                                 ` Alan Cox
2002-08-11 22:33                                   ` Daniel Phillips
2002-08-11 22:55                                     ` Linus Torvalds
2002-08-11 22:56                                       ` Linus Torvalds
2002-08-11 23:36                                         ` William Lee Irwin III
2002-08-12  0:46                                         ` Alan Cox
2002-08-11 23:42                                           ` Rik van Riel
2002-08-11 23:50                                             ` Larry McVoy
2002-08-12  8:22                                               ` Daniel Phillips
2002-08-13  8:40                                                 ` Rob Landley
2002-08-13 15:06                                                   ` Alan Cox
2002-08-13 11:36                                                     ` Rob Landley
2002-08-13 16:51                                                       ` Linus Torvalds
2002-08-13 12:53                                                         ` Rob Landley
2002-08-13 17:14                                                         ` Ruth Ivimey-Cook
2002-08-13 17:29                                                         ` Rik van Riel
2002-08-13 13:18                                                           ` Rob Landley
2002-08-13 18:32                                                             ` Linus Torvalds
2002-08-13 13:50                                                               ` Rob Landley
2002-08-13 17:45                                                           ` Alexander Viro
2002-08-13 17:55                                                           ` Linus Torvalds
2002-08-13 17:59                                                             ` Rik van Riel
2002-08-13 13:35                                                               ` Rob Landley
2002-08-13 19:12                                                             ` Daniel Phillips
2002-08-22 12:03                                                           ` bill davidsen
     [not found]                                                       ` <Pine.LNX.4.44.0208130942130.7411-100000@home.transmeta.com >
2002-08-13 18:46                                                         ` large page patch (fwd) Mike Galbraith
2002-08-11 23:44                                           ` large page patch (fwd) (fwd) Daniel Phillips
2002-08-13  8:51                                             ` Rob Landley
2002-08-13 16:47                                               ` Daniel Phillips
2002-08-13 13:09                                                 ` Rob Landley
2002-08-11 23:15                                       ` Larry McVoy
2002-08-12  1:26                                         ` Linus Torvalds
2002-08-12  5:05                                           ` Larry McVoy
2002-08-12 10:31                                           ` Alan Cox
2002-08-04  0:28                 ` David S. Miller
2002-08-04 17:31                   ` Hubertus Franke
2002-08-04 18:38                     ` Linus Torvalds
2002-08-04 19:23                       ` Andrew Morton
2002-08-04 19:28                         ` Linus Torvalds
2002-08-05  5:42                           ` David S. Miller
2002-08-04 19:30                       ` Hubertus Franke
2002-08-04 20:23                         ` William Lee Irwin III
2002-08-05 16:59                         ` David Mosberger
2002-08-05 17:21                           ` Hubertus Franke
2002-08-05 21:10                             ` Jamie Lokier
2002-08-04 19:41                       ` Rik van Riel
2002-08-05  5:40                     ` David S. Miller
2002-08-03 18:41             ` Hubertus Franke
2002-08-03 19:39               ` Linus Torvalds
2002-08-04  0:32                 ` David S. Miller
2002-08-03 19:41               ` David Mosberger
2002-08-03 20:53                 ` Hubertus Franke
2002-08-03 21:26                   ` David Mosberger
2002-08-03 21:50                     ` Hubertus Franke
2002-08-04  0:34                   ` David S. Miller
2002-08-04  0:31                 ` David S. Miller
2002-08-04 17:25                   ` Hubertus Franke
     [not found] <200208041331.24895.frankeh@watson.ibm.com.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.44.0208041131380.10314-100000@home.transmeta.com.suse.lists.linux.kernel>
     [not found]   ` <3D4D7F24.10AC4BDB@zip.com.au.suse.lists.linux.kernel>
2002-08-04 20:20     ` Andi Kleen
2002-08-04 23:51       ` Eric W. Biederman
2002-08-05 23:30 Seth, Rohit
2002-08-06  5:01 ` David Mosberger
2002-08-06  4:58   ` David S. Miller
2002-08-06  5:19     ` David Mosberger
2002-08-06  5:08       ` David S. Miller
2002-08-06  5:32         ` David Mosberger
2002-08-06 19:11 ` Hubertus Franke
2002-08-06 20:38 Luck, Tony
2002-08-06 21:03 ` Hubertus Franke
2002-08-09 17:51 Seth, Rohit
