From: Martin Schwidefsky <schwidefsky@de.ibm.com>
To: Matt Mackall <mpm@selenic.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Gerald Schaefer <gerald.schaefer@de.ibm.com>,
	akpm@linux-foundation.org, Hugh Dickins <hugh@veritas.com>,
	Nick Piggin <npiggin@suse.de>
Subject: Re: [PATCH] fix/improve generic page table walker
Date: Thu, 12 Mar 2009 15:42:29 +0100
Message-ID: <20090312154229.3ee463eb@skybase>
In-Reply-To: <1236867014.3213.16.camel@calx>

On Thu, 12 Mar 2009 09:10:14 -0500
Matt Mackall <mpm@selenic.com> wrote:

> [Nick and Hugh, maybe you can shed some light on this for me]
> 
> On Thu, 2009-03-12 at 09:33 +0100, Martin Schwidefsky wrote:
> > On Wed, 11 Mar 2009 12:24:23 -0500
> > Matt Mackall <mpm@selenic.com> wrote:
> > 
> > > On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote:
> > > > From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > > > 
> > > > On s390 the /proc/pid/pagemap interface is currently broken. This is
> > > > caused by the unconditional loop over all pgd/pud entries as specified
> > > > by the address range passed to walk_page_range. The tricky bit here
> > > > is that the pgd++ in the outer loop may only be done if the page table
> > > > really has 4 levels. For the pud++ in the second loop the page table needs
> > > > to have at least 3 levels. With the dynamic page tables on s390 we can have
> > > > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
> > > > pud pointer can get out-of-bounds causing all kinds of mayhem.
> > > 
> > > Not sure why this should be a problem without delving into the S390
> > > code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in
> > > a way that's transparent to the walker.
> > 
> > It's hard to understand without looking at the s390 details. The main
> > difference between x86 and s390 in that respect is that on s390 the
> > number of page table levels is determined at runtime on a per process
> > basis. A compat process uses 2 levels, a 64 bit process starts with 3
> > levels and can "upgrade" to 4 levels if something gets mapped above
> > 4TB. Which means that a *pgd can point to a region-second (2**53 bytes),
> > a region-third (2**42 bytes) or a segment table (2**31 bytes), a *pud
> > can point to a region-third or a segment table. The page table
> > primitives know about this semantic, in particular pud_offset and
> > pmd_offset check the type of the page table pointed to by *pgd and *pud
> > and do nothing with the pointer if it is a lower level page table.
> > The only operation I can not "patch" is the pgd++/pud++ operation.
> 
> So in short, sometimes a pgd_t isn't really a pgd_t at all. It's another
> object with different semantics that generic code can trip over.

Then what exactly is a pgd_t? For me it is the top-level page table,
which can have a very different meaning on each architecture.
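
The type check described in the quoted text can be sketched in user-space
C. This is a toy model only, not the s390 primitives: the toy_* names and
the struct layout are hypothetical, and indexing into the tables is left
out to keep the descent logic visible.

```c
#include <assert.h>
#include <stddef.h>

/* Kind of table an entry lives in (hypothetical tags for this sketch). */
enum toy_kind { TOY_SEGMENT = 1, TOY_REGION3 = 2, TOY_REGION2 = 3 };

struct toy_entry {
    enum toy_kind kind;       /* table type this entry belongs to */
    struct toy_entry *lower;  /* next lower table, NULL in this sketch if none */
};

/* Descend only if *pgd really is a region-second entry; otherwise the
 * top level is already a lower table and the pointer is passed through. */
static struct toy_entry *toy_pud_offset(struct toy_entry *pgd)
{
    return pgd->kind == TOY_REGION2 ? pgd->lower : pgd;
}

/* Same idea one level down: descend only from a region-third entry. */
static struct toy_entry *toy_pmd_offset(struct toy_entry *pud)
{
    return pud->kind == TOY_REGION3 ? pud->lower : pud;
}

/* Walk from the top-level entry to the segment-table entry, whatever the
 * depth of the tree. A generic walker's pgd++ is plain pointer arithmetic
 * and has no such hook where the type could be checked. */
static struct toy_entry *toy_segment_entry(struct toy_entry *top)
{
    struct toy_entry *pmd = toy_pmd_offset(toy_pud_offset(top));

    assert(pmd != NULL && pmd->kind == TOY_SEGMENT);
    return pmd;
}
```

In this model the same two calls reach the segment table for a 2-, 3- or
4-level tree, which is why only the increment in the walker's loops is a
problem.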

> Can I get you to explain why this is necessary or even preferable to
> doing it the generic way where pgd_t has a fixed software meaning
> regardless of how many hardware levels are in play?

Well, the hardware can do up to 5 levels of page tables for the full
64-bit address space. With the introduction of puds we wanted to
extend our address space from 3 levels / 42 bits to 4 levels / 53 bits.
But this comes at a cost: additional page table levels cost memory and
performance. In particular, for compat processes, which can address
at most 2 GB, it is a waste to allocate 4 levels. With the dynamic
page tables we allocate only as many levels as each process requires.
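
The level/bit figures above follow from simple arithmetic. A minimal
sketch, assuming the usual s390 layout of 4 KB pages, 256-entry page
tables and 2048-entry segment and region tables (these sizes are not
stated in this mail; addr_bits and levels_needed are hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed s390 layout: 12 page-offset bits, 8 page-table index bits,
 * and 11 index bits for each segment or region table. */
enum { PAGE_BITS = 12, PTE_INDEX_BITS = 8, UPPER_INDEX_BITS = 11 };

/* Address bits covered by a tree with `levels` levels (2..5), where the
 * count includes the page table itself, as in the text. */
static unsigned int addr_bits(unsigned int levels)
{
    assert(levels >= 2 && levels <= 5);
    return PAGE_BITS + PTE_INDEX_BITS + UPPER_INDEX_BITS * (levels - 1);
}

/* Smallest level count, starting from `min_levels`, that can map the
 * byte address `addr`. The shift never reaches 64 because the loop
 * stops before evaluating addr_bits(5). */
static unsigned int levels_needed(uint64_t addr, unsigned int min_levels)
{
    unsigned int levels = min_levels;

    while (levels < 5 && (addr >> addr_bits(levels)) != 0)
        levels++;
    return levels;
}
```

Under these assumptions addr_bits() gives 31, 42, 53 and 64 bits for 2,
3, 4 and 5 levels, and levels_needed(1ULL << 42, 3) returns 4, matching
the "upgrade" once something is mapped above 4TB.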

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

