From: Martin Schwidefsky <schwidefsky@de.ibm.com> To: Matt Mackall <mpm@selenic.com> Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gerald Schaefer <gerald.schaefer@de.ibm.com>, akpm@linux-foundation.org, Hugh Dickins <hugh@veritas.com>, Nick Piggin <npiggin@suse.de> Subject: Re: [PATCH] fix/improve generic page table walker Date: Thu, 12 Mar 2009 15:42:29 +0100 [thread overview] Message-ID: <20090312154229.3ee463eb@skybase> (raw) In-Reply-To: <1236867014.3213.16.camel@calx> On Thu, 12 Mar 2009 09:10:14 -0500 Matt Mackall <mpm@selenic.com> wrote: > [Nick and Hugh, maybe you can shed some light on this for me] > > On Thu, 2009-03-12 at 09:33 +0100, Martin Schwidefsky wrote: > > On Wed, 11 Mar 2009 12:24:23 -0500 > > Matt Mackall <mpm@selenic.com> wrote: > > > > > On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote: > > > > From: Martin Schwidefsky <schwidefsky@de.ibm.com> > > > > > > > > On s390 the /proc/pid/pagemap interface is currently broken. This is > > > > caused by the unconditional loop over all pgd/pud entries as specified > > > > by the address range passed to walk_page_range. The tricky bit here > > > > is that the pgd++ in the outer loop may only be done if the page table > > > > really has 4 levels. For the pud++ in the second loop the page table needs > > > > to have at least 3 levels. With the dynamic page tables on s390 we can have > > > > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the > > > > pud pointer can get out-of-bounds causing all kinds of mayhem. > > > > > > Not sure why this should be a problem without delving into the S390 > > > code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in > > > a way that's transparent to the walker. > > > > Its hard to understand without looking at the s390 details. The main > > difference between x86 and s390 in that respect is that on s390 the > > number of page table levels is determined at runtime on a per process > > basis. A compat process uses 2 levels, a 64 bit process starts with 3 > > levels and can "upgrade" to 4 levels if something gets mapped above > > 4TB. Which means that a *pgd can point to a region-second (2**53 bytes), > > a region-third (2**42 bytes) or a segment table (2**31 bytes), a *pud > > can point to a region-third or a segment table. The page table > > primitives know about this semantic, in particular pud_offset and > > pmd_offset check the type of the page table pointed to by *pgd and *pud > > and do nothing with the pointer if it is a lower level page table. > > The only operation I can not "patch" is the pgd++/pud++ operation. > > So in short, sometimes a pgd_t isn't really a pgd_t at all. It's another > object with different semantics that generic code can trip over. Then what exactly is a pgd_t? For me it is the top level page table which can have very different meaning for the various architectures. > Can I get you to explain why this is necessary or even preferable to > doing it the generic way where pgd_t has a fixed software meaning > regardless of how many hardware levels are in play? Well, the hardware can do up to 5 levels of page tables for the full 64 bit address space. With the introduction of pud's we wanted to extend our address space from 3 levels / 42 bits to 4 levels / 53 bits. But this comes at a cost: additional page table levels cost memory and performance. In particular for the compat processes which can only address a maximum of 2 GB it is a waste to allocate 4 levels. With the dynamic page tables we allocate as much as required by each process. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
WARNING: multiple messages have this Message-ID (diff)
From: Martin Schwidefsky <schwidefsky@de.ibm.com> To: Matt Mackall <mpm@selenic.com> Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gerald Schaefer <gerald.schaefer@de.ibm.com>, akpm@linux-foundation.org, Hugh Dickins <hugh@veritas.com>, Nick Piggin <npiggin@suse.de> Subject: Re: [PATCH] fix/improve generic page table walker Date: Thu, 12 Mar 2009 15:42:29 +0100 [thread overview] Message-ID: <20090312154229.3ee463eb@skybase> (raw) In-Reply-To: <1236867014.3213.16.camel@calx> On Thu, 12 Mar 2009 09:10:14 -0500 Matt Mackall <mpm@selenic.com> wrote: > [Nick and Hugh, maybe you can shed some light on this for me] > > On Thu, 2009-03-12 at 09:33 +0100, Martin Schwidefsky wrote: > > On Wed, 11 Mar 2009 12:24:23 -0500 > > Matt Mackall <mpm@selenic.com> wrote: > > > > > On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote: > > > > From: Martin Schwidefsky <schwidefsky@de.ibm.com> > > > > > > > > On s390 the /proc/pid/pagemap interface is currently broken. This is > > > > caused by the unconditional loop over all pgd/pud entries as specified > > > > by the address range passed to walk_page_range. The tricky bit here > > > > is that the pgd++ in the outer loop may only be done if the page table > > > > really has 4 levels. For the pud++ in the second loop the page table needs > > > > to have at least 3 levels. With the dynamic page tables on s390 we can have > > > > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the > > > > pud pointer can get out-of-bounds causing all kinds of mayhem. > > > > > > Not sure why this should be a problem without delving into the S390 > > > code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in > > > a way that's transparent to the walker. > > > > Its hard to understand without looking at the s390 details. The main > > difference between x86 and s390 in that respect is that on s390 the > > number of page table levels is determined at runtime on a per process > > basis. A compat process uses 2 levels, a 64 bit process starts with 3 > > levels and can "upgrade" to 4 levels if something gets mapped above > > 4TB. Which means that a *pgd can point to a region-second (2**53 bytes), > > a region-third (2**42 bytes) or a segment table (2**31 bytes), a *pud > > can point to a region-third or a segment table. The page table > > primitives know about this semantic, in particular pud_offset and > > pmd_offset check the type of the page table pointed to by *pgd and *pud > > and do nothing with the pointer if it is a lower level page table. > > The only operation I can not "patch" is the pgd++/pud++ operation. > > So in short, sometimes a pgd_t isn't really a pgd_t at all. It's another > object with different semantics that generic code can trip over. Then what exactly is a pgd_t? For me it is the top level page table which can have very different meaning for the various architectures. > Can I get you to explain why this is necessary or even preferable to > doing it the generic way where pgd_t has a fixed software meaning > regardless of how many hardware levels are in play? Well, the hardware can do up to 5 levels of page tables for the full 64 bit address space. With the introduction of pud's we wanted to extend our address space from 3 levels / 42 bits to 4 levels / 53 bits. But this comes at a cost: additional page table levels cost memory and performance. In particular for the compat processes which can only address a maximum of 2 GB it is a waste to allocate 4 levels. With the dynamic page tables we allocate as much as required by each process. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-03-12 14:46 UTC|newest] Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2009-03-11 13:49 [PATCH] fix/improve generic page table walker Martin Schwidefsky 2009-03-11 13:49 ` Martin Schwidefsky 2009-03-11 17:24 ` Matt Mackall 2009-03-11 17:24 ` Matt Mackall 2009-03-12 8:33 ` Martin Schwidefsky 2009-03-12 8:33 ` Martin Schwidefsky 2009-03-12 10:19 ` Martin Schwidefsky 2009-03-12 10:19 ` Martin Schwidefsky 2009-03-12 11:24 ` Martin Schwidefsky 2009-03-12 11:24 ` Martin Schwidefsky 2009-03-12 14:10 ` Matt Mackall 2009-03-12 14:10 ` Matt Mackall 2009-03-12 14:42 ` Martin Schwidefsky [this message] 2009-03-12 14:42 ` Martin Schwidefsky 2009-03-12 15:58 ` Matt Mackall 2009-03-12 15:58 ` Matt Mackall 2009-03-16 12:27 ` Martin Schwidefsky 2009-03-16 12:27 ` Martin Schwidefsky 2009-03-16 12:36 ` Nick Piggin 2009-03-16 12:36 ` Nick Piggin 2009-03-16 12:55 ` Martin Schwidefsky 2009-03-16 12:55 ` Martin Schwidefsky
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20090312154229.3ee463eb@skybase \ --to=schwidefsky@de.ibm.com \ --cc=akpm@linux-foundation.org \ --cc=gerald.schaefer@de.ibm.com \ --cc=hugh@veritas.com \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mpm@selenic.com \ --cc=npiggin@suse.de \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.