linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hugh@veritas.com>
To: Dave McCracken <dmccr@us.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Linux Memory Management <linux-mm@kvack.org>
Subject: Re: [PATCH/RFC] Shared page tables
Date: Tue, 24 Jan 2006 17:50:17 +0000 (GMT)	[thread overview]
Message-ID: <Pine.LNX.4.61.0601241604550.4262@goblin.wat.veritas.com> (raw)
In-Reply-To: <6F40FCDC9FFDE7B6ACD294F5@[10.1.1.4]>

On Mon, 23 Jan 2006, Dave McCracken wrote:
> --On Friday, January 20, 2006 21:24:08 +0000 Hugh Dickins
> <hugh@veritas.com> wrote:
> 
> I'll look into getting some profiles.

Thanks, that will help everyone to judge the performance/complexity better.

> The pmd level is important for ppc because it works in segments, which are
> 256M in size and consist of a full pmd page.  The current implementation of
> the way ppc loads its tlb doesn't allow sharing at smaller than segment
> size.

Does that make pmd page sharing strictly necessary?  The way you describe
it, it sounds like it's merely that you "might as well" share pmd page,
because otherwise would always just waste memory on PPC.  But if it's
significantly less complex not to go to that level, it may be worth that
waste of memory (which would be a small fraction of what's now wasted at
the pte level, wouldn't it?).  Sorry for belabouring this point, which
may just be a figment of my ignorance, but you've not convinced me yet.

And maybe I'm exaggerating the additional complexity: you have, after
all, been resolute in treating the levels in the same way.  It's more
a matter of offputting patch size than complexity: imagine splitting
the patch into two, one to implement it at the pte level first, then
a second to take it up to pmd level, that would be better.

> I needed a function that returns a struct page for pgd and pud, defined in
> each architecture.  I decided the simplest way was to redefine pgd_page and
> pud_page to match pmd_page and pte_page.  Both functions are pretty much
> used one place per architecture, so the change is trivial.  I could come up
> with new functions instead if you think it's an issue.  I do have a bit of
> a fetish about symmetry across levels :)

Sounds to me like you made the right decision.

> >> +#define	pt_decrement_share(page)
> >> +#define pt_shareable_vma(vma)	(0)
> >> +#define	pt_unshare_range(vma, address, end)
> >> +#endif /* CONFIG_PTSHARE */
> > 
> > Please keep to "#define<space>MACRO<tab(s)definition" throughout:
> > easiest thing would be to edit the patch itself.
> 
> Sorry.  Done.  I thought the standard was "#define<TAB>" like all the other
> C code I've ever seen.  I didn't realize Linux does it different.

No, I can't claim "#define<space>" is a Linux standard: I happen to
prefer it myself, and it seems to predominate in the header files I've
most often added to; but I was only trying to remove the distracting
inconsistency from your patch, whichever way round.

> >>  static inline int copy_pmd_range(struct mm_struct *dst_mm, struct
> >>  mm_struct *src_mm, pud_t *dst_pud, pud_t *src_pud, struct
> >>  		vm_area_struct *vma,
> >> -		unsigned long addr, unsigned long end)
> >> +		unsigned long addr, unsigned long end, int shareable)
> >>  {
> > 
> > I'd have preferred not to add the shareable argument at each level,
> > and work it out here; or is that significantly less efficient?
> 
> My gut feeling is that the vma-level tests for shareability are significant
> enough that we only want to do them once, then pass the result down the
> call  stack.  I could change it if you disagree about the relative cost.

I now think cut out completely your mods from copy_page_range and its
subfunctions.  Given Nick's "Don't copy ptes..." optimization there,
what shareable areas will reach the lower levels?  Only the VM_PFNMAP
and VM_INSERTPAGE ones: which do exist, and may be large enough to
qualify; but they're not the areas you're interested in targetting,
so I'd say keep it simple and forget about shareability here.  The
simpler you keep your patch, the likelier it is to convince people.

And that leads on to another issue which occurred to me today, in
reading through your overnight replies.  Notice the check on anon_vma
in that optimization?  None of your "shareable" tests check anon_vma:
the vm_flags shared/write check may be enough in some contexts, but
not in all.  VM_WRITE can come and go, and even a VM_SHARED vma may
have anon pages in it (through Linus' strange ptrace case): you must
keep away from sharing pagetables once you've got anonymous pages in
your vma (well, you could share those pagetables not yet anonymized,
but that's just silly complexity).  And in particular, the prio_tree
loop next_shareable_vma needs to skip any vma with its anon_vma set:
at present you're (unlikely but) liable to offer entirely unsuitable
pagetables for sharing there.

(In the course of writing this, I've discovered that 2.6.16-rc
supports COWed hugepages: anonymous pages without any anon_vma.
I'd better get up to speed on those: maybe we'll want to allocate
an anon_vma just for the appropriate tests, maybe we'll want to
set another VM_flag, don't know yet... for the moment you ignore
them, and assume anon_vma is the thing to test for, as in 2.6.15.)

One other thing I forgot to comment on your patch: too many largish
inline functions - our latest fashion is only to say "inline" on the
one-or-two liners.

Hugh

  parent reply	other threads:[~2006-01-24 17:49 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-01-05 16:19 [PATCH/RFC] Shared page tables Dave McCracken
2006-01-07 12:25 ` Heiko Carstens
2006-01-07 18:09   ` Dave McCracken
2006-01-08 12:09     ` Heiko Carstens
2006-01-08 14:04       ` Dave McCracken
2006-01-13  5:15 ` Brian Twichell
2006-01-13 22:34   ` Ray Bryant
2006-01-17  4:50     ` Brian Twichell
2006-01-25  4:14   ` Brian Twichell
2006-01-13 15:18 ` Phillip Susi
2006-01-14 20:45   ` Brian Twichell
2006-01-17 23:53 ` Robin Holt
2006-01-18  0:17   ` Dave Hansen
2006-01-18  6:11     ` Dave McCracken
2006-01-18  1:27   ` Chen, Kenneth W
2006-01-18  3:32     ` Robin Holt
2006-01-23 23:58   ` Ray Bryant
2006-01-24  0:16     ` Ray Bryant
2006-01-24  0:39       ` Andi Kleen
2006-01-24  0:51         ` Dave McCracken
2006-01-24  1:11           ` Andi Kleen
2006-01-24  1:26             ` Dave McCracken
2006-01-24  0:53         ` Ray Bryant
2006-01-24  1:00           ` Dave McCracken
2006-01-24  1:10           ` Andi Kleen
2006-01-24  1:23             ` Benjamin LaHaise
2006-01-24  1:38               ` Andi Kleen
2006-01-24  7:08                 ` Arjan van de Ven
2006-01-24  7:06             ` Arjan van de Ven
2006-01-24  7:18               ` Andi Kleen
2006-01-27 18:16                 ` Martin Bligh
2006-02-01  9:49                 ` Nick Piggin
2006-01-24 14:48               ` Dave McCracken
2006-01-24 14:56                 ` Arjan van de Ven
2006-01-24  0:19     ` Dave McCracken
2006-01-24  0:46       ` Ray Bryant
2006-01-24 23:43       ` Ray Bryant
2006-01-24 23:50         ` Dave McCracken
2006-01-25  0:21           ` Ray Bryant
2006-01-25 22:48           ` Ray Bryant
2006-01-25 22:52             ` Dave McCracken
2006-01-26  0:16               ` Ray Bryant
2006-01-26  0:58               ` Ray Bryant
2006-01-26  4:06                 ` Robin Holt
2006-01-20 21:24 ` Hugh Dickins
2006-01-20 21:54   ` Chen, Kenneth W
2006-01-23 17:39   ` Dave McCracken
2006-01-23 20:19     ` Benjamin LaHaise
2006-01-24 17:50     ` Hugh Dickins [this message]
2006-01-24 18:07       ` Dave McCracken
2006-01-24 18:20         ` Hugh Dickins
2006-01-27 22:50   ` Brian Twichell
2006-01-30 18:46     ` Ray Bryant
2006-01-31 18:47       ` Brian Twichell
2006-01-31 19:18         ` Dave McCracken

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.61.0601241604550.4262@goblin.wat.veritas.com \
    --to=hugh@veritas.com \
    --cc=akpm@osdl.org \
    --cc=dmccr@us.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).