linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Robin Holt <holt@sgi.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Robin Holt <holt@sgi.com>, Nick Piggin <nickpiggin@yahoo.com.au>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <andrea@qumranet.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <clameter@sgi.com>,
	Jack Steiner <steiner@sgi.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	kvm-devel@lists.sourceforge.net,
	Kanoj Sarcar <kanojsarcar@yahoo.com>,
	Roland Dreier <rdreier@cisco.com>,
	Steve Wise <swise@opengridcomputing.com>,
	linux-kernel@vger.kernel.org, Avi Kivity <avi@qumranet.com>,
	linux-mm@kvack.org, general@lists.openfabrics.org,
	Hugh Dickins <hugh@veritas.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Anthony Liguori <aliguori@us.ibm.com>,
	Chris Wright <chrisw@redhat.com>,
	Marcelo Tosatti <marcelo@kvack.org>,
	Eric Dumazet <dada1@cosmosbay.com>,
	"Paul E. McKenney" <paulmck@us.ibm.com>
Subject: Re: [PATCH 08 of 11] anon-vma-rwsem
Date: Thu, 15 May 2008 06:01:48 -0500	[thread overview]
Message-ID: <20080515110147.GD10126@sgi.com> (raw)
In-Reply-To: <20080515075747.GA7177@wotan.suse.de>

We are pursuing Linus' suggestion currently.  This discussion is
completely unrelated to that work.

On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote:
> I'm not sure if you're thinking about what I'm thinking of. With the
> scheme I'm imagining, all you will need is some way to raise an IPI-like
> interrupt on the target domain. The IPI target will have a driver to
> handle the interrupt, which will determine the mm and virtual addresses
> which are to be invalidated, and will then tear down those page tables
> and issue hardware TLB flushes within its domain. On the Linux side,
> I don't see why this can't be done.

We would need to deposit the payload into a central location to do the
invalidate, correct?  That central location would either need to be
indexed by physical cpuid (65536 possible currently, UV will push that
up much higher) or some sort of global id which is difficult because
remote partitions can reboot giving you a different view of the machine
and running partitions would need to be updated.  Alternatively, that
central location would need to be protected by a global lock or atomic
type operation, but a majority of the machine does not have coherent
access to other partitions so they would need to use uncached operations.
Essentially, take away from this paragraph that it is going to be really
slow or really large.

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt.  Unfortunately, here we have a
thundering herd.  There could be up to 16256 processors interrupting the
same processor.  That will be a lot of work.  It will need to look up the
mm (without grabbing any sleeping locks in either xpmem or the kernel)
and do the tlb invalidates.

Unfortunately, the sending side is not free to continue (in most cases)
until it knows that the invalidate is completed.  So it will need to spin
waiting for a completion signal will could be as simple as an uncached
word.  But how will it handle the possible failure of the other partition?
How will it detect that failure and recover?  A timeout value could be
difficult to gauge because the other side may be off doing a considerable
amount of work and may just be backed up.

> Sure, you obviously would need to rework your code because it's been
> written with the assumption that it can sleep.

It is an assumption based upon some of the kernel functions we call
doing things like grabbing mutexes or rw_sems.  That pushes back to us.
I think the kernel's locking is perfectly reasonable.  The problem we run
into is we are trying to get from one context in one kernel to a different
context in another and the in-between piece needs to be sleepable.

> What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

XPMEM allows one process to make a portion of its virtual address range
directly addressable by another process with the appropriate access.
The other process can be on other partitions.  As long as Numa-link
allows access to the memory, we can make it available.  Userland has an
advantage in that the kernel entrance/exit code contains memory errors
so we can contain hardware failures (in most cases) to only needing to
terminate a user program and not lose the partition.  The kernel enjoys
no such fault containment so it can not safely directly reference memory.


Thanks,
Robin

  reply	other threads:[~2008-05-15 11:02 UTC|newest]

Thread overview: 106+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-07 14:35 [PATCH 00 of 11] mmu notifier #v16 Andrea Arcangeli
2008-05-07 14:35 ` [PATCH 01 of 11] mmu-notifier-core Andrea Arcangeli
2008-05-07 17:35   ` Rik van Riel
2008-05-07 20:02   ` Andrew Morton
2008-05-07 20:05   ` Andrew Morton
2008-05-07 20:30     ` Linus Torvalds
2008-05-07 21:58       ` Andrea Arcangeli
2008-05-07 22:11         ` Linus Torvalds
2008-05-07 22:27           ` Andrea Arcangeli
2008-05-07 22:31             ` [ofa-general] " Roland Dreier
2008-05-07 22:39               ` Andrea Arcangeli
2008-05-07 23:03                 ` Linus Torvalds
2008-05-07 22:37             ` Andrea Arcangeli
2008-05-07 23:38               ` Linus Torvalds
2008-05-07 23:00             ` Linus Torvalds
2008-05-07 14:35 ` [PATCH 02 of 11] get_task_mm Andrea Arcangeli
2008-05-07 15:59   ` Robin Holt
2008-05-07 16:20     ` Andrea Arcangeli
2008-05-07 14:35 ` [PATCH 03 of 11] invalidate_page outside PT lock Andrea Arcangeli
2008-05-07 17:39   ` Rik van Riel
2008-05-07 17:57     ` Andrea Arcangeli
2008-05-07 14:35 ` [PATCH 04 of 11] free-pgtables Andrea Arcangeli
2008-05-07 17:41   ` Rik van Riel
2008-05-07 14:35 ` [PATCH 05 of 11] unmap vmas tlb flushing Andrea Arcangeli
2008-05-07 17:46   ` Rik van Riel
2008-05-07 14:35 ` [PATCH 06 of 11] rwsem contended Andrea Arcangeli
2008-05-07 14:35 ` [PATCH 07 of 11] i_mmap_rwsem Andrea Arcangeli
2008-05-07 14:35 ` [PATCH 08 of 11] anon-vma-rwsem Andrea Arcangeli
2008-05-07 20:56   ` Linus Torvalds
2008-05-07 21:26     ` Andrea Arcangeli
2008-05-07 21:36       ` Linus Torvalds
2008-05-07 22:22         ` Andrea Arcangeli
2008-05-07 22:31           ` Andrew Morton
2008-05-07 22:44             ` Andrea Arcangeli
2008-05-07 22:59               ` Andrew Morton
2008-05-07 23:19                 ` Linus Torvalds
2008-05-07 23:39                   ` Christoph Lameter
2008-05-08  0:03                     ` Linus Torvalds
2008-05-08  0:52                       ` Robin Holt
2008-05-08  0:56                       ` Christoph Lameter
2008-05-08  1:07                         ` Linus Torvalds
2008-05-08  1:39                         ` Linus Torvalds
2008-05-08  1:52                           ` Andrea Arcangeli
2008-05-08  1:57                             ` Linus Torvalds
2008-05-08  2:24                               ` Andrea Arcangeli
2008-05-08  2:32                                 ` Linus Torvalds
2008-05-07 23:39                 ` Andrea Arcangeli
2008-05-08  1:02                   ` Linus Torvalds
2008-05-08  1:12                     ` Christoph Lameter
2008-05-08  1:32                       ` Linus Torvalds
2008-05-08  2:56                       ` Andrea Arcangeli
2008-05-08  3:10                         ` Christoph Lameter
2008-05-08  3:41                           ` Andrea Arcangeli
2008-05-08  4:14                             ` Linus Torvalds
2008-05-08  5:20                               ` Andrea Arcangeli
2008-05-08  5:27                                 ` Pekka Enberg
2008-05-08  5:30                                   ` Pekka Enberg
2008-05-08  5:49                                     ` Andrea Arcangeli
2008-05-08 15:03                                 ` Linus Torvalds
2008-05-08 16:11                                   ` Linus Torvalds
2008-05-08 22:01                                     ` Andrea Arcangeli
2008-05-09 18:37                                     ` Peter Zijlstra
2008-05-09 18:55                                       ` Andrea Arcangeli
2008-05-09 19:04                                         ` Peter Zijlstra
2008-05-08  1:26                     ` Andrea Arcangeli
2008-05-07 23:28               ` Benjamin Herrenschmidt
2008-05-07 23:45                 ` Andrea Arcangeli
2008-05-08  1:34                   ` Andrea Arcangeli
2008-05-13 12:14                     ` Nick Piggin
2008-05-14  5:43                       ` Benjamin Herrenschmidt
2008-05-14  6:06                         ` Nick Piggin
2008-05-14 13:15                         ` Jack Steiner
2008-05-07 22:44           ` Linus Torvalds
2008-05-07 22:58             ` Andrea Arcangeli
2008-05-07 23:02               ` Andrea Arcangeli
2008-05-07 23:09               ` Linus Torvalds
2008-05-08  0:38         ` Robin Holt
2008-05-08  0:55           ` Linus Torvalds
2008-05-13 12:06           ` Nick Piggin
2008-05-13 15:32             ` Robin Holt
2008-05-14  4:11               ` Nick Piggin
2008-05-14 11:26                 ` Robin Holt
2008-05-14 15:18                   ` Linus Torvalds
2008-05-14 16:22                     ` Robin Holt
2008-05-14 16:56                       ` Linus Torvalds
2008-05-14 17:57                     ` Christoph Lameter
2008-05-14 18:27                       ` Linus Torvalds
2008-05-17  1:38                         ` mm notifier: Notifications when pages are unmapped Christoph Lameter
2008-05-15  7:57                   ` [PATCH 08 of 11] anon-vma-rwsem Nick Piggin
2008-05-15 11:01                     ` Robin Holt [this message]
2008-05-15 11:12                       ` Avi Kivity
2008-05-15 17:33                     ` Christoph Lameter
2008-05-15 23:52                       ` Nick Piggin
2008-05-16 11:23                         ` Robin Holt
2008-05-16 11:50                           ` Robin Holt
2008-05-20  5:31                             ` Nick Piggin
2008-05-20 10:01                               ` Robin Holt
2008-05-20 10:50                                 ` Nick Piggin
2008-05-20 11:05                                   ` Robin Holt
2008-05-20 11:14                                     ` Nick Piggin
2008-05-20 11:26                                       ` Robin Holt
2008-05-07 22:42       ` Jack Steiner
2008-05-07 14:35 ` [PATCH 09 of 11] mm_lock-rwsem Andrea Arcangeli
2008-05-07 14:36 ` [PATCH 10 of 11] export zap_page_range for XPMEM Andrea Arcangeli
2008-05-07 14:36 ` [PATCH 11 of 11] mmap sems Andrea Arcangeli
  -- strict thread matches above, loose matches on Subject: below --
2008-05-02 15:05 [PATCH 00 of 11] mmu notifier #v15 Andrea Arcangeli
2008-05-02 15:05 ` [PATCH 08 of 11] anon-vma-rwsem Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080515110147.GD10126@sgi.com \
    --to=holt@sgi.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=aliguori@us.ibm.com \
    --cc=andrea@qumranet.com \
    --cc=avi@qumranet.com \
    --cc=chrisw@redhat.com \
    --cc=clameter@sgi.com \
    --cc=dada1@cosmosbay.com \
    --cc=general@lists.openfabrics.org \
    --cc=hugh@veritas.com \
    --cc=kanojsarcar@yahoo.com \
    --cc=kvm-devel@lists.sourceforge.net \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=marcelo@kvack.org \
    --cc=nickpiggin@yahoo.com.au \
    --cc=npiggin@suse.de \
    --cc=paulmck@us.ibm.com \
    --cc=rdreier@cisco.com \
    --cc=rusty@rustcorp.com.au \
    --cc=steiner@sgi.com \
    --cc=swise@opengridcomputing.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).