linux-kernel.vger.kernel.org archive mirror
From: ebiederm@xmission.com (Eric W. Biederman)
To: Christoph Hellwig <hch@infradead.org>
Cc: Terje Eggestad <terje.eggestad@scali.com>,
	Arjan van de Ven <arjanv@redhat.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	D.A.Fedorov@inp.nsk.su
Subject: Re: The disappearing sys_call_table export.
Date: 06 May 2003 01:30:35 -0600
Message-ID: <m17k94bkh0.fsf@frodo.biederman.org>
In-Reply-To: <20030505112531.B16914@infradead.org>

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, May 05, 2003 at 11:33:36AM +0200, Terje Eggestad wrote:
> > 1. performance is everything. 
> 
> then Linux is the wrong OS for you :)
> 
> > 2. We're making a MPI library, and as such we don't have any control
> > with the application. 
> 
> I can't remember that the MPI spec tells anything about intercepting
> syscalls..
> 
> > 3b. the performance loss from copying from a receive area to the
> > userspace buffer is unacceptable. 
> > 3c. It's therefore necessary for HW to access user pages. 
> > 4. In order to do 3, the user pages must be pinned down. 
> > 5. the way MPI is written, it's not using a special malloc() to allocate
> > the send receive buffers. It can't since it would break language binding
> > to fortran. Thus ANY writeable user page may be used. 

Looking at the MPI spec, there are two forms of point-to-point communication.
1) mpi_send/mpi_recv which do have that limitation.
2) mpi_put/mpi_get which are restricted to be used with a specifically
   allocated window, and the window can be restricted to areas allocated
   with mpi_alloc_mem.

So the mpi_put/mpi_get should be easy to optimize.

Handling mpi_send/mpi_recv is more difficult.  MPI allows the data
to be copied, it just does not require it, so in sufficiently weird
situations a copy slow path can be taken.
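
As a rough illustration of that fast path / slow path split, here is a
minimal user-space sketch in C.  Everything in it is hypothetical: the
names registered_region, choose_send_path and the bounce buffer are
invented for illustration and are not taken from any MPI library or
driver.  It assumes the card already knows about one contiguous region
and about a small bounce buffer, and it only decides which buffer would
be handed to the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: one contiguous region the card has already been told about. */
struct registered_region {
    unsigned char *base;
    size_t len;
};

enum send_path { SEND_ZERO_COPY, SEND_COPIED, SEND_TOO_LARGE };

/* Nonzero if [buf, buf + len) lies entirely inside the region. */
static int region_contains(const struct registered_region *r,
                           const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;
    uintptr_t base = (uintptr_t)r->base;

    if (start < base)
        return 0;
    size_t off = start - base;
    return off <= r->len && len <= r->len - off;
}

/*
 * Pick the buffer to hand to the hardware.  A buffer inside the
 * registered region is used directly (fast path).  Anything else is
 * copied into the bounce buffer, which is assumed to be registered
 * as well (slow path).
 */
static enum send_path choose_send_path(const struct registered_region *r,
                                       unsigned char *bounce, size_t bounce_len,
                                       const void *buf, size_t len,
                                       const void **wire_buf)
{
    assert(r != NULL && wire_buf != NULL);

    if (region_contains(r, buf, len)) {
        *wire_buf = buf;
        return SEND_ZERO_COPY;
    }
    if (len > bounce_len) {
        *wire_buf = NULL;       /* a real library would send in pieces */
        return SEND_TOO_LARGE;
    }
    memcpy(bounce, buf, len);
    *wire_buf = bounce;
    return SEND_COPIED;
}
```

A real implementation would split a large message across several
bounce-buffer sends instead of refusing it; that part is left out to
keep the sketch short.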

So there are really two questions here.
1)  What is a clean way to provide a high performance message
    passing layer, assuming you have a network card for which
    it is safe to mmap a subset of control registers?

2) What is a good way to map MPI onto that clean layer?

I believe the answer on how to do a clean safe interface is
to allocate the memory and tell the card about it in the driver,
and then allow user space to mmap it, with the driver's mmap operation
informing the network card of the mapping.
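
From the user-space side, that interface might look roughly like the
sketch below.  It is only an assumption about how such a driver could
be used: the helper names are invented, and no real device node is
opened.  When the file descriptor is negative the helper falls back to
an anonymous shared mapping, purely so the sketch can run on a machine
without the hardware.  With a real driver, the fd would come from
open() on the device file, and the driver's mmap operation would be the
place where the card learns about the mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of the driver-owned message buffer into this process.
 * fd is assumed to refer to a (hypothetical) device file whose mmap
 * operation exposes memory the card already knows about.  If fd is
 * negative, an anonymous shared mapping is used as a stand-in.
 * Returns NULL on failure.
 */
static void *map_message_buffer(int fd, size_t len)
{
    assert(len > 0);

    int flags = MAP_SHARED;
    if (fd < 0)
        flags |= MAP_ANONYMOUS;  /* stand-in: no device present */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Drop the mapping; returns 0 on success, as munmap does. */
static int unmap_message_buffer(void *p, size_t len)
{
    return munmap(p, len);
}
```

Because the memory is allocated and registered once by the driver, no
per-message pinning or system call is needed on the send and receive
paths; the cost is that the application's buffers have to live inside
this mapping.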

A good implementation of MPI on top of that is an interesting
question.  Replacing malloc and free and having everything run on
top of the mmapped buffer sounds like a possibility.  But it is
additionally desirable for the memory used by an MPI job to come
from hugetlbfs, or the equivalent, and I don't know if a driver
can provide huge pages.

At this point I am strongly tempted to see what it would take to come
up with an MPI-2.1 to fix this issue.

> so use get_user_pages.
> 
> > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > buffers every time they're used. 
> 
> Umm, pinning memory all the time means you get a bunch of nice DoS
> attacks due to the huge amount of memory.

I wonder if there is an easy way to optimize this if you don't have
swap configured.  In general it is a bug if an MPI job swaps.

In general there is one MPI process per CPU running on a machine, so
I have trouble seeing this as a denial of service.

> > 7. The only way to cache buffers (to see if they're used before and
> > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > to a device file is prohibitive expensive under point 1.  
> 
> That's a horribly b0rked approach..
> 
> Again, where's your driver source so we can help you to find a better
> approach out of that mess?

With some digging I can find the source for both quadrics and myrinet
drivers, and they have the same issues.  This is a general problem
for running MPI jobs so it is probably worth finding a solution that
works for those people whose source we can obtain.

Eric
