linux-kernel.vger.kernel.org archive mirror
* Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
       [not found] ` <3VNYt-4M4-15@gated-at.bofh.it>
@ 2005-04-22 13:10   ` Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>
  2005-04-22 17:01     ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation Fab Tillier
  0 siblings, 1 reply; 7+ messages in thread
From: Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org> @ 2005-04-22 13:10 UTC (permalink / raw)
  To: Andy Isaacson, Timur Tabi, Troy Benjegerdes, Bernhard Fischer,
	Arjan van de Ven, linux-kernel, openib-general

Andy Isaacson <adi@hexapodia.org> wrote:
> On Wed, Apr 20, 2005 at 10:07:45PM -0500, Timur Tabi wrote:

>> I don't know if VM_REGISTERED is a good idea or not, but it should be
>> absolutely impossible for the kernel to reclaim "registered" (aka pinned)
>> memory, no matter what. For RDMA services (such as Infiniband, iWARP, etc),
>> it's normal for non-root processes to pin hundreds of megabytes of memory,
>> and that memory better be locked to those physical pages until the
>> application deregisters them.
> 
> If you take the hardline position that "the app is the only thing that
> matters", your code is unlikely to get merged.  Linux is a
> general-purpose OS.

All userspace hardware drivers with DMA will require pinned pages (and some
of them will require contiguous memory). Since this memory may be scheduled
to be accessed by DMA, reclaiming those pages may (aka will) result in
"random" memory corruption unless done by the driver itself.

You can't even set a time limit, the driver may have allocated all DMA
memory to queued transfers, and some media needs to get plugged in by
the lazy robot. As soon as the robot arrives - boom. (For the same reason,
this memory MUST NOT be freed if the application terminates abnormally,
e.g. killed by OOM).

In other words, you need to make this memory as inaccessible as the
framebuffer on a graphics card. If that causes a lockup, you had better
have prevented it at allocation time.

> In a Linux context, I doubt that fullblown SA is necessary or
> appropriate.  Rather, I'd suggest two new signals, SIGMEMLOW and
> SIGMEMCRIT.  The userland comms library registers handlers for both.
> When the kernel decides that it needs to reclaim some memory from the
> app, it sends SIGMEMLOW.  The comms library then has the responsibility
> to un-reserve some memory in an orderly fashion.  If a reasonable [1]
> time has expired since SIGMEMLOW and the kernel is still hungry, the
> kernel sends SIGMEMCRIT.  At this point, the comms lib *must* unregister
> some memory [2] even if it has to drop state to do so; if it returns
> from the signal handler without having unregistered the memory, the
> kernel will SIGKILL.

Choosing data loss over a finitely stalled system may sometimes be a bad
decision.

If I designed an application that might get a "gimme memory or die",
I'd reserve an extra chunk of memory with the sole purpose of releasing
it in this situation. If the kernel had done that instead, this part of
memory could have been used e.g. as a read-only disk cache in the
meantime (of course, provided somebody cared to implement that).

> [2] Is there a way for the kernel to pass down to userspace how many
>     pages it wants, maybe in the sigcontext?

Then you'd need only one signal.

I think this interface is useful; it would e.g. allow a picture viewer
to cache as many decoded and scaled pictures as RAM permits, freeing
them when RAM fills up and swap would otherwise have to be used.
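
A minimal sketch of the single-signal variant, assuming the kernel could
attach a page count to the notification. No SIGMEMLOW exists in Linux, so
SIGUSR1 stands in for it here and the count travels in si_value via
sigqueue(); the cache layout and every function name are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for the hypothetical SIGMEMLOW; Linux defines no such signal. */
#define SIG_MEMLOW_STANDIN SIGUSR1
#define CACHE_SLOTS 8

static void *cache[CACHE_SLOTS];           /* droppable entries, one page each */
static volatile sig_atomic_t pages_wanted; /* written by the handler only */

/* Async-signal-safe handler: just record how many pages were requested. */
static void on_memlow(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    pages_wanted = info->si_value.sival_int;
}

/* Install the handler; returns 0 on success, -1 on error. */
static int memlow_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_memlow;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIG_MEMLOW_STANDIN, &sa, NULL);
}

/* Fill every empty slot with a page-sized buffer; returns slots in use. */
static int cache_fill(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int used = 0;

    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i])
            cache[i] = malloc((size_t)page);
        if (cache[i])
            used++;
    }
    return used;
}

/*
 * Called from the main loop, not from the handler: free up to the
 * requested number of entries and return how many were freed.
 * Simplification: a request arriving between the read and the reset
 * of pages_wanted is lost.
 */
static int cache_shrink_if_asked(void)
{
    int want = pages_wanted;
    int freed = 0;

    pages_wanted = 0;
    for (int i = 0; i < CACHE_SLOTS && freed < want; i++) {
        if (cache[i]) {
            free(cache[i]);
            cache[i] = NULL;
            freed++;
        }
    }
    return freed;
}

/* Simulate the kernel's request by queueing the signal to ourselves. */
static int memlow_simulate(int pages)
{
    union sigval v;

    assert(pages >= 0);
    v.sival_int = pages;
    return sigqueue(getpid(), SIG_MEMLOW_STANDIN, v);
}
```

The handler only stores the count; the actual freeing happens later in
normal context, which keeps the handler async-signal-safe. Whether a real
kernel interface would look anything like this is an open question.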

-- 
"When the pin is pulled, Mr. Grenade is not our friend."
-U.S. Marine Corps



* RE: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
  2005-04-22 13:10   ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>
@ 2005-04-22 17:01     ` Fab Tillier
  2005-04-22 22:01       ` Bodo Eggert
  0 siblings, 1 reply; 7+ messages in thread
From: Fab Tillier @ 2005-04-22 17:01 UTC (permalink / raw)
  To: 'Bodo Eggert
	<harvested.in.lkml@posting.7eggert.dyndns.org>',
	Andy Isaacson, Timur Tabi, Troy Benjegerdes, Bernhard Fischer,
	Arjan van de Ven, linux-kernel, openib-general

> From: Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>
> Sent: Friday, April 22, 2005 6:10 AM
> 
> All userspace hardware drivers with DMA will require pinned pages (and
> some of them will require contiguous memory). Since this memory may be
> scheduled to be accessed by DMA, reclaiming those pages may (aka will)
> result in "random" memory corruption unless done by the driver itself.

Any reclaim must involve the driver.  That doesn't mean that it must involve
the application.  That said, this isn't trivial to implement.

> 
> You can't even set a time limit, the driver may have allocated all DMA
> memory to queued transfers, and some media needs to get plugged in by
> the lazy robot. As soon as the robot arrives - boom. (For the same reason,
> this memory MUST NOT be freed if the application terminates abnormally,
> e.g. killed by OOM).

InfiniBand provides support for deregistering memory that might be
referenced at some future time by an RDMA operation.  The only side effect
this has is that the QPs on both sides of the connection transition to an
error state.

Upon abnormal termination, all registrations must be undone and the memory
unpinned.  This must be synchronized with the hardware so that there are no
races.  The IB deregistration semantics provide such synchronization.  I'd
venture that any HW design that does not do this is broken.

Requiring the memory to never be freed upon abnormal termination equates to
a serious memory leak, in that physical memory is leaked, not virtual.
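
A rough sketch of the bookkeeping this implies, assuming every registration
is owned by a per-open context whose release path runs on both normal close
and abnormal exit (the kernel calls a file's release method when the last
reference goes away, including at process exit). All structure and function
names are hypothetical, and hw_dereg_and_unpin() only does accounting where
a real driver would synchronize with the HCA and unpin pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One registered (pinned) region. */
struct mem_reg {
    void *addr;
    size_t len;
    struct mem_reg *next;
};

/* Per-open context: owns every registration made through it. */
struct reg_ctx {
    struct mem_reg *regs;
    size_t pinned_bytes;
};

/* Placeholder for "quiesce the hardware, then unpin the pages". */
static void hw_dereg_and_unpin(struct reg_ctx *ctx, struct mem_reg *r)
{
    assert(ctx->pinned_bytes >= r->len);
    ctx->pinned_bytes -= r->len;
}

/* Record a new registration; returns 0 on success, -1 on allocation failure. */
static int ctx_register(struct reg_ctx *ctx, void *addr, size_t len)
{
    struct mem_reg *r = malloc(sizeof *r);

    if (!r)
        return -1;
    r->addr = addr;
    r->len = len;
    r->next = ctx->regs;
    ctx->regs = r;
    ctx->pinned_bytes += len;
    return 0;
}

/* Release path: undo every outstanding registration; returns the count. */
static int ctx_release(struct reg_ctx *ctx)
{
    int released = 0;

    while (ctx->regs) {
        struct mem_reg *r = ctx->regs;

        ctx->regs = r->next;
        hw_dereg_and_unpin(ctx, r);
        free(r);
        released++;
    }
    return released;
}
```

Because the context holds the list, cleanup does not depend on the
application cooperating, only on the release path being invoked.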

- Fab



* RE: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
  2005-04-22 17:01     ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation Fab Tillier
@ 2005-04-22 22:01       ` Bodo Eggert
  0 siblings, 0 replies; 7+ messages in thread
From: Bodo Eggert @ 2005-04-22 22:01 UTC (permalink / raw)
  To: Fab Tillier
  Cc: 'Bodo Eggert
	<harvested.in.lkml@posting.7eggert.dyndns.org>',
	Andy Isaacson, Timur Tabi, Troy Benjegerdes, Bernhard Fischer,
	Arjan van de Ven, linux-kernel, openib-general

On Fri, 22 Apr 2005, Fab Tillier wrote:
> > From: Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>
> > Sent: Friday, April 22, 2005 6:10 AM

> > You can't even set a time limit, the driver may have allocated all DMA
> > memory to queued transfers, and some media needs to get plugged in by
> > the lazy robot. As soon as the robot arrives - boom. (For the same reason,
> > this memory MUST NOT be freed if the application terminates abnormally,
> > e.g. killed by OOM).
> 
> InfiniBand provides support for deregistering memory that might be
> referenced at some future time by an RDMA operation.  The only side effect
> this has is that the QPs on both sides of the connection transition to an
> error state.
> 
> Upon abnormal termination, all registrations must be undone and the memory
> unpinned.  This must be synchronized with the hardware so that there are no
> races.

That works only if you know the hardware. If you have userspace drivers,
this will be impossible, and even if you have kernel drivers, you'll need
to know which of them is responsible for each part of the pinned memory.

This doesn't imply that the affected memory is lost. The same application
that created the pinned memory can reset the hardware (provided nobody
changed the configuration), then reconnect to the shared memory segment
you'll use for that purpose and use or free it.

-- 
To iterate is human; to recurse, divine. 


* RE: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
  2005-04-25 23:13   ` Timur Tabi
  2005-04-25 23:17     ` Andrew Morton
@ 2005-04-25 23:29     ` Bob Woodruff
  1 sibling, 0 replies; 7+ messages in thread
From: Bob Woodruff @ 2005-04-25 23:29 UTC (permalink / raw)
  To: 'Timur Tabi'
  Cc: 'Andrew Morton',
	Davis, Arlin R, hch, linux-kernel, openib-general

Timur Tabi wrote,

>Any limit would have to be very high - definitely more than just half.
>What if the application needs to pin 2GB?  The customer is not going to
>buy 4+ GB of RAM just because Linux doesn't like pinning more than half.
>In an x86-32 system, that would require PAE support and slow everything
>down.

>Off the top of my head, I'd say Linux would need to allow all but 512MB
>to be pinned.  So if you have 3GB of RAM, Linux should allow you to pin
>2.5GB.

That is why we made it tunable, so that people could decide how much to allow.

There is probably a better way to do it than some hard limit, but
that would take a little more understanding of the VM system than we had,
and that is why some of the core kernel folks may be able to help us come up
with a better solution.

woody



* Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
  2005-04-25 23:13   ` Timur Tabi
@ 2005-04-25 23:17     ` Andrew Morton
  2005-04-25 23:29     ` Bob Woodruff
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2005-04-25 23:17 UTC (permalink / raw)
  To: Timur Tabi
  Cc: robert.j.woodruff, arlin.r.davis, hch, linux-kernel, openib-general

Timur Tabi <timur.tabi@ammasso.com> wrote:
>
> Bob Woodruff wrote:
> 
> > There definitely needs to be a mechanism to prevent people from pinning
> > too much memory. 
> 
> Any limit would have to be very high - definitely more than just half.  What if the
> application needs to pin 2GB?  The customer is not going to buy 4+ GB of RAM just because
> Linux doesn't like pinning more than half.  In an x86-32 system, that would require PAE
> support and slow everything down.
>
> Off the top of my head, I'd say Linux would need to allow all but 512MB to be pinned.  So
> if you have 3GB of RAM, Linux should allow you to pin 2.5GB.
> 

You can pin the whole darn lot *if you have the correct privileges*.
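
For comparison, a small sketch of the privilege-gated limit that already
exists for mlock(2): unprivileged processes are bounded by RLIMIT_MEMLOCK,
while CAP_IPC_LOCK lifts that bound. This is not the get_user_pages()
registration path under discussion, only an illustration of a limit tied to
privileges; the helper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Fetch the soft RLIMIT_MEMLOCK in bytes; 0 on success, -errno on failure. */
static int memlock_limit(rlim_t *out)
{
    struct rlimit rl;

    assert(out != NULL);
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -errno;
    *out = rl.rlim_cur;   /* may be RLIM_INFINITY */
    return 0;
}

/* Try to lock a buffer into RAM; 0 on success, -errno on failure. */
static int try_lock(void *buf, size_t len)
{
    if (mlock(buf, len) != 0)
        return -errno;
    return 0;
}

/* Undo try_lock(); 0 on success, -errno on failure. */
static int try_unlock(void *buf, size_t len)
{
    if (munlock(buf, len) != 0)
        return -errno;
    return 0;
}
```

A caller would typically compare the request size against memlock_limit()
first and treat a failure from try_lock() (commonly ENOMEM or EPERM) as
"limit reached" rather than as a fatal error.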


* Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
  2005-04-25 22:51 ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation Bob Woodruff
@ 2005-04-25 23:13   ` Timur Tabi
  2005-04-25 23:17     ` Andrew Morton
  2005-04-25 23:29     ` Bob Woodruff
  0 siblings, 2 replies; 7+ messages in thread
From: Timur Tabi @ 2005-04-25 23:13 UTC (permalink / raw)
  To: Bob Woodruff
  Cc: 'Andrew Morton',
	Davis, Arlin R, hch, linux-kernel, openib-general

Bob Woodruff wrote:

> There definitely needs to be a mechanism to prevent people from pinning
> too much memory. 

Any limit would have to be very high - definitely more than just half.  What if the
application needs to pin 2GB?  The customer is not going to buy 4+ GB of RAM just because
Linux doesn't like pinning more than half.  In an x86-32 system, that would require PAE
support and slow everything down.

Off the top of my head, I'd say Linux would need to allow all but 512MB to be pinned.  So
if you have 3GB of RAM, Linux should allow you to pin 2.5GB.
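
The arithmetic behind the two policies, as a small sketch: a "half of RAM"
cap versus an "all but a fixed reserve" cap. The function names and the MIB
macro are made up for illustration; only the 3GB, 2GB, and 512MB figures
come from this message.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (UINT64_C(1) << 20)

/* Cap under an "at most half of RAM" policy. */
static uint64_t pin_cap_half(uint64_t total_bytes)
{
    return total_bytes / 2;
}

/* Cap under an "everything except a fixed reserve" policy. */
static uint64_t pin_cap_reserve(uint64_t total_bytes, uint64_t reserve_bytes)
{
    return total_bytes > reserve_bytes ? total_bytes - reserve_bytes : 0;
}

/* Check the figures quoted above for a 3GB machine. */
static void check_thread_figures(void)
{
    uint64_t ram = 3072 * MIB;

    /* All but 512MB of 3GB leaves 2.5GB pinnable. */
    assert(pin_cap_reserve(ram, 512 * MIB) == 2560 * MIB);
    /* A half-of-RAM cap cannot satisfy a 2GB registration. */
    assert(pin_cap_half(ram) < 2048 * MIB);
}
```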

-- 
Timur Tabi
Staff Software Engineer
timur.tabi@ammasso.com

One thing a Southern boy will never say is,
"I don't think duct tape will fix it."
      -- Ed Smylie, NASA engineer for Apollo 13


* RE: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation
  2005-04-25 22:35 [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Andrew Morton
@ 2005-04-25 22:51 ` Bob Woodruff
  2005-04-25 23:13   ` Timur Tabi
  0 siblings, 1 reply; 7+ messages in thread
From: Bob Woodruff @ 2005-04-25 22:51 UTC (permalink / raw)
  To: 'Andrew Morton', Timur Tabi, Davis, Arlin R
  Cc: hch, linux-kernel, openib-general

 Andrew Morton wrote,
>Yes, we expect that all the pages which get_user_pages() pinned will become
>unpinned within the context of the syscall which pinned the pages.  Or
>shortly after, in the case of async I/O.

>This is because there is no file descriptor or anything else associated
>with the pages which permits the kernel to clean stuff up on unclean
>application exit.  Also there are the obvious issues with permitting
>pinning of unbounded amounts of memory.

There definitely needs to be a mechanism to prevent people from pinning
too much memory. We saw issues in the sourceforge stack and some of the
vendors' stacks where we could lock memory till the system hung.
In the sourceforge InfiniBand stack, we put in a
check to make sure that people did not pin too much memory.
It was sort of a crude/brute-force mechanism, but effective. I think that we
limited people from locking down more than 1/2 of kernel memory or
70% of all memory (it was tunable with a module option), and if they
exceeded the limit, their requests to register memory would begin to fail.
Arlin can provide details on how we did it, or people can look at the
IBAL code for an example.
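
A sketch of what such a tunable check could look like. This is not the IBAL
code, which is not reproduced in this thread; the names, the single
percentage knob, and the lack of locking are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tunable, analogous to a module option. */
static unsigned int pin_limit_percent = 70;

/* System-wide accounting of pinned pages (no locking in this sketch). */
struct pin_account {
    uint64_t total_pages;
    uint64_t pinned_pages;
};

/* Pages that may be pinned in total under the current tunable. */
static uint64_t pin_limit_pages(const struct pin_account *acct)
{
    /* Split the multiplication to avoid overflow on large page counts. */
    return acct->total_pages / 100 * pin_limit_percent +
           acct->total_pages % 100 * pin_limit_percent / 100;
}

/* Charge a registration; returns 0 on success, -1 if it would exceed the limit. */
static int pin_charge(struct pin_account *acct, uint64_t npages)
{
    uint64_t limit = pin_limit_pages(acct);

    if (npages > limit || acct->pinned_pages > limit - npages)
        return -1;
    acct->pinned_pages += npages;
    return 0;
}

/* Return pages to the pool on deregistration. */
static void pin_uncharge(struct pin_account *acct, uint64_t npages)
{
    assert(npages <= acct->pinned_pages);
    acct->pinned_pages -= npages;
}
```

With 1000 pages and the default 70 percent, registrations succeed until
700 pages are pinned and fail afterwards until something is deregistered.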

woody





end of thread

Thread overview: 7+ messages
     [not found] <3VAeQ-1To-7@gated-at.bofh.it>
     [not found] ` <3VNYt-4M4-15@gated-at.bofh.it>
2005-04-22 13:10   ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>
2005-04-22 17:01     ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation Fab Tillier
2005-04-22 22:01       ` Bodo Eggert
2005-04-25 22:35 [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Andrew Morton
2005-04-25 22:51 ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation Bob Woodruff
2005-04-25 23:13   ` Timur Tabi
2005-04-25 23:17     ` Andrew Morton
2005-04-25 23:29     ` Bob Woodruff
