linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hugh@veritas.com>
To: Timur Tabi <timur.tabi@ammasso.com>
Cc: Libor Michalek <libor@topspin.com>, Andrew Morton <akpm@osdl.org>,
	Andrea Arcangeli <andrea@suse.de>,
	linux-kernel@vger.kernel.org, openib-general@openib.org
Subject: Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation
Date: Sat, 7 May 2005 17:30:52 +0100 (BST)	[thread overview]
Message-ID: <Pine.LNX.4.61.0505071617470.5718@goblin.wat.veritas.com> (raw)
In-Reply-To: <427CD49E.6080300@ammasso.com>

On Sat, 7 May 2005, Timur Tabi wrote:
> 
> > Oh, well, maybe, but what is the real problem?
> > Are you sure that copy-on-write doesn't come into it?
> 
> No, but I do know that my test case doesn't call fork(), so it's reproducible
> without involving COW.  Of course, I'm sure someone's going to tell me now
> that COW comes into effect even without fork().  If so, please explain.

I'll try.  COW comes into effect whenever you're sharing a page and
then need to make private changes to it.  Fork is one way of sharing
(with ancestor and descendant processes).  Using the empty zero page
is another way of sharing (with all other processes and parts of your
own address space with a readonly page full of zeroes).  Using a file
page from the page cache is another way of sharing.

None of those is actually your case, but our test for whether a page
is shared has been inadequate: oversimplifying, if page_count is more
than 1 then we have to assume it is shared and do the copy-on-write
(if the modifications are to be private).  But there are various places
where the page_count is temporarily raised (e.g. while paging out),
which we cannot distinguish, so occasionally we'll copy on write even
when it's not necessary, but we lack the information to tell us so.

In particular, of course, get_user_pages raises page_count to pin
the page: so making a page appear shared when it's not shared at all.

> The short answer: under "extreme" memory pressure, the data inside a page
> pinned by get_user_pages() is swapped out, moved, or deleted (I'm not sure
> which).  Some other data is placed into that physical location.
> 
> By extreme memory pressure, I mean having the process allocate and touch as
> much memory as possible.  Something like this:
> 
> num_bytes = get_amount_of_physical_ram();
> char *p = malloc(num_bytes);
> for (i=0; i<num_bytes; i+=PAGE_SIZE)
> p[i] = 0;
> 
> The above over-simplified code fails on earlier 2.6 kernels (or earlier
> versions of glibc that accompany most distros the use the earlier 2.6
> kernels).  Either malloc() returns NULL, or the p[i]=0 part causes a segfault.
> I haven't bothered to trace down why.  But when it does work, the page pinned
> by get_user_pages() changes.

Which has to be a bug with get_user_pages, which has no other purpose
than to pin the pages.  I cannot criticize you for working around it
to get your app working on lots of releases, but what _we_ have to do
is fix get_user_pages - and it appears that Andrea did so a year ago.

I'm surprised if it's as simple as you describe (you do say over-
simplified, maybe the critical points have fallen out), since GUP
users would have complained long ago if it wasn't doing the job in
normal cases of memory pressure.  Andrea's case does involve the
process independently trying to touch a page it has pinned for I/O
with get_user_pages.  Or (and I've only just thought of this, suspect
it might be exactly your case) not touch, but apply get_user_pages
again to a page already so pinned (while memory pressure has caused
try_to_unmap_one temporarily to detach it from the user address space
- the aspect of the problem that Andrea's fix attacks).

> My understanding is that mlock() could in theory allow the page to be moved,
> but that currently nothing in the kernel would actually move it.  However,
> that could change in the future to allow hot-swapping of RAM.

That's my understanding too, that nothing currently does so.  Aside from
hot-swapping RAM, there's also a need to be able to migrate pages around
RAM, either to unfragment memory allowing higher-order allocations to
succeed more often, or to get around extreme dmamem/normal-mem/highmem
imbalances without dedicating huge reserves.  Those would more often
succeed if uninhibited by mlock.

> So I need to take into account distro vendors that use an earlier kernel, like
> 2.6.5, and back-port the patch from 2.6.7.  The distro vendor will keep the
> 2.6.5 version number, which is why I can't rely on it.
> 
> I need to know exactly what the fix is, so that when I scan mm/rmap.c, I know
> what to look for.  Currently, I look for this regex:
> 
> try_to_unmap_one.*vm_area_struct
> 
> which seems to work.  However, now I think it's just a coincidence.

Perhaps any release based on 2.6.7 or above, or any release which
mentions "get_user_pages" in its mm/rmap.c or mm/objrmap.c?

> > By the way, please don't be worried when soon the try_to_unmap_one
> > comment and code that you identified above disappear.  When I'm
> > back in patch submission mode, I'll be sending Andrew a patch which
> > removes it, instead reworking can_share_swap_page to rely on the
> > page_mapcount instead of page_count, which avoids the ironical
> > behaviour my comment refers to, and allows an awkward page migration
> > case to proceed (once unpinned).  Andrea and I now both prefer this
> > page_mapcount approach.
> 
> Ugh, that means my regex is probably going to break.  Not only that, but I
> don't understand what you're saying either.  Trying to understand the VM is
> really hard.

Sorry about that, but suiting your regex is low in our priorities for
VM design!  I was tempted to offer to keep a comment on get_user_pages
in mm/rmap.c after the change, but that's really rather babyish: just
assume 2.6.7 and upwards are fixed (or complain if you find not).

Perhaps I'll manage a clearer explanation when I come to write the
change description for the patch, we'll have to see.

> I guess in this specific case, it doesn't really matter, because calling
> mlock() when I should be calling get_user_pages() is not a bad thing.

If you can afford to keep that amount of memory mlocked, and have to
capability to do so, yes, it should do no harm.  We were just upset
to think that mlock was still needed to get around a get_user_pages
bug which was fixed a year ago.

Hugh

  reply	other threads:[~2005-05-07 16:30 UTC|newest]

Thread overview: 144+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-04 22:09 [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Roland Dreier
2005-04-04 22:09 ` [PATCH][RFC][1/4] IB: core changes for userspace verbs Roland Dreier
2005-04-04 22:09   ` [PATCH][RFC][2/4] IB: userspace verbs main module Roland Dreier
2005-04-04 22:09     ` [PATCH][RFC][3/4] IB: userspace verbs mthca changes Roland Dreier
2005-04-04 22:09       ` [PATCH][RFC][4/4] IB: userspace verbs Kconfig/Makefile changes Roland Dreier
2005-04-04 22:49       ` [openib-general] [PATCH][RFC][3/4] IB: userspace verbs mthca changes Tom Duffy
2005-04-04 23:34         ` Roland Dreier
2005-04-21  0:37       ` [PATCH][MTHCA] fix sparc build WAS: " Tom Duffy
2005-04-21  0:38         ` David S. Miller
2005-04-11 14:22 ` [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Troy Benjegerdes
2005-04-11 15:34   ` Roland Dreier
2005-04-11 16:33     ` Troy Benjegerdes
2005-04-11 16:56       ` Roland Dreier
2005-04-11 18:01         ` Troy Benjegerdes
2005-04-11 18:03           ` Roland Dreier
2005-04-12  0:13             ` Andrew Morton
2005-04-12  0:21               ` Roland Dreier
2005-04-12 18:23                 ` Michael S. Tsirkin
2005-04-13 18:28                   ` Roland Dreier
2005-04-13 19:32                     ` Andrew Morton
2005-04-13  1:04               ` [openib-general] " Libor Michalek
2005-04-18 17:15                 ` Timur Tabi
2005-04-26  3:31                 ` Libor Michalek
2005-05-04 18:27                   ` Timur Tabi
2005-05-05 18:48                     ` Timur Tabi
2005-05-06 23:08                       ` Timur Tabi
2005-05-07 13:18                         ` Hugh Dickins
2005-05-07 14:45                           ` Timur Tabi
2005-05-07 16:30                             ` Hugh Dickins [this message]
2005-05-11 20:12                               ` William Jordan
2005-05-11 20:42                                 ` Hugh Dickins
2005-05-11 22:52                                   ` Andrea Arcangeli
2005-05-11 22:49                                 ` Andrea Arcangeli
2005-05-11 22:53                                   ` Timur Tabi
2005-05-11 23:05                                     ` Andrea Arcangeli
2005-05-05 23:34                     ` Libor Michalek
2005-04-18 16:22               ` Timur Tabi
2005-04-18 16:43                 ` Christoph Hellwig
2005-04-18 16:45                   ` Timur Tabi
2005-04-24  2:44                     ` Andrew Morton
2005-04-24 14:23                       ` Timur Tabi
2005-04-24 20:53                         ` Greg KH
2005-04-24 21:52                           ` Timur Tabi
2005-04-25  1:03                             ` Greg KH
2005-04-25  4:12                               ` Timur Tabi
2005-04-25 13:30                                 ` Dave Hansen
2005-04-25 13:15                         ` Roland Dreier
2005-04-25 13:17                           ` Christoph Hellwig
2005-04-25 14:16                             ` Roland Dreier
2005-04-25 20:54                           ` Andrew Morton
2005-04-25 21:12                             ` Roland Dreier
2005-04-25 22:14                               ` Andrew Morton
2005-04-25 22:21                                 ` Timur Tabi
2005-04-25 22:32                                   ` Andrew Morton
2005-04-25 23:58                                     ` Roland Dreier
2005-04-26  0:11                                       ` Andrew Morton
2005-04-26  0:23                                         ` Roland Dreier
2005-04-26  0:37                                           ` Andrew Morton
2005-04-26  2:21                                             ` Timur Tabi
2005-04-26  3:16                                               ` Andrew Morton
2005-04-26  3:38                                                 ` Timur Tabi
2005-04-26  4:33                                                   ` Andrew Morton
2005-04-26 14:07                                                     ` Timur Tabi
2005-04-26 15:31                                             ` Roland Dreier
2005-04-26 15:42                                               ` [openib-general] " Libor Michalek
2005-04-26 15:49                                                 ` Roland Dreier
2005-04-26 19:28                                                   ` Andrew Morton
2005-04-26 20:14                                                     ` Roland Dreier
2005-04-26 20:18                                                       ` Timur Tabi
2005-04-26 20:37                                                         ` Andrew Morton
2005-04-29 14:26                                                           ` Bill Jordan
2005-04-29 15:56                                                             ` Caitlin Bestler
2005-04-29 16:45                                                               ` RDMA memory registration (was: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation) Roland Dreier
2005-04-29 17:23                                                                 ` Libor Michalek
2005-04-29 18:22                                                                 ` RDMA memory registration Brice Goglin
2005-04-29 18:31                                                                   ` Roland Dreier
2005-04-29 19:33                                                                   ` [openib-general] " Grant Grundler
2005-05-03  8:42                                                                     ` David Addison
2005-05-03 15:36                                                                       ` Grant Grundler
2005-04-29 19:43                                                                 ` RDMA memory registration (was: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation) Bill Jordan
2005-04-29 19:45                                                                   ` RDMA memory registration Roland Dreier
2005-04-29 17:04                                                               ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Libor Michalek
2005-04-30  0:31                                                                 ` Caitlin Bestler
2005-05-03 18:43                                                                   ` Andy Isaacson
2005-05-03 19:04                                                                     ` Caitlin Bestler
2005-05-04 18:22                                                                     ` William Jordan
2005-05-05  1:27                                                                       ` Rik van Riel
2005-05-05  1:57                                                                         ` Andy Isaacson
2005-04-26 20:32                                                       ` Andrew Morton
2005-04-26 21:23                                                         ` Roland Dreier
2005-04-27  0:05                                                           ` Andrew Morton
2005-04-27  2:13                                                             ` Roland Dreier
2005-04-27  3:21                                                             ` Caitlin Bestler
2005-04-27  3:15                                                     ` Caitlin Bestler
2005-04-26  2:03                                       ` IWAMOTO Toshihiro
2005-04-26  2:16                                         ` Timur Tabi
2005-04-26  2:26                                         ` [openib-general] " Stephen Langdon
2005-04-25 22:23                                 ` Timur Tabi
2005-04-25 22:35                                   ` Andrew Morton
2005-04-25 22:42                                     ` Timur Tabi
2005-04-25 23:13                                       ` Andrew Morton
2005-04-25 23:21                                         ` Timur Tabi
2005-04-25 23:27                                           ` Andrew Morton
2005-04-26  0:08                                         ` Roland Dreier
2005-04-25 22:51                                     ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbsimplementation Bob Woodruff
2005-04-25 23:13                                       ` Timur Tabi
2005-04-25 23:17                                         ` Andrew Morton
2005-04-25 23:29                                         ` Bob Woodruff
2005-04-25 23:17                                     ` [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation Libor Michalek
2005-04-25 23:24                                       ` Andrew Morton
2005-04-25 23:37                                         ` Caitlin Bestler
2005-04-26  0:10                                           ` Andrew Morton
2005-04-26  3:55                                         ` Libor Michalek
2005-04-26  0:02                                 ` Roland Dreier
2005-04-26  6:12                                   ` Christoph Hellwig
2005-04-26 13:45                                     ` [openib-general] " Caitlin Bestler
2005-04-26 15:24                                     ` Timur Tabi
2005-04-25 19:11                       ` Andy Isaacson
2005-04-18 16:09     ` Timur Tabi
2005-04-18 16:12       ` Roland Dreier
2005-04-18 16:50         ` Timur Tabi
2005-04-21 19:47           ` Pavel Machek
2005-04-18 16:16       ` Arjan van de Ven
2005-04-18 16:25         ` Timur Tabi
2005-04-18 19:40           ` Arjan van de Ven
2005-04-18 20:00             ` Timur Tabi
2005-04-18 20:05               ` Arjan van de Ven
2005-04-18 20:19                 ` Timur Tabi
2005-04-18 20:07             ` [openib-general] " Bernhard Fischer
2005-04-21  2:17               ` Troy Benjegerdes
2005-04-21  3:07                 ` Timur Tabi
2005-04-21 17:38                   ` Andy Isaacson
2005-04-21 18:39                     ` Timur Tabi
2005-04-21 19:56                       ` Andy Isaacson
2005-04-21 20:07                         ` Timur Tabi
2005-04-21 20:12                           ` Chris Wright
2005-04-21 20:14                             ` Timur Tabi
2005-04-21 20:25                               ` Chris Wright
2005-04-21 20:30                                 ` Arjan van de Ven
2005-04-22  6:14                           ` Greg KH
2005-04-22 17:55         ` Timur Tabi
2005-04-22 18:12           ` Arjan van de Ven
2005-04-29  0:56         ` Andrew Morton
     [not found] <3VAeQ-1To-7@gated-at.bofh.it>
     [not found] ` <3VNYt-4M4-15@gated-at.bofh.it>
2005-04-22 13:10   ` [openib-general] " Bodo Eggert <harvested.in.lkml@posting.7eggert.dyndns.org>

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.61.0505071617470.5718@goblin.wat.veritas.com \
    --to=hugh@veritas.com \
    --cc=akpm@osdl.org \
    --cc=andrea@suse.de \
    --cc=libor@topspin.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=openib-general@openib.org \
    --cc=timur.tabi@ammasso.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).