From: Mark Hairgrove <mhairgrove@nvidia.com>
To: Jerome Glisse <j.glisse@gmail.com>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"joro@8bytes.org" <joro@8bytes.org>,
	"Mel Gorman" <mgorman@suse.de>, "H. Peter Anvin" <hpa@zytor.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Johannes Weiner" <jweiner@redhat.com>,
	"Larry Woodman" <lwoodman@redhat.com>,
	"Rik van Riel" <riel@redhat.com>,
	"Dave Airlie" <airlied@redhat.com>,
	"Brendan Conoboy" <blc@redhat.com>,
	"Joe Donohue" <jdonohue@redhat.com>,
	"Duncan Poole" <dpoole@nvidia.com>,
	"Sherry Cheung" <SCheung@nvidia.com>,
	"Subhash Gutti" <sgutti@nvidia.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Lucien Dunning" <ldunning@nvidia.com>,
	"Cameron Buschardt" <cabuschardt@nvidia.com>,
	"Arvind Gopalakrishnan" <arvindg@nvidia.com>,
	"Haggai Eran" <haggaie@mellanox.com>,
	"Shachar Raindel" <raindel@mellanox.com>,
	"Liran Liss" <liranl@mellanox.com>,
	"Roland Dreier" <roland@purestorage.com>,
	"Ben Sander" <ben.sander@amd.com>,
	"Greg Stoner" <Greg.Stoner@amd.com>,
	"John Bridgman" <John.Bridgman@amd.com>,
	"Michael Mantor" <Michael.Mantor@amd.com>,
	"Paul Blinzer" <Paul.Blinzer@amd.com>,
	"Laurent Morichetti" <Laurent.Morichetti@amd.com>,
	"Alexander Deucher" <Alexander.Deucher@amd.com>,
	"Oded Gabbay" <Oded.Gabbay@amd.com>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Jatin Kumar" <jakumar@nvidia.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
Date: Thu, 11 Jun 2015 15:26:46 -0700
Message-ID: <alpine.DEB.2.00.1506111520350.25907@mdh-linux64-2.nvidia.com>
In-Reply-To: <20150611142313.GA26195@gmail.com>

On Thu, 11 Jun 2015, Jerome Glisse wrote:

> On Wed, Jun 10, 2015 at 06:15:08PM -0700, Mark Hairgrove wrote:
> 
> [...]
> > > There is no race here; the mirror struct will only be freed once, as again
> > > the list is a synchronization point. Whoever removes the mirror from the
> > > list is responsible for dropping the list reference.
> > > 
> > > In the fixed code the only thing that will happen twice is the ->release()
> > > callback. Even that can be worked around to guarantee it is called only once.
> > > 
> > > Anyway, I do not see any race here.
> > > 
> > 
> > The mirror lifetime is fine. The problem I see is with the device lifetime 
> > on a multi-core system. Imagine this sequence:
> > 
> > - On CPU1 the mm associated with the mirror is going down
> > - On CPU2 the driver unregisters the mirror then the device
> > 
> > When this happens, the last device mutex_unlock on CPU1 is the only thing 
> > preventing the free of the device in CPU2. That doesn't work, as described 
> > in this thread: https://lkml.org/lkml/2013/12/2/997
> > 
> > Here's the full sequence again with mutex_unlock split apart. Hopefully 
> > this shows the device_unregister problem more clearly:
> > 
> > CPU1 (mm release)                   CPU2 (driver)
> > ----------------------              ----------------------
> > hmm_notifier_release
> >   down_write(&hmm->rwsem);
> >   hlist_del_init(&mirror->mlist);
> >   up_write(&hmm->rwsem);
> > 
> >   // CPU1 thread is preempted or 
> >   // something
> >                                     hmm_mirror_unregister
> >                                       hmm_mirror_kill
> >                                         down_write(&hmm->rwsem);
> >                                         // mirror removed by CPU1 already
> >                                         // so hlist_unhashed returns 1
> >                                         up_write(&hmm->rwsem);
> > 
> >                                       hmm_mirror_unref(&mirror);
> >                                       // Mirror ref now 1
> > 
> >                                       // CPU2 thread is preempted or
> >                                       // something
> > // CPU1 thread is scheduled
> > 
> > hmm_mirror_unref(&mirror);
> >   // Mirror ref now 0, cleanup
> >   hmm_mirror_destroy(&mirror)
> >     mutex_lock(&device->mutex);
> >     list_del_init(&mirror->dlist);
> >     device->ops->release(mirror);
> >       kfree(mirror);
> >                                       // CPU2 thread is scheduled, now
> >                                       // both CPU1 and CPU2 are running
> > 
> >                                     hmm_device_unregister
> >                                       mutex_lock(&device->mutex);
> >                                         mutex_optimistic_spin()
> >     mutex_unlock(&device->mutex);
> >       [...]
> >       __mutex_unlock_common_slowpath
> > >         // CPU1 releases lock
> >         atomic_set(&lock->count, 1);
> >                                           // Spinning CPU2 acquires now-
> >                                           // free lock
> >                                       // mutex_lock returns
> >                                       // Device list empty
> >                                       mutex_unlock(&device->mutex);
> >                                       return 0;
> >                                     kfree(hmm_device);
> >         // CPU1 still accessing 
> >         // hmm_device->mutex in 
> >         //__mutex_unlock_common_slowpath
> 
> Ok, I see the race you are afraid of, and really it is an unlikely one.
> __mutex_unlock_common_slowpath() takes a spinlock right after allowing
> others to take the mutex. In your scenario there is no contention on
> that spinlock, so it is taken right away, and as there is no one on the
> mutex wait list it goes directly to unlocking the spinlock and
> returning. You can ignore the debug function: if debugging is enabled
> then mutex_lock() would also need to take the spinlock, and thus you
> would have proper synchronization between the 2 threads thanks to
> mutex.wait_lock.
> 
> So basically, while CPU1 is doing:
> spin_lock(mutex.wait_lock)
> if (!list_empty(mutex.wait_list)) {
>   // wait_list is empty so branch not taken
> }
> spin_unlock(mutex.wait_lock)
> 
> CPU2 would have to test the mirror list, mutex_unlock and return
> before the spin_unlock() of CPU1. This is a tight race. I can add a
> synchronize_rcu() to device_unregister after the mutex_unlock() so
> that we also add a grace period before the device is potentially freed,
> which should make that race extremely unlikely.
> 
> Moreover, for something really bad to happen the freed memory would
> need to be reallocated right away by some other thread, which really
> sounds unlikely unless CPU1 is the slowest of all :)
> 
> Cheers,
> Jerome
> 

But CPU1 could get preempted between the atomic_set and the 
spin_lock_mutex, and then it doesn't matter whether or not a grace period 
has elapsed before CPU2 proceeds.

Making race conditions less likely just makes them harder to pinpoint when 
they inevitably appear in the wild. I don't think it makes sense to spend 
any effort in making a race condition less likely, and that thread I 
referenced (https://lkml.org/lkml/2013/12/2/997) is fairly strong evidence 
that fixing this race actually matters. So, I think this race condition 
really needs to be fixed.

One fix is for hmm_mirror_unregister to wait for hmm_notifier_release 
completion between hmm_mirror_kill and hmm_mirror_unref. It can do this by 
calling synchronize_srcu() on the mmu_notifier's srcu. This has the 
benefit that the driver is guaranteed not to get the "mm is dead" callback 
after hmm_mirror_unregister returns.

In fact, are there any callbacks on the mirror that can arrive after 
hmm_mirror_unregister? If so, how will hmm_device_unregister solve them?

From a general standpoint, hmm_device_unregister must perform some kind of 
synchronization to be sure that all mirrors are completely released and 
done and no new callbacks will trigger. Since that has to be true, can't 
that synchronization be moved into hmm_mirror_unregister instead?

If that happens there's no need for a "mirror can be freed" ->release 
callback at all because the driver is guaranteed that a mirror is done 
after hmm_mirror_unregister.
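
To make the synchronous-unregister idea above concrete, here is a minimal 
userspace sketch, not kernel code. All sketch_* names are hypothetical and 
are not part of the HMM patch set. pthread_join() stands in for the role 
synchronize_srcu() would play: unregister does not return until the release 
path has completely finished, including its final mutex unlock, so the 
caller can free the mirror and the device right away.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for hmm_device / hmm_mirror; not the real HMM API. */
struct sketch_device {
	pthread_mutex_t mutex;		/* protects nr_mirrors */
	int nr_mirrors;
};

struct sketch_mirror {
	struct sketch_device *device;
	pthread_t release_thread;	/* models an in-flight hmm_notifier_release */
};

/* Models the "mm is going down" path: drops the mirror from its device. */
static void *sketch_notifier_release(void *arg)
{
	struct sketch_mirror *mirror = arg;
	struct sketch_device *device = mirror->device;

	pthread_mutex_lock(&device->mutex);
	device->nr_mirrors--;
	pthread_mutex_unlock(&device->mutex);
	return NULL;
}

/*
 * Models hmm_mirror_unregister with the proposed wait. Assumption: joining
 * the thread is the userspace analog of waiting on the mmu_notifier's srcu.
 * It returns only after the release path is entirely done with the device.
 */
static int sketch_mirror_unregister(struct sketch_mirror *mirror)
{
	return pthread_join(mirror->release_thread, NULL);
}

/* Runs one register/release/unregister cycle; returns mirrors left, or -1. */
static int sketch_run(void)
{
	struct sketch_device *device = malloc(sizeof(*device));
	struct sketch_mirror *mirror = malloc(sizeof(*mirror));
	int remaining = -1;

	if (!device || !mirror)
		goto out;
	pthread_mutex_init(&device->mutex, NULL);
	device->nr_mirrors = 1;
	mirror->device = device;

	if (pthread_create(&mirror->release_thread, NULL,
			   sketch_notifier_release, mirror) != 0)
		goto out_destroy;
	if (sketch_mirror_unregister(mirror) != 0)
		abort();	/* unexpected for a valid, joinable thread */

	/* The release path has fully finished, so nobody else touches device. */
	remaining = device->nr_mirrors;
	assert(remaining == 0);
out_destroy:
	pthread_mutex_destroy(&device->mutex);
out:
	/* No deferred ->release callback: the caller frees both objects itself. */
	free(mirror);
	free(device);
	return remaining;
}
```

In this sketch the frees at the end are safe only because of the wait in 
sketch_mirror_unregister(); without it, the release thread could still be 
inside pthread_mutex_unlock() when the device memory goes away, which is the 
shape of the race described earlier in this thread.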
