linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: "Nadav Amit" <nadav.amit@gmail.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Bernhard Held" <berny156@gmx.de>,
	"Adam Borowski" <kilobyte@angband.pl>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Wanpeng Li" <kernellwp@gmail.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Takashi Iwai" <tiwai@suse.de>, "Mike Galbraith" <efault@gmx.de>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	axie <axie@amd.com>, "Andrew Morton" <akpm@linux-foundation.org>
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
Date: Thu, 31 Aug 2017 00:17:39 +0200	[thread overview]
Message-ID: <20170830221739.GK13559@redhat.com> (raw)
In-Reply-To: <20170830204549.GA9445@redhat.com>

On Wed, Aug 30, 2017 at 04:45:49PM -0400, Jerome Glisse wrote:
> So i look at both AMD and Intel IOMMU. AMD always flush and current pte value
> do not matter AFAICT (i doubt that hardware rewalk the page table just to
> decide not to flush that would be terribly dumb for hardware engineer to do
> so).
> 
> Intel have a deferred flush mecanism, basicly if no device is actively using
> the page table then there is no flush (see deferred invalidation in [1]). But
> i am unsure about the hardware ie does it also means that when a PASID is not
> actively use then the IOMMU TLB is also invalid for that PASID. Also i am bit
> unsure about ATS/PASID specification in respect to having device always report
> when they are done with a given PASID (ie does the spec say that device tlb
> must be invalidated when device stop using a pasid).
> 
> https://www.intel.com/content/www/us/en/embedded/technology/virtualization/vt-directed-io-spec.html
> 
> So i think we can side with caution here and call invalidate_range() under the
> page table lock. If IOMMU folks see performance issue with real workload due
> to the double invalidation that take place then we can work on that.
> 
> KVM or XEN are not impacted by this as they only care about start/end with this
> patchset.
> 
> Andrea is that inline with your assessment ?

That is obviously safe. The mmu_notifier_invalidate_range()
calls under PT lock could always be removed later.

However I'm afraid I've to figure out if
mmu_notifier_invalidate_range() must run inside the PT lock
regardless, because that's just the very same problem of
->invalidate_page run outside the PT lock with a majority of
production kernels out there:

	pte_unmap_unlock(pte, ptl);
	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
		mmu_notifier_invalidate_page(mm, address);

So even if we take the easy route for mmu_notifier_invalidate_range,
I'm still forced to think about this issue.

Currently I tend to believe the old code is safe and in turn I'm more
inclined to skip the explicit mmu_notifier_invalidate_range() inside
the PT lock for amd_iommu_v2 and intel-svm, and add it later if it's
truly proven unsafe.

However as long as the reason for this is just to keep it simpler and
robustness, I don't mind either ways. Double call is not nice though
and not doing double call will mess the code more.

So I thought it was attractive to explain why it was safe not to run
mmu_notifier_invalidate_range() inside the PT lock, which would then
allow for the most strightforward and more optimal upstream
implementation (in addition of not having to fix the old code).

> It is for softdirty, we should probably invalidate_range() in that case i
> need to check how dirtyness is handled in ATS/PASID ie does device update
> the dirty bit of the CPU page table on write. Or maybe device don't update
> the dirty flag.

They both call handle_mm_fault and establish a readonly secondary MMU
mapping and then call handle_mm_fault again before there's a DMA to
memory, to establish a writable mapping and set the dirty bit here at
the first write fault post read fault:

	bool write = vmf->flags & FAULT_FLAG_WRITE;

	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
		goto unlock;

	entry = pmd_mkyoung(orig_pmd);
	if (write)
		entry = pmd_mkdirty(entry);
[..]
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);
	}

I doubt the PT lock is relevant for how the dirty bit gets set by
those two faulting-capable iommus.

All it matters is that post clear_refs_write there's a
mmu_notifier_invalidate_range_end so ->invalidate_range is called and
at the next write they call handle_mm_fault(FAULT_FLAG_WRITE) once
again. Where exactly the invalidate runs (i.e. post PT lock) I don't
see a big issue there, definitely clear_refs wouldn't change the
pte/hugepmd to point to a different page, so any coherency issue with
primary and secondary MMU temporarily being out of sync doesn't exist
there.

Kirill is the last committer to the handle_mm_fault line in both
drivers so it'd be great if he can double check to confirm that's the
way the dirty bit is handled and in turn causes no concern at
->invalidate_range time.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-08-30 22:17 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-29 23:54 [PATCH 00/13] mmu_notifier kill invalidate_page callback Jérôme Glisse
2017-08-29 23:54 ` [PATCH 01/13] dax: update to new mmu_notifier semantic Jérôme Glisse
2017-08-29 23:54 ` [PATCH 02/13] mm/rmap: " Jérôme Glisse
2017-08-30  2:46   ` Nadav Amit
2017-08-30  2:59     ` Jerome Glisse
2017-08-30  3:16       ` Nadav Amit
2017-08-30  3:18         ` Nadav Amit
2017-08-30 17:27     ` Andrea Arcangeli
2017-08-30 18:00       ` Nadav Amit
2017-08-30 21:25         ` Andrea Arcangeli
2017-08-30 23:25           ` Nadav Amit
2017-08-31  0:47             ` Jerome Glisse
2017-08-31 17:12               ` Andrea Arcangeli
2017-08-31 19:15                 ` Nadav Amit
2017-08-30 18:20       ` Jerome Glisse
2017-08-30 18:40         ` Nadav Amit
2017-08-30 20:45           ` Jerome Glisse
2017-08-30 22:17             ` Andrea Arcangeli [this message]
2017-08-30 20:55           ` Andrea Arcangeli
2017-08-30 16:52   ` Andrea Arcangeli
2017-08-30 17:48     ` Jerome Glisse
2017-08-30 21:53     ` Linus Torvalds
2017-08-30 23:01       ` Andrea Arcangeli
2017-08-31 18:25         ` Jerome Glisse
2017-08-31 19:40           ` Linus Torvalds
2017-08-29 23:54 ` [PATCH 03/13] powerpc/powernv: " Jérôme Glisse
2017-08-29 23:54 ` [PATCH 04/13] drm/amdgpu: " Jérôme Glisse
2017-08-30  6:18   ` Christian König
2017-08-29 23:54 ` [PATCH 05/13] IB/umem: " Jérôme Glisse
2017-08-30  6:13   ` Leon Romanovsky
2017-08-29 23:54 ` [PATCH 06/13] IB/hfi1: " Jérôme Glisse
2017-09-06 14:08   ` Arumugam, Kamenee
2017-08-29 23:54 ` [PATCH 07/13] iommu/amd: " Jérôme Glisse
2017-08-29 23:54 ` [PATCH 08/13] iommu/intel: " Jérôme Glisse
2017-08-29 23:54 ` [PATCH 09/13] misc/mic/scif: " Jérôme Glisse
2017-08-29 23:54 ` [PATCH 10/13] sgi-gru: " Jérôme Glisse
2017-08-29 23:54 ` [PATCH 11/13] xen/gntdev: " Jérôme Glisse
2017-08-30 19:32   ` Boris Ostrovsky
2017-08-29 23:54 ` [PATCH 12/13] KVM: " Jérôme Glisse
2017-08-29 23:54 ` [PATCH 13/13] mm/mmu_notifier: kill invalidate_page Jérôme Glisse
2017-08-30  0:11 ` [PATCH 00/13] mmu_notifier kill invalidate_page callback Linus Torvalds
2017-08-30  0:56   ` Jerome Glisse
2017-08-30  8:40     ` Mike Galbraith
2017-08-30 14:57     ` Adam Borowski
2017-09-01 14:47       ` Jeff Cook
2017-09-01 14:50         ` taskboxtester
2017-11-30  9:33 ` BSOD with " Fabian Grünbichler
2017-11-30 11:20   ` Paolo Bonzini
2017-11-30 16:19     ` Radim Krčmář
2017-11-30 18:05       ` [PATCH 1/2] KVM: x86: fix APIC page invalidation Radim Krčmář
2017-11-30 18:05         ` [PATCH 2/2] TESTING! KVM: x86: add invalidate_range mmu notifier Radim Krčmář
2017-12-01 15:15           ` Paolo Bonzini
2017-12-03 17:24             ` Andrea Arcangeli
2017-12-01 12:21         ` [PATCH 1/2] KVM: x86: fix APIC page invalidation Fabian Grünbichler
2017-12-01 15:27         ` Paolo Bonzini
2017-12-03 17:28         ` Andrea Arcangeli
2017-12-06  2:32         ` Wanpeng Li
2017-12-06  9:50           ` 王金浦
2017-12-06 10:00             ` Paolo Bonzini
2017-12-06  8:15         ` Fabian Grünbichler
2017-12-13 12:54         ` Richard Purdie

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170830221739.GK13559@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axie@amd.com \
    --cc=berny156@gmx.de \
    --cc=dan.j.williams@intel.com \
    --cc=efault@gmx.de \
    --cc=jglisse@redhat.com \
    --cc=kernellwp@gmail.com \
    --cc=kilobyte@angband.pl \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nadav.amit@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=tiwai@suse.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).