Re: [RFC PATCH 5/5] mm proc/task_mmu.c: add hugetlb specific routine for clear_refs

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-s390@vger.kernel.org, shu wang <malate_wangshu@hotmail.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Michel Lespinasse <walken@google.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH 5/5] mm proc/task_mmu.c: add hugetlb specific routine for clear_refs
Date: Thu, 18 Feb 2021 16:14:30 -0800	[thread overview]
Message-ID: <c356f931-8400-b3c8-73b0-2537815d7a6b@oracle.com> (raw)
In-Reply-To: <20210217202518.GA19238@xz-x1>

On 2/17/21 12:25 PM, Peter Xu wrote:
> On Wed, Feb 10, 2021 at 04:03:22PM -0800, Mike Kravetz wrote:
>> There was is no hugetlb specific routine for clearing soft dirty and
>> other referrences.  The 'default' routines would only clear the
>> VM_SOFTDIRTY flag in the vma.
>>
>> Add new routine specifically for hugetlb vmas.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  fs/proc/task_mmu.c | 110 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 110 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 829b35016aaa..f06cf9b131a8 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1116,6 +1116,115 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>>  }
>>  #endif
>>  
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +static inline bool huge_pte_is_pinned(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t pte)
>> +{
>> +	struct page *page;
>> +
>> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
>> +		return false;
>> +	page = pte_page(pte);
>> +	if (!page)
>> +		return false;
>> +	return page_maybe_dma_pinned(page);
>> +}
>> +
>> +static int clear_refs_hugetlb_range(pte_t *ptep, unsigned long hmask,
>> +				unsigned long addr, unsigned long end,
>> +				struct mm_walk *walk)
>> +{
>> +	struct clear_refs_private *cp = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	struct hstate *h = hstate_vma(walk->vma);
>> +	unsigned long adj_start = addr, adj_end = end;
>> +	spinlock_t *ptl;
>> +	pte_t old_pte, pte;
>> +
>> +	/*
>> +	 * clear_refs should only operate on complete vmas.  Therefore,
>> +	 * values passed here should be huge page aligned and huge page
>> +	 * size in length.  Quick validation before taking any action in
>> +	 * case upstream code is changed.
>> +	 */
>> +	if ((addr & hmask) != addr || end - addr != huge_page_size(h)) {
>> +		WARN_ONCE(1, "%s passed unaligned address\n", __func__);
>> +		return 1;
>> +	}
> 
> I wouldn't worry too much on the interface change - The one who will change the
> interface should guarantee all existing hooks will still work, isn't it? :)

Yeah, I can drop this.

> It's slightly confusing to me on why "clear_refs should only operate on
> complete vmas" is related to the check, though.

Mostly me thinking that since it is operating on complete (hugetlb) vma,
then we know vms is huge page aligned and a multiple of huge page in size.
So, all passed addressed should be huge page aligned as well.

> 
>> +
>> +	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, ptep);
>> +
>> +	/* Soft dirty and pmd sharing do not mix */
> 
> Right, this seems to be a placeholder for unsharing code.

Sorry, comment was left over from earlier code.  Unsharing is actually
done below, I forgot to remove comment.

> Though maybe we can do that earlier in pre_vma() hook?  That should be per-vma
> rather than handling one specific huge page here, hence more efficient imho.

Yes, let me look into that.  The code below is certianly not the most
efficient.

> this reminded me that I should also better move hugetlb_unshare_all_pmds() of
> my other patch into hugetlb.c, so that this code can call it.  Currently it's a
> static function in userfaultfd.c.
> 
>> +
>> +	pte = huge_ptep_get(ptep);
>> +	if (!pte_present(pte))
>> +		goto out;
>> +	if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
>> +		goto out;
>> +
>> +	if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
> 
> Maybe move this check into clear_refs_test_walk()?  We can bail out earlier if:
> 
>       (is_vm_hugetlb_page(vma) && (type != CLEAR_REFS_SOFT_DIRTY))
> 

Yes, we can do that.  I was patterning this after the other 'clear_refs'
routines.  But, they can clear things besides soft dirty.  Since soft
dirty is the only thing handled for hugetlb, we can bail earlier.

>> +		if (huge_pte_is_pinned(vma, addr, pte))
>> +			goto out;
> 
> Out of topic of this patchset, but it's definitely a pity that we can't track
> soft dirty for pinned pages.  Currently the assumption of the pte code path is:
> "if this page can be DMA written then we won't know whether data changed after
> all, then tracking dirty is meaningless", however that's prone to change when
> new hardwares coming, say, IOMMU could start to trap DMA writes already.
> 
> But again that's another story.. and we should just follow what we do with
> non-hugetlbfs for sure here, until some day if we'd like to revive soft dirty
> tracking with pinned pages.
> 
>> +
>> +		/*
>> +		 * soft dirty and pmd sharing do not work together as
>> +		 * per-process is tracked in ptes, and pmd sharing allows
>> +		 * processed to share ptes.  We unshare any pmds here.
>> +		 */
>> +		adjust_range_if_pmd_sharing_possible(vma, &adj_start, &adj_end);
> 
> Ideally when reach here, huge pmd sharing won't ever exist, right?  Then do we
> still need to adjust the range at all?

Right, we should be able to do it earlier.

Thanks again for taking a look at this.
-- 
Mike Kravetz

> 
> Thanks,
>