From: David Hildenbrand <david@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, songmuchun@bytedance.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page
Date: Thu, 25 Aug 2022 09:25:31 +0200
Message-ID: <887ca2e2-a7c5-93a7-46cb-185daccd4444@redhat.com>
In-Reply-To: <Ywa1jp/6naTmUh42@monkey>

> Is the primary concern the locking?  If so, I am not sure we have an issue.
> As mentioned in your commit message, the current code will use
> pte_offset_map_lock().  pte_offset_map_lock() uses pte_lockptr(), and
> pte_lockptr() returns either the mm-wide lock or the pmd_page lock.  To
> me, it seems that either would provide correct synchronization for
> CONT-PTE entries.  Am I missing something or misreading the code?
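
For reference, a trimmed sketch of what pte_lockptr() resolves to
(simplified from include/linux/mm.h; the ALLOC_SPLIT_PTLOCKS
indirection is omitted):

#if USE_SPLIT_PTE_PTLOCKS
static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	/* split ptlocks: one lock per PTE page table, kept in its struct page */
	return ptlock_ptr(pmd_page(*pmd));
}
#else
static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	/* no split ptlocks: single mm-wide page table lock */
	return &mm->page_table_lock;
}
#endif

Either way, the lock covers the complete PTE page table the CONT-PTE
entries live in, which seems to match your reasoning.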
> 
> I started looking at code cleanup suggested by David.  Here is a quick
> patch (not tested and likely containing errors) to see if this is a step
> in the right direction.
> 
> I like it because we get rid of/combine all those follow_huge_p*d
> routines.
> 

Yes, see comments below.

> From 35d117a707c1567ddf350554298697d40eace0d7 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Wed, 24 Aug 2022 15:59:15 -0700
> Subject: [PATCH] hugetlb: call hugetlb_follow_page_mask for hugetlb pages in
>  follow_page_mask
> 
> At the beginning of follow_page_mask, there currently is a call to
> follow_huge_addr which 'may' handle hugetlb pages.  ia64 is the only
> architecture which (incorrectly) provides a follow_huge_addr routine
> that does not just return an error.  Instead, at each level of the page table a
> check is made for a hugetlb entry.  If a hugetlb entry is found, a call
> to a routine associated with that page table level such as
> follow_huge_pmd is made.
> 
> All the follow_huge_p*d routines are basically the same.  In addition,
> the huge page size can be derived from the vma, so we know where in the
> page table a huge page would reside.  So, replace follow_huge_addr with
> a new architecture-independent routine which will provide the same
> functionality as the follow_huge_p*d routines.  We can then eliminate
> the p*d_huge checks in follow_page_mask page table walking as well as
> the follow_huge_p*d routines themselves.
>
> follow_page_mask still has is_hugepd hugetlb checks during page table
> walking.  This is due to these checks and follow_huge_pd being
> architecture-specific.  These can be eliminated if
> hugetlb_follow_page_mask can be overridden by architectures (powerpc)
> that need to do follow_huge_pd processing.

But won't the

> +	/* hugetlb is special */
> +	if (is_vm_hugetlb_page(vma))
> +		return hugetlb_follow_page_mask(vma, address, flags);

code route everything via hugetlb_follow_page_mask(), so that all these
(beloved) hugepd checks would essentially be unreachable?
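
For reference, the checks in question look roughly like this
(abbreviated from follow_pmd_mask() in mm/gup.c, quoting from memory):

	if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) {
		page = follow_huge_pmd(mm, address, pmd, flags);
		if (page)
			return page;
		return no_page_table(vma, flags);
	}
	if (is_hugepd(__hugepd(pmd_val(pmdval)))) {
		page = follow_huge_pd(vma, address,
				      __hugepd(pmd_val(pmdval)), flags,
				      PMD_SHIFT);
		if (page)
			return page;
		return no_page_table(vma, flags);
	}

With the early is_vm_hugetlb_page() return in place, neither branch can
be reached for hugetlb VMAs anymore.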

At least my understanding is that hugepd only applies to hugetlb.

Can't we move the hugepd handling code into hugetlb_follow_page_mask()
as well?

I mean, doesn't follow_hugetlb_page() also have to handle that hugepd
stuff already ... ?

[...]

>  
> +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> +				unsigned long address, unsigned int flags)
> +{
> +	struct hstate *h = hstate_vma(vma);
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long haddr = address & huge_page_mask(h);
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
> +	pte_t *pte, entry;
> +
> +	/*
> +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> +	 * follow_hugetlb_page().
> +	 */
> +	if (WARN_ON_ONCE(flags & FOLL_PIN))
> +		return NULL;
> +
> +	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> +	if (!pte)
> +		return NULL;
> +
> +retry:
> +	ptl = huge_pte_lock(h, mm, pte);
> +	entry = huge_ptep_get(pte);
> +	if (pte_present(entry)) {
> +		page = pte_page(entry);
> +		/*
> +		 * try_grab_page() should always succeed here, because we hold
> +		 * the ptl lock and have verified pte_present().
> +		 */
> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> +			page = NULL;
> +			goto out;
> +		}
> +	} else {
> +		if (is_hugetlb_entry_migration(entry)) {
> +			spin_unlock(ptl);
> +			__migration_entry_wait_huge(pte, ptl);
> +			goto retry;
> +		}
> +		/*
> +		 * hwpoisoned entry is treated as no_page_table in
> +		 * follow_page_mask().
> +		 */
> +	}
> +out:
> +	spin_unlock(ptl);
> +	return page;
> +}


This is neat and clean enough that we don't need to reuse
follow_hugetlb_page(). I wonder if we want to add a comment to the
function explaining how it differs from follow_hugetlb_page().
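
Something along these lines, as a rough sketch (wording is only a
suggestion):

/*
 * hugetlb_follow_page_mask(): the follow_page()/follow_page_mask()
 * entry point for hugetlb VMAs: look up a single page and try to grab
 * a reference, but never fault.  Ordinary (range) GUP, including
 * FOLL_PIN, goes via follow_hugetlb_page() instead.
 */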

... or do we maybe want to rename follow_hugetlb_page() to something like
__hugetlb_get_user_pages() to make it clearer in which context it will
get called?


I guess it might be feasible in the future to eliminate
follow_hugetlb_page() and centralize the faulting code. For now, this
certainly improves the situation.

-- 
Thanks,

David / dhildenb


Thread overview: 32+ messages
2022-08-23  7:50 [PATCH v2 0/5] Fix some issues when looking up hugetlb page Baolin Wang
2022-08-23  7:50 ` [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size " Baolin Wang
2022-08-23  8:29   ` David Hildenbrand
2022-08-23 10:02     ` Baolin Wang
2022-08-23 10:23       ` David Hildenbrand
2022-08-23 23:55         ` Mike Kravetz
2022-08-24  2:06           ` Baolin Wang
2022-08-24  7:31             ` David Hildenbrand
2022-08-24  9:41               ` Baolin Wang
2022-08-24 11:55                 ` David Hildenbrand
2022-08-24 14:30                   ` Baolin Wang
2022-08-24 14:33                     ` David Hildenbrand
2022-08-24 15:06                       ` Baolin Wang
2022-08-24 15:13                         ` David Hildenbrand
2022-08-24 15:23                           ` Baolin Wang
2022-08-24 23:34                 ` Mike Kravetz
2022-08-25  1:43                   ` Baolin Wang
2022-08-25  7:10                     ` David Hildenbrand
2022-08-25  7:58                       ` Baolin Wang
2022-08-25 18:30                     ` Mike Kravetz
2022-08-25  7:25                   ` David Hildenbrand [this message]
2022-08-25 10:54                     ` Baolin Wang
2022-08-25 21:13                     ` Mike Kravetz
2022-08-26 22:40                       ` Mike Kravetz
2022-08-27 13:59                       ` Aneesh Kumar K.V
2022-08-29 18:30                         ` Mike Kravetz
2022-08-23  7:50 ` [PATCH v2 2/5] mm/hugetlb: use PTE page lock to protect CONT-PTE entries Baolin Wang
2022-08-23  7:50 ` [PATCH v2 3/5] mm/hugetlb: fix races when looking up a CONT-PMD size hugetlb page Baolin Wang
2022-08-23  7:50 ` [PATCH v2 4/5] mm/hugetlb: use PMD page lock to protect CONT-PTE entries Baolin Wang
2022-08-23  8:14   ` David Hildenbrand
2022-08-23 10:12     ` Baolin Wang
2022-08-23  7:50 ` [PATCH v2 5/5] mm/hugetlb: add FOLL_MIGRATION validation before waiting for a migration entry Baolin Wang
