From: "Thomas Hellström (VMware)" <thomas_os@shipmail.org>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	torvalds@linux-foundation.org,
	"Thomas Hellstrom" <thellstrom@vmware.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Will Deacon" <will.deacon@arm.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Rik van Riel" <riel@surriel.com>,
	"Minchan Kim" <minchan@kernel.org>,
	"Michal Hocko" <mhocko@suse.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Jérôme Glisse" <jglisse@redhat.com>
Subject: Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the pagewalk code
Date: Thu, 3 Oct 2019 13:32:45 +0200
Message-ID: <d336497b-3716-0748-d838-378902399439@shipmail.org>
In-Reply-To: <20191003111708.sttkkrhiidleivc6@box>

On 10/3/19 1:17 PM, Kirill A. Shutemov wrote:
> On Wed, Oct 02, 2019 at 03:47:25PM +0200, Thomas Hellström (VMware) wrote:
>> From: Thomas Hellstrom <thellstrom@vmware.com>
>>
>> For users that want to traverse all page table entries pointing into a
>> region of a struct address_space mapping, introduce a walk_page_mapping()
>> function.
>>
>> The walk_page_mapping() function will initially be used for dirty-
>> tracking in virtual graphics drivers.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
>> ---
>>   include/linux/pagewalk.h |  9 ++++
>>   mm/pagewalk.c            | 99 +++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 107 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index bddd9759bab9..6ec82e92c87f 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -24,6 +24,9 @@ struct mm_walk;
>>    *			"do page table walk over the current vma", returning
>>    *			a negative value means "abort current page table walk
>>    *			right now" and returning 1 means "skip the current vma"
>> + * @pre_vma:            if set, called before starting the walk on a non-null vma.
>> + * @post_vma:           if set, called after a walk on a non-null vma, provided
>> + *                      that @pre_vma and the vma walk succeeded.
>>    */
>>   struct mm_walk_ops {
>>   	int (*pud_entry)(pud_t *pud, unsigned long addr,
>> @@ -39,6 +42,9 @@ struct mm_walk_ops {
>>   			     struct mm_walk *walk);
>>   	int (*test_walk)(unsigned long addr, unsigned long next,
>>   			struct mm_walk *walk);
>> +	int (*pre_vma)(unsigned long start, unsigned long end,
>> +		       struct mm_walk *walk);
>> +	void (*post_vma)(struct mm_walk *walk);
>>   };
>>   
>>   /**
>> @@ -62,5 +68,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>>   		void *private);
>>   int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>>   		void *private);
>> +int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>> +		      pgoff_t nr, const struct mm_walk_ops *ops,
>> +		      void *private);
>>   
>>   #endif /* _LINUX_PAGEWALK_H */
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index d48c2a986ea3..658d1e5ec428 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -253,13 +253,23 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>>   {
>>   	int err = 0;
>>   	struct vm_area_struct *vma = walk->vma;
>> +	const struct mm_walk_ops *ops = walk->ops;
>> +
>> +	if (vma && ops->pre_vma) {
>> +		err = ops->pre_vma(start, end, walk);
>> +		if (err)
>> +			return err;
>> +	}
>>   
>>   	if (vma && is_vm_hugetlb_page(vma)) {
>> -		if (walk->ops->hugetlb_entry)
>> +		if (ops->hugetlb_entry)
>>   			err = walk_hugetlb_range(start, end, walk);
>>   	} else
>>   		err = walk_pgd_range(start, end, walk);
>>   
>> +	if (vma && ops->post_vma)
>> +		ops->post_vma(walk);
>> +
>>   	return err;
>>   }
>>   
>> @@ -285,11 +295,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>>    *  - <0 : failed to handle the current entry, and return to the caller
>>    *         with error code.
>>    *
>> + *
>>    * Before starting to walk page table, some callers want to check whether
>>    * they really want to walk over the current vma, typically by checking
>>    * its vm_flags. walk_page_test() and @ops->test_walk() are used for this
>>    * purpose.
>>    *
>> + * If operations need to be staged before and committed after a vma is walked,
>> + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(),
>> + * since it is intended to handle commit-type operations, can't return any
>> + * errors.
>> + *
>>    * struct mm_walk keeps current values of some common data like vma and pmd,
>>    * which are useful for the access from callbacks. If you want to pass some
>>    * caller-specific data to callbacks, @private should be helpful.
>> @@ -376,3 +392,84 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>>   		return err;
>>   	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>>   }
>> +
>> +/**
>> + * walk_page_mapping - walk all memory areas mapped into a struct address_space.
>> + * @mapping: Pointer to the struct address_space
>> + * @first_index: First page offset in the address_space
>> + * @nr: Number of incremental page offsets to cover
>> + * @ops:	operation to call during the walk
>> + * @private:	private data for callbacks' usage
>> + *
>> + * This function walks all memory areas mapped into a struct address_space.
>> + * The walk is limited to only the given page-size index range, but if
>> + * the index boundaries cross a huge page-table entry, that entry will be
>> + * included.
>> + *
>> + * Also see walk_page_range() for additional information.
>> + *
>> + * Locking:
>> + *   This function can't require that the struct mm_struct::mmap_sem is held,
>> + *   since @mapping may be mapped by multiple processes. Instead
>> + *   @mapping->i_mmap_rwsem must be held. This might have implications in the
>> + *   callbacks, and it's up to the caller to ensure that the
>> + *   struct mm_struct::mmap_sem is not needed.
>> + *
>> + *   Also this means that a caller can't rely on the struct
>> + *   vm_area_struct::vm_flags to be constant across a call,
>> + *   except for immutable flags. Callers requiring this shouldn't use
>> + *   this function.
>> + *
>> + *   If @mapping allows faulting of huge pmds and puds, it is desirable
>> + *   that its huge_fault() handler blocks while this function is running on
>> + *   @mapping. Otherwise a race may occur where the huge entry is split when
>> + *   it was intended to be handled in a huge entry callback. This requires an
>> + *   external lock, for example that @mapping->i_mmap_rwsem is held in
>> + *   write mode in the huge_fault() handlers.
> Em. No. We have the ptl for this. It's the only lock required (plus
> mmap_sem held for read) to split a PMD entry into a PTE table. And that
> can happen not only from the fault path.
>
> If you care about a compound page being split under you, take a pin on
> the page or lock it. That will block split_huge_page().
>
> The suggestion to block the fault path is not viable (and it will not
> happen magically just because of this comment).
>
I was specifically thinking of this:

https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103

If a huge pud is concurrently faulted in here, it will immediately get
split without ever being processed in pud_entry(). An external lock would
protect against that, but perhaps that's a bug in the pagewalk code?
For pmds the situation is different, since when pte_entry is used, all
pmds will unconditionally get split.
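Roughly, the relevant logic looks like this (a paraphrased sketch of
walk_pud_range() in mm/pagewalk.c as of the current tree; simplified, not
the verbatim upstream code):

static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
			  struct mm_walk *walk)
{
	pud_t *pud = pud_offset(p4d, addr);
	unsigned long next;
	int err = 0;

	do {
again:
		next = pud_addr_end(addr, end);
		if (pud_none(*pud) || !walk->vma) {
			if (walk->ops->pte_hole)
				err = walk->ops->pte_hole(addr, next, walk);
			if (err)
				break;
			continue;
		}

		if (walk->ops->pud_entry) {
			spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);

			/* A huge pud seen under its ptl goes to the callback. */
			if (ptl) {
				err = walk->ops->pud_entry(pud, addr, next, walk);
				spin_unlock(ptl);
				if (err)
					break;
				continue;
			}
		}

		/*
		 * The window: a huge pud faulted in after
		 * pud_trans_huge_lock() returned NULL above gets split here
		 * without ever having been passed to pud_entry().
		 */
		split_huge_pud(walk->vma, pud, addr);
		if (pud_none(*pud))
			goto again;

		if (walk->ops->pmd_entry || walk->ops->pte_entry)
			err = walk_pmd_range(pud, addr, next, walk);
		if (err)
			break;
	} while (pud++, addr = next, addr != end);

	return err;
}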

There's a similar, scarier race in

https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931

It looks like if a concurrent thread faults in a huge pud just after the
pud_none() test in that pmd_alloc(), things might go pretty bad.
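To see why, here is pmd_alloc() as I read it (a simplified sketch of the
include/linux/mm.h definition; see the tree for the exact code):

static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud,
			       unsigned long address)
{
	/*
	 * The window: if a huge pud is faulted in after this pud_none()
	 * check, pmd_offset() below treats the huge entry as if it were
	 * a pointer to a pmd page table.
	 */
	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address)) ?
		NULL : pmd_offset(pud, address);
}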

/Thomas



