Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

From: Ruan Shiyang <ruansy.fnst@cn.fujitsu.com>
To: Jan Kara <jack@suse.cz>
Cc: <linux-kernel@vger.kernel.org>, <linux-xfs@vger.kernel.org>,
	<linux-nvdimm@lists.01.org>, <linux-mm@kvack.org>,
	<linux-fsdevel@vger.kernel.org>, <linux-raid@vger.kernel.org>,
	<darrick.wong@oracle.com>, <dan.j.williams@intel.com>,
	<david@fromorbit.com>, <hch@lst.de>, <song@kernel.org>,
	<rgoldwyn@suse.de>, <qi.fuli@fujitsu.com>, <y-goto@fujitsu.com>
Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping
Date: Tue, 12 Jan 2021 10:55:34 +0800	[thread overview]
Message-ID: <75164044-bfdf-b2d6-dff0-d6a8d56d1f62@cn.fujitsu.com> (raw)
In-Reply-To: <20210106154132.GC29271@quack2.suse.cz>

On 2021/1/6 下午11:41, Jan Kara wrote:
> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:
>> The current memory_failure_dev_pagemap() can only handle single-mapped
>> dax page for fsdax mode.  The dax page could be mapped by multiple files
>> and offsets if we let reflink feature & fsdax mode work together.  So,
>> we refactor current implementation to support handle memory failure on
>> each file and offset.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
> 
> Overall this looks OK to me, a few comments below.
> 
>> ---
>>   fs/dax.c            | 21 +++++++++++
>>   include/linux/dax.h |  1 +
>>   include/linux/mm.h  |  9 +++++
>>   mm/memory-failure.c | 91 ++++++++++++++++++++++++++++++++++-----------
>>   4 files changed, 100 insertions(+), 22 deletions(-)

...

>>   
>> @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>>   	}
>>   
>>   	tk->addr = page_address_in_vma(p, vma);
>> -	if (is_zone_device_page(p))
>> -		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>> -	else
>> +	if (is_zone_device_page(p)) {
>> +		if (is_device_fsdax_page(p))
>> +			tk->addr = vma->vm_start +
>> +					((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> 
> It seems strange to use 'pgoff' for dax pages and not for any other page.
> Why? I'd rather pass correct pgoff from all callers of add_to_kill() and
> avoid this special casing...

Because one fsdax page can be shared by multiple pgoffs.  I have to pass 
each pgoff in each iteration to calculate the address in vma (for 
tk->addr).  Other kinds of pages don't need this.  They can get their 
unique address by calling "page_address_in_vma()".

So, I added this fsdax case here.  This patchset only implemented the 
fsdax case, other cases also need to be added here if to be implemented.

--
Thanks,
Ruan Shiyang.

> 
>> +		tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
>> +	} else
>>   		tk->size_shift = page_shift(compound_head(p));
>>   
>>   	/*
>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>>   			if (!page_mapped_in_vma(page, vma))
>>   				continue;
>>   			if (vma->vm_mm == t->mm)
>> -				add_to_kill(t, page, vma, to_kill);
>> +				add_to_kill(t, page, NULL, 0, vma, to_kill);
>>   		}
>>   	}
>>   	read_unlock(&tasklist_lock);
>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>>   /*
>>    * Collect processes when the error hit a file mapped page.
>>    */
>> -static void collect_procs_file(struct page *page, struct list_head *to_kill,
>> -				int force_early)
>> +static void collect_procs_file(struct page *page, struct address_space *mapping,
>> +		pgoff_t pgoff, struct list_head *to_kill, int force_early)
>>   {
>>   	struct vm_area_struct *vma;
>>   	struct task_struct *tsk;
>> -	struct address_space *mapping = page->mapping;
>> -	pgoff_t pgoff;
>>   
>>   	i_mmap_lock_read(mapping);
>>   	read_lock(&tasklist_lock);
>> -	pgoff = page_to_pgoff(page);
>>   	for_each_process(tsk) {
>>   		struct task_struct *t = task_early_kill(tsk, force_early);
>> -
>>   		if (!t)
>>   			continue;
>> -		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
>> -				      pgoff) {
>> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>   			/*
>>   			 * Send early kill signal to tasks where a vma covers
>>   			 * the page but the corrupted page is not necessarily
>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>>   			 * to be informed of all such data corruptions.
>>   			 */
>>   			if (vma->vm_mm == t->mm)
>> -				add_to_kill(t, page, vma, to_kill);
>> +				add_to_kill(t, page, mapping, pgoff, vma, to_kill);
>>   		}
>>   	}
>>   	read_unlock(&tasklist_lock);
>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page, struct list_head *tokill,
>>   	if (PageAnon(page))
>>   		collect_procs_anon(page, tokill, force_early);
>>   	else
>> -		collect_procs_file(page, tokill, force_early);
>> +		collect_procs_file(page, page->mapping, page_to_pgoff(page),
> 
> Why not use page_mapping() helper here? It would be safer for THPs if they
> ever get here...
> 
> 								Honza
>