Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

From: Ruan Shiyang <ruansy.fnst@cn.fujitsu.com>
To: zhong jiang <zhongjiang-ali@linux.alibaba.com>, Jan Kara <jack@suse.cz>
Cc: <linux-kernel@vger.kernel.org>, <linux-xfs@vger.kernel.org>,
	<linux-nvdimm@lists.01.org>, <linux-mm@kvack.org>,
	<linux-fsdevel@vger.kernel.org>, <linux-raid@vger.kernel.org>,
	<darrick.wong@oracle.com>, <dan.j.williams@intel.com>,
	<david@fromorbit.com>, <hch@lst.de>, <song@kernel.org>,
	<rgoldwyn@suse.de>, <qi.fuli@fujitsu.com>, <y-goto@fujitsu.com>
Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping
Date: Thu, 14 Jan 2021 09:44:14 +0800	[thread overview]
Message-ID: <ef29ba5c-96d7-d0bb-e405-c7472a518b32@cn.fujitsu.com> (raw)
In-Reply-To: <781f276b-afdd-091c-3dba-048e415431ab@linux.alibaba.com>

On 2021/1/13 下午6:04, zhong jiang wrote:
> 
> On 2021/1/12 10:55 上午, Ruan Shiyang wrote:
>>
>>
>> On 2021/1/6 下午11:41, Jan Kara wrote:
>>> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:
>>>> The current memory_failure_dev_pagemap() can only handle single-mapped
>>>> dax page for fsdax mode.  The dax page could be mapped by multiple 
>>>> files
>>>> and offsets if we let reflink feature & fsdax mode work together.  So,
>>>> we refactor current implementation to support handle memory failure on
>>>> each file and offset.
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
>>>
>>> Overall this looks OK to me, a few comments below.
>>>
>>>> ---
>>>>   fs/dax.c            | 21 +++++++++++
>>>>   include/linux/dax.h |  1 +
>>>>   include/linux/mm.h  |  9 +++++
>>>>   mm/memory-failure.c | 91 
>>>> ++++++++++++++++++++++++++++++++++-----------
>>>>   4 files changed, 100 insertions(+), 22 deletions(-)
>>
>> ...
>>
>>>>   @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct 
>>>> *tsk, struct page *p,
>>>>       }
>>>>         tk->addr = page_address_in_vma(p, vma);
>>>> -    if (is_zone_device_page(p))
>>>> -        tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>>>> -    else
>>>> +    if (is_zone_device_page(p)) {
>>>> +        if (is_device_fsdax_page(p))
>>>> +            tk->addr = vma->vm_start +
>>>> +                    ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>
>>> It seems strange to use 'pgoff' for dax pages and not for any other 
>>> page.
>>> Why? I'd rather pass correct pgoff from all callers of add_to_kill() and
>>> avoid this special casing...
>>
>> Because one fsdax page can be shared by multiple pgoffs.  I have to 
>> pass each pgoff in each iteration to calculate the address in vma (for 
>> tk->addr).  Other kinds of pages don't need this. They can get their 
>> unique address by calling "page_address_in_vma()".
>>
> IMO,   an fsdax page can be shared by multiple files rather than 
> multiple pgoffs if fs query support reflink.   Because an page only 
> located in an mapping(page->mapping is exclusive),  hence it  only has 
> an pgoff or index pointing at the node.
> 
>   or  I miss something for the feature ?  thanks,

Yes, a fsdax page is shared by multiple files because of reflink.  I 
think my description of 'pgoff' here is not correct.  This 'pgoff' means 
the offset within the a file.  (We use rmap to find out all the sharing 
files and their offsets.)  So, I said that "can be shared by multiple 
pgoffs".  It's my bad.

I think I should name it another word to avoid misunderstandings.

--
Thanks,
Ruan Shiyang.

> 
>> So, I added this fsdax case here.  This patchset only implemented the 
>> fsdax case, other cases also need to be added here if to be implemented.
>>
>>
>> -- 
>> Thanks,
>> Ruan Shiyang.
>>
>>>
>>>> +        tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
>>>> +    } else
>>>>           tk->size_shift = page_shift(compound_head(p));
>>>>         /*
>>>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page 
>>>> *page, struct list_head *to_kill,
>>>>               if (!page_mapped_in_vma(page, vma))
>>>>                   continue;
>>>>               if (vma->vm_mm == t->mm)
>>>> -                add_to_kill(t, page, vma, to_kill);
>>>> +                add_to_kill(t, page, NULL, 0, vma, to_kill);
>>>>           }
>>>>       }
>>>>       read_unlock(&tasklist_lock);
>>>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page 
>>>> *page, struct list_head *to_kill,
>>>>   /*
>>>>    * Collect processes when the error hit a file mapped page.
>>>>    */
>>>> -static void collect_procs_file(struct page *page, struct list_head 
>>>> *to_kill,
>>>> -                int force_early)
>>>> +static void collect_procs_file(struct page *page, struct 
>>>> address_space *mapping,
>>>> +        pgoff_t pgoff, struct list_head *to_kill, int force_early)
>>>>   {
>>>>       struct vm_area_struct *vma;
>>>>       struct task_struct *tsk;
>>>> -    struct address_space *mapping = page->mapping;
>>>> -    pgoff_t pgoff;
>>>>         i_mmap_lock_read(mapping);
>>>>       read_lock(&tasklist_lock);
>>>> -    pgoff = page_to_pgoff(page);
>>>>       for_each_process(tsk) {
>>>>           struct task_struct *t = task_early_kill(tsk, force_early);
>>>> -
>>>>           if (!t)
>>>>               continue;
>>>> -        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
>>>> -                      pgoff) {
>>>> +        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, 
>>>> pgoff) {
>>>>               /*
>>>>                * Send early kill signal to tasks where a vma covers
>>>>                * the page but the corrupted page is not necessarily
>>>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page 
>>>> *page, struct list_head *to_kill,
>>>>                * to be informed of all such data corruptions.
>>>>                */
>>>>               if (vma->vm_mm == t->mm)
>>>> -                add_to_kill(t, page, vma, to_kill);
>>>> +                add_to_kill(t, page, mapping, pgoff, vma, to_kill);
>>>>           }
>>>>       }
>>>>       read_unlock(&tasklist_lock);
>>>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page, 
>>>> struct list_head *tokill,
>>>>       if (PageAnon(page))
>>>>           collect_procs_anon(page, tokill, force_early);
>>>>       else
>>>> -        collect_procs_file(page, tokill, force_early);
>>>> +        collect_procs_file(page, page->mapping, page_to_pgoff(page),
>>>
>>> Why not use page_mapping() helper here? It would be safer for THPs if 
>>> they
>>> ever get here...
>>>
>>>                                 Honza
>>>
>>
> 
>