From: Ruan Shiyang <ruansy.fnst@cn.fujitsu.com>
To: zhong jiang <zhongjiang-ali@linux.alibaba.com>, Jan Kara <jack@suse.cz>
Cc: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-nvdimm@lists.01.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org,
darrick.wong@oracle.com, david@fromorbit.com, hch@lst.de,
song@kernel.org, rgoldwyn@suse.de, qi.fuli@fujitsu.com,
y-goto@fujitsu.com
Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping
Date: Thu, 14 Jan 2021 11:52:14 +0800 [thread overview]
Message-ID: <4f184987-3cc2-c72d-0774-5d20ea2e1d49@cn.fujitsu.com> (raw)
In-Reply-To: <e2f7ad16-8162-4933-9091-72e690e9877e@linux.alibaba.com>
On 2021/1/14 上午11:26, zhong jiang wrote:
>
> On 2021/1/14 9:44 上午, Ruan Shiyang wrote:
>>
>>
>> On 2021/1/13 下午6:04, zhong jiang wrote:
>>>
>>> On 2021/1/12 10:55 上午, Ruan Shiyang wrote:
>>>>
>>>>
>>>> On 2021/1/6 下午11:41, Jan Kara wrote:
>>>>> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:
>>>>>> The current memory_failure_dev_pagemap() can only handle
>>>>>> single-mapped
>>>>>> dax page for fsdax mode. The dax page could be mapped by multiple
>>>>>> files
>>>>>> and offsets if we let reflink feature & fsdax mode work together. So,
>>>>>> we refactor current implementation to support handle memory
>>>>>> failure on
>>>>>> each file and offset.
>>>>>>
>>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
>>>>>
>>>>> Overall this looks OK to me, a few comments below.
>>>>>
>>>>>> ---
>>>>>> fs/dax.c | 21 +++++++++++
>>>>>> include/linux/dax.h | 1 +
>>>>>> include/linux/mm.h | 9 +++++
>>>>>> mm/memory-failure.c | 91
>>>>>> ++++++++++++++++++++++++++++++++++-----------
>>>>>> 4 files changed, 100 insertions(+), 22 deletions(-)
>>>>
>>>> ...
>>>>
>>>>>> @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct
>>>>>> *tsk, struct page *p,
>>>>>> }
>>>>>> tk->addr = page_address_in_vma(p, vma);
>>>>>> - if (is_zone_device_page(p))
>>>>>> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>>>>>> - else
>>>>>> + if (is_zone_device_page(p)) {
>>>>>> + if (is_device_fsdax_page(p))
>>>>>> + tk->addr = vma->vm_start +
>>>>>> + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>>
>>>>> It seems strange to use 'pgoff' for dax pages and not for any other
>>>>> page.
>>>>> Why? I'd rather pass correct pgoff from all callers of
>>>>> add_to_kill() and
>>>>> avoid this special casing...
>>>>
>>>> Because one fsdax page can be shared by multiple pgoffs. I have to
>>>> pass each pgoff in each iteration to calculate the address in vma
>>>> (for tk->addr). Other kinds of pages don't need this. They can get
>>>> their unique address by calling "page_address_in_vma()".
>>>>
>>> IMO, an fsdax page can be shared by multiple files rather than
>>> multiple pgoffs if fs query support reflink. Because an page only
>>> located in an mapping(page->mapping is exclusive), hence it only has
>>> an pgoff or index pointing at the node.
>>>
>>> or I miss something for the feature ? thanks,
>>
>> Yes, a fsdax page is shared by multiple files because of reflink. I
>> think my description of 'pgoff' here is not correct. This 'pgoff'
>> means the offset within the a file. (We use rmap to find out all the
>> sharing files and their offsets.) So, I said that "can be shared by
>> multiple pgoffs". It's my bad.
>>
>> I think I should name it another word to avoid misunderstandings.
>>
> IMO, All the sharing files should be the same offset to share the fsdax
> page. why not that ?
The dedupe operation can let different files share their same data
extent, though offsets are not same. So, files can share one fsdax page
at different offset.
> As you has said, a shared fadax page should be
> inserted to different mapping files. but page->index and page->mapping
> is exclusive. hence an page only should be placed in an mapping tree.
We can't use page->mapping and page->index here for reflink & fsdax.
And that's this patchset aims to solve. I introduced a series of
->corrupted_range(), from mm to pmem driver to block device and finally
to filesystem, to use rmap feature of filesystem to find out all files
sharing same data extent (fsdax page).
--
Thanks,
Ruan Shiyang.
>
> And In the current patch, we failed to found out that all process use
> the fsdax page shared by multiple files and kill them.
>
>
> Thanks,
>
>> --
>> Thanks,
>> Ruan Shiyang.
>>
>>>
>>>> So, I added this fsdax case here. This patchset only implemented
>>>> the fsdax case, other cases also need to be added here if to be
>>>> implemented.
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Ruan Shiyang.
>>>>
>>>>>
>>>>>> + tk->size_shift = dev_pagemap_mapping_shift(p, vma,
>>>>>> tk->addr);
>>>>>> + } else
>>>>>> tk->size_shift = page_shift(compound_head(p));
>>>>>> /*
>>>>>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page
>>>>>> *page, struct list_head *to_kill,
>>>>>> if (!page_mapped_in_vma(page, vma))
>>>>>> continue;
>>>>>> if (vma->vm_mm == t->mm)
>>>>>> - add_to_kill(t, page, vma, to_kill);
>>>>>> + add_to_kill(t, page, NULL, 0, vma, to_kill);
>>>>>> }
>>>>>> }
>>>>>> read_unlock(&tasklist_lock);
>>>>>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page
>>>>>> *page, struct list_head *to_kill,
>>>>>> /*
>>>>>> * Collect processes when the error hit a file mapped page.
>>>>>> */
>>>>>> -static void collect_procs_file(struct page *page, struct
>>>>>> list_head *to_kill,
>>>>>> - int force_early)
>>>>>> +static void collect_procs_file(struct page *page, struct
>>>>>> address_space *mapping,
>>>>>> + pgoff_t pgoff, struct list_head *to_kill, int force_early)
>>>>>> {
>>>>>> struct vm_area_struct *vma;
>>>>>> struct task_struct *tsk;
>>>>>> - struct address_space *mapping = page->mapping;
>>>>>> - pgoff_t pgoff;
>>>>>> i_mmap_lock_read(mapping);
>>>>>> read_lock(&tasklist_lock);
>>>>>> - pgoff = page_to_pgoff(page);
>>>>>> for_each_process(tsk) {
>>>>>> struct task_struct *t = task_early_kill(tsk, force_early);
>>>>>> -
>>>>>> if (!t)
>>>>>> continue;
>>>>>> - vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
>>>>>> - pgoff) {
>>>>>> + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
>>>>>> pgoff) {
>>>>>> /*
>>>>>> * Send early kill signal to tasks where a vma covers
>>>>>> * the page but the corrupted page is not necessarily
>>>>>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page
>>>>>> *page, struct list_head *to_kill,
>>>>>> * to be informed of all such data corruptions.
>>>>>> */
>>>>>> if (vma->vm_mm == t->mm)
>>>>>> - add_to_kill(t, page, vma, to_kill);
>>>>>> + add_to_kill(t, page, mapping, pgoff, vma, to_kill);
>>>>>> }
>>>>>> }
>>>>>> read_unlock(&tasklist_lock);
>>>>>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page,
>>>>>> struct list_head *tokill,
>>>>>> if (PageAnon(page))
>>>>>> collect_procs_anon(page, tokill, force_early);
>>>>>> else
>>>>>> - collect_procs_file(page, tokill, force_early);
>>>>>> + collect_procs_file(page, page->mapping, page_to_pgoff(page),
>>>>>
>>>>> Why not use page_mapping() helper here? It would be safer for THPs
>>>>> if they
>>>>> ever get here...
>>>>>
>>>>> Honza
>>>>>
>>>>
>>>
>>>
>>
>
>
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
next prev parent reply other threads:[~2021-01-14 3:52 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-30 16:55 [PATCH 00/10] fsdax: introduce fs query to support reflink Shiyang Ruan
2020-12-30 16:55 ` [PATCH 01/10] pagemap: Introduce ->memory_failure() Shiyang Ruan
2020-12-30 16:55 ` [PATCH 02/10] blk: Introduce ->corrupted_range() for block device Shiyang Ruan
2021-01-08 9:55 ` Christoph Hellwig
2021-01-08 19:09 ` Darrick J. Wong
2020-12-30 16:55 ` [PATCH 03/10] fs: Introduce ->corrupted_range() for superblock Shiyang Ruan
2021-01-08 9:56 ` Christoph Hellwig
2020-12-30 16:55 ` [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping Shiyang Ruan
2021-01-06 15:41 ` Jan Kara
2021-01-12 2:55 ` Ruan Shiyang
2021-01-13 10:04 ` zhong jiang
2021-01-14 1:44 ` Ruan Shiyang
2021-01-14 3:26 ` zhong jiang
2021-01-14 3:52 ` Ruan Shiyang [this message]
2021-01-14 9:38 ` zhong jiang
2021-01-14 17:20 ` Darrick J. Wong
2021-01-14 20:38 ` Dan Williams
2020-12-30 16:55 ` [PATCH 05/10] mm, pmem: Implement ->memory_failure() in pmem driver Shiyang Ruan
2021-01-06 15:55 ` Jan Kara
2020-12-30 16:55 ` [PATCH 06/10] pmem: Implement ->corrupted_range() for " Shiyang Ruan
2020-12-30 16:55 ` [PATCH 07/10] dm: Introduce ->rmap() to find bdev offset Shiyang Ruan
2020-12-30 16:55 ` [PATCH 08/10] md: Implement ->corrupted_range() Shiyang Ruan
2021-01-06 17:14 ` Jan Kara
2021-01-12 12:45 ` Ruan Shiyang
2020-12-30 16:56 ` [PATCH 09/10] xfs: Implement ->corrupted_range() for XFS Shiyang Ruan
2021-01-04 23:21 ` Darrick J. Wong
2020-12-30 16:56 ` [PATCH 10/10] fs/dax: remove useless functions Shiyang Ruan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4f184987-3cc2-c72d-0774-5d20ea2e1d49@cn.fujitsu.com \
--to=ruansy.fnst@cn.fujitsu.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nvdimm@lists.01.org \
--cc=linux-raid@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=qi.fuli@fujitsu.com \
--cc=rgoldwyn@suse.de \
--cc=song@kernel.org \
--cc=y-goto@fujitsu.com \
--cc=zhongjiang-ali@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).