From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.5 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 17A4AC433DB for ; Thu, 14 Jan 2021 09:38:43 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 7D3E023A1D for ; Thu, 14 Jan 2021 09:38:42 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7D3E023A1D Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A96108D00CC; Thu, 14 Jan 2021 04:38:41 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id A1F078D008E; Thu, 14 Jan 2021 04:38:41 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 90E588D00CC; Thu, 14 Jan 2021 04:38:41 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0190.hostedemail.com [216.40.44.190]) by kanga.kvack.org (Postfix) with ESMTP id 74AAF8D008E for ; Thu, 14 Jan 2021 04:38:41 -0500 (EST) Received: from smtpin11.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 37AE8824556B for ; Thu, 14 Jan 2021 09:38:41 +0000 (UTC) X-FDA: 77703880842.11.help35_100d5fa27525 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin11.hostedemail.com (Postfix) with ESMTP id 09AA018021BB3 for ; Thu, 14 Jan 2021 09:38:41 +0000 (UTC) X-HE-Tag: help35_100d5fa27525 X-Filterd-Recvd-Size: 10425 Received: from out30-132.freemail.mail.aliyun.com (out30-132.freemail.mail.aliyun.com [115.124.30.132]) by imf07.hostedemail.com (Postfix) with ESMTP for ; Thu, 14 Jan 2021 09:38:39 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04394;MF=zhongjiang-ali@linux.alibaba.com;NM=1;PH=DS;RN=16;SR=0;TI=SMTPD_---0ULi6zbO_1610617113; Received: from L-X1DSLVDL-1420.local(mailfrom:zhongjiang-ali@linux.alibaba.com fp:SMTPD_---0ULi6zbO_1610617113) by smtp.aliyun-inc.com(127.0.0.1); Thu, 14 Jan 2021 17:38:34 +0800 Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping To: Ruan Shiyang , Jan Kara Cc: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, darrick.wong@oracle.com, dan.j.williams@intel.com, david@fromorbit.com, hch@lst.de, song@kernel.org, rgoldwyn@suse.de, qi.fuli@fujitsu.com, y-goto@fujitsu.com References: <20201230165601.845024-1-ruansy.fnst@cn.fujitsu.com> <20201230165601.845024-5-ruansy.fnst@cn.fujitsu.com> <20210106154132.GC29271@quack2.suse.cz> <75164044-bfdf-b2d6-dff0-d6a8d56d1f62@cn.fujitsu.com> <781f276b-afdd-091c-3dba-048e415431ab@linux.alibaba.com> <4f184987-3cc2-c72d-0774-5d20ea2e1d49@cn.fujitsu.com> From: zhong jiang Message-ID: <53ecb7e2-8f59-d1a5-df75-4780620ce91f@linux.alibaba.com> Date: Thu, 14 Jan 2021 17:38:33 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:85.0) Gecko/20100101 Thunderbird/85.0 MIME-Version: 1.0 In-Reply-To: <4f184987-3cc2-c72d-0774-5d20ea2e1d49@cn.fujitsu.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2021/1/14 11:52 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: > > > On 2021/1/14 =E4=B8=8A=E5=8D=8811:26, zhong jiang wrote: >> >> On 2021/1/14 9:44 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: >>> >>> >>> On 2021/1/13 =E4=B8=8B=E5=8D=886:04, zhong jiang wrote: >>>> >>>> On 2021/1/12 10:55 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: >>>>> >>>>> >>>>> On 2021/1/6 =E4=B8=8B=E5=8D=8811:41, Jan Kara wrote: >>>>>> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote: >>>>>>> The current memory_failure_dev_pagemap() can only handle=20 >>>>>>> single-mapped >>>>>>> dax page for fsdax mode.=C2=A0 The dax page could be mapped by=20 >>>>>>> multiple files >>>>>>> and offsets if we let reflink feature & fsdax mode work=20 >>>>>>> together. So, >>>>>>> we refactor current implementation to support handle memory=20 >>>>>>> failure on >>>>>>> each file and offset. >>>>>>> >>>>>>> Signed-off-by: Shiyang Ruan >>>>>> >>>>>> Overall this looks OK to me, a few comments below. >>>>>> >>>>>>> --- >>>>>>> =C2=A0 fs/dax.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 | 21 +++++++++++ >>>>>>> =C2=A0 include/linux/dax.h |=C2=A0 1 + >>>>>>> =C2=A0 include/linux/mm.h=C2=A0 |=C2=A0 9 +++++ >>>>>>> =C2=A0 mm/memory-failure.c | 91=20 >>>>>>> ++++++++++++++++++++++++++++++++++----------- >>>>>>> =C2=A0 4 files changed, 100 insertions(+), 22 deletions(-) >>>>> >>>>> ... >>>>> >>>>>>> =C2=A0 @@ -345,9 +348,12 @@ static void add_to_kill(struct=20 >>>>>>> task_struct *tsk, struct page *p, >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->addr =3D page_address_i= n_vma(p, vma); >>>>>>> -=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) >>>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D de= v_pagemap_mapping_shift(p, vma); >>>>>>> -=C2=A0=C2=A0=C2=A0 else >>>>>>> +=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) { >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (is_device_fsdax_p= age(p)) >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= tk->addr =3D vma->vm_start + >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ((pgoff - vma->vm_pgoff)= << PAGE_SHIFT); >>>>>> >>>>>> It seems strange to use 'pgoff' for dax pages and not for any=20 >>>>>> other page. >>>>>> Why? I'd rather pass correct pgoff from all callers of=20 >>>>>> add_to_kill() and >>>>>> avoid this special casing... >>>>> >>>>> Because one fsdax page can be shared by multiple pgoffs. I have to=20 >>>>> pass each pgoff in each iteration to calculate the address in vma=20 >>>>> (for tk->addr).=C2=A0 Other kinds of pages don't need this. They ca= n=20 >>>>> get their unique address by calling "page_address_in_vma()". >>>>> >>>> IMO,=C2=A0=C2=A0 an fsdax page can be shared by multiple files rathe= r than=20 >>>> multiple pgoffs if fs query support reflink.=C2=A0=C2=A0 Because an = page only=20 >>>> located in an mapping(page->mapping is exclusive), hence it=C2=A0 on= ly=20 >>>> has an pgoff or index pointing at the node. >>>> >>>> =C2=A0=C2=A0or=C2=A0 I miss something for the feature ?=C2=A0 thanks= , >>> >>> Yes, a fsdax page is shared by multiple files because of reflink. I=20 >>> think my description of 'pgoff' here is not correct.=C2=A0 This 'pgof= f'=20 >>> means the offset within the a file. (We use rmap to find out all the=20 >>> sharing files and their offsets.)=C2=A0 So, I said that "can be share= d by=20 >>> multiple pgoffs".=C2=A0 It's my bad. >>> >>> I think I should name it another word to avoid misunderstandings. >>> >> IMO,=C2=A0 All the sharing files should be the same offset to share th= e=20 >> fsdax page.=C2=A0 why not that ?=20 > > The dedupe operation can let different files share their same data=20 > extent, though offsets are not same.=C2=A0 So, files can share one fsda= x=20 > page at different offset. Ok,=C2=A0 Get it. > >> As you has said,=C2=A0 a shared fadax page should be inserted to diffe= rent=20 >> mapping files.=C2=A0 but page->index and page->mapping is exclusive.=C2= =A0=20 >> hence an page only should be placed in an mapping tree. > > We can't use page->mapping and page->index here for reflink & fsdax.=20 > And that's this patchset aims to solve.=C2=A0 I introduced a series of=20 > ->corrupted_range(), from mm to pmem driver to block device and=20 > finally to filesystem, to use rmap feature of filesystem to find out=20 > all files sharing same data extent (fsdax page). From this patch,=C2=A0 each file has mapping tree,=C2=A0 the shared page= will be=20 inserted into multiple file mapping tree.=C2=A0 then filesystem use file = and=20 offset to get the killed process.=C2=A0=C2=A0 Is it correct? Thanks, > > > --=20 > Thanks, > Ruan Shiyang. > >> >> And In the current patch,=C2=A0 we failed to found out that all proces= s=20 >> use the fsdax page shared by multiple files and kill them. >> >> >> Thanks, >> >>> --=20 >>> Thanks, >>> Ruan Shiyang. >>> >>>> >>>>> So, I added this fsdax case here. This patchset only implemented=20 >>>>> the fsdax case, other cases also need to be added here if to be=20 >>>>> implemented. >>>>> >>>>> >>>>> --=20 >>>>> Thanks, >>>>> Ruan Shiyang. >>>>> >>>>>> >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D de= v_pagemap_mapping_shift(p, vma,=20 >>>>>>> tk->addr); >>>>>>> +=C2=A0=C2=A0=C2=A0 } else >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_s= hift =3D page_shift(compound_head(p)); >>>>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>>>>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page=20 >>>>>>> *page, struct list_head *to_kill, >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 if (!page_mapped_in_vma(page, vma)) >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 continue; >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, NULL, 0, vma, to_kill); >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>>>>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page=20 >>>>>>> *page, struct list_head *to_kill, >>>>>>> =C2=A0 /* >>>>>>> =C2=A0=C2=A0 * Collect processes when the error hit a file mapped= page. >>>>>>> =C2=A0=C2=A0 */ >>>>>>> -static void collect_procs_file(struct page *page, struct=20 >>>>>>> list_head *to_kill, >>>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 int force_early) >>>>>>> +static void collect_procs_file(struct page *page, struct=20 >>>>>>> address_space *mapping, >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff_t pgoff, struct= list_head *to_kill, int force_early) >>>>>>> =C2=A0 { >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma; >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct task_struct *tsk; >>>>>>> -=C2=A0=C2=A0=C2=A0 struct address_space *mapping =3D page->mappi= ng; >>>>>>> -=C2=A0=C2=A0=C2=A0 pgoff_t pgoff; >>>>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 i_mmap_lock_read(mapping); >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_lock(&tasklist_lock); >>>>>>> -=C2=A0=C2=A0=C2=A0 pgoff =3D page_to_pgoff(page); >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for_each_process(tsk) { >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct tas= k_struct *t =3D task_early_kill(tsk,=20 >>>>>>> force_early); >>>>>>> - >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!t) >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 continue; >>>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_for= each(vma, &mapping->i_mmap, pgoff, >>>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff) { >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_for= each(vma, &mapping->i_mmap, pgoff,=20 >>>>>>> pgoff) { >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 /* >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 * Send early kill signal to tasks where a vma covers >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 * the page but the corrupted page is not necessarily >>>>>>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page=20 >>>>>>> *page, struct list_head *to_kill, >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 * to be informed of all such data corruptions. >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 */ >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, mapping, pgoff, vma,=20 >>>>>>> to_kill); >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>>>>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page,=20 >>>>>>> struct list_head *tokill, >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (PageAnon(page)) >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_pr= ocs_anon(page, tokill, force_early); >>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 else >>>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(pa= ge, tokill, force_early); >>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(pa= ge, page->mapping,=20 >>>>>>> page_to_pgoff(page), >>>>>> >>>>>> Why not use page_mapping() helper here? It would be safer for=20 >>>>>> THPs if they >>>>>> ever get here... >>>>>> >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Honza >>>>>> >>>>> >>>> >>>> >>> >> >> >