From: Dan Williams
Date: Tue, 27 Jul 2021 16:51:43 -0700
Subject: Re: [PATCH v3 12/14] device-dax: compound pagemap support
To: Joao Martins
Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi, Matthew Wilcox,
 Jason Gunthorpe, John Hubbard, Jane Chu, Muchun Song, Mike Kravetz,
 Andrew Morton, Jonathan Corbet, Linux NVDIMM, Linux Doc Mailing List
References: <20210714193542.21857-1-joao.m.martins@oracle.com>
 <20210714193542.21857-13-joao.m.martins@oracle.com>

On Thu, Jul 15, 2021 at 5:01 AM Joao Martins wrote:
>
> On 7/15/21 12:36 AM, Dan Williams wrote:
> > On Wed, Jul 14, 2021 at 12:36 PM Joao Martins wrote:
> >>
> >> Use the newly added compound pagemap facility which maps the assigned dax
> >> ranges as compound pages at a page size of @align. Currently, this means
> >> that region/namespace bootstrap takes considerably less time, given that
> >> considerably fewer pages are initialized.
> >>
> >> On setups with 128G NVDIMMs the initialization with DRAM-stored struct
> >> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
> >> than 1 ms with 1G pages.
> >>
> >> dax devices are created with a fixed @align (huge page size), which is
> >> also enforced at mmap() of the device. Faults consequently happen at the
> >> @align specified at creation time, and that doesn't change throughout the
> >> dax device's lifetime. An MCE poisons a whole dax huge page, and splits
> >> likewise occur at the configured page size.
> >>
> >
> > Hi Joao,
> >
> > With this patch I'm hitting the following with the 'device-dax' test [1].
> >
> Ugh, I can reproduce it too -- apologies for the oversight.

No worries.

> This patch is not the culprit; the flaw is earlier in the series, specifically
> the fourth patch.
>
> It needs the chunk below changed in the fourth patch, due to the existing
> elevated page ref count at zone device memmap init. put_page() called here in
> memunmap_pages():
>
>         for (i = 0; i < pgmap->nr_ranges; i++)
>                 for_each_device_pfn(pfn, pgmap, i)
>                         put_page(pfn_to_page(pfn));
>
> ... on a zone_device compound memmap would otherwise always decrease the head
> page refcount by @geometry pfn amount (leading to the aforementioned splat you
> reported).
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b0e7b8cf3047..79a883af788e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
>         return (range->start + range_len(range)) >> PAGE_SHIFT;
>  }
>
> -static unsigned long pfn_next(unsigned long pfn)
> +static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
>  {
>         if (pfn % 1024 == 0)
>                 cond_resched();
> -       return pfn + 1;
> +       return pfn + pgmap_pfn_geometry(pgmap);

The cond_resched() would need to be fixed up too, to something like:

        if (pfn % (1024 << pgmap_geometry_order(pgmap)) == 0)
                cond_resched();

...because the goal is to take a break every 1024 iterations, not every
1024 pfns.

>  }
>
>  #define for_each_device_pfn(pfn, map, i) \
> -       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
> +       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
>
>  static void dev_pagemap_kill(struct dev_pagemap *pgmap)
>  {
>
> It could also get the hunk below, but that is somewhat redundant provided we
> won't touch tail page refcounts throughout the devmap pages' lifetime. This
> setting of tail page refcounts to zero was in the pre-v5.14 series, but it got
> removed under the assumption that the pages come from the page allocator
> (where tail page refcounts are already zeroed).

Wait, devmap pages never see the page allocator?

>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 96975edac0a8..469a7aa5cf38 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
>                 __init_zone_device_page(page + i, pfn + i, zone_idx,
>                                         nid, pgmap);
>                 prep_compound_tail(page, i);
> +               set_page_count(page + i, 0);

Looks good to me, and perhaps a check for an elevated tail page refcount at
teardown, as a sanity check that the tail pages were never pinned directly?

>
>                 /*
>                  * The first and second tail pages need to
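
For reference, a minimal sketch of the kind of teardown sanity check being
suggested here -- assuming the pgmap_geometry_order(), pfn_first() and
pfn_end() helpers from this series, with memunmap_pages_sanity() as a purely
illustrative name rather than anything in the actual patches:

        /*
         * Sketch only: walk each range one compound page at a time (the
         * reworked pfn_next() steps by the pgmap geometry), drop the
         * init-time reference held on the head page, and warn if any tail
         * page picked up a refcount, i.e. if a tail page was ever pinned
         * directly.
         */
        static void memunmap_pages_sanity(struct dev_pagemap *pgmap, int range_id)
        {
                unsigned long order = pgmap_geometry_order(pgmap);
                unsigned long pfn, i;

                for (pfn = pfn_first(pgmap, range_id); pfn < pfn_end(pgmap, range_id);
                     pfn += 1UL << order) {
                        struct page *head = pfn_to_page(pfn);

                        /* The init-time reference lives on the head page only. */
                        put_page(head);

                        /* Tail pages are expected to stay at refcount zero. */
                        for (i = 1; i < (1UL << order); i++)
                                WARN_ON_ONCE(page_ref_count(head + i) != 0);
                }
        }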