From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Peter Xu, Tiberiu A Georgescu
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, christian.brauner@ubuntu.com, ebiederm@xmission.com, adobriyan@gmail.com, songmuchun@bytedance.com, axboe@kernel.dk, vincenzo.frascino@arm.com, catalin.marinas@arm.com, peterz@infradead.org, chinwen.chang@mediatek.com, linmiaohe@huawei.com, jannh@google.com, apopple@nvidia.com, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, ivan.teterevkov@nutanix.com, florian.schmidt@nutanix.com, carl.waldspurger@nutanix.com, jonathan.davies@nutanix.com
Subject: Re: [PATCH 0/1] pagemap: swap location for shared pages
Date: Wed, 4 Aug 2021 20:49:14 +0200
Message-ID: <839e82f7-2c54-d1ef-8371-0a332a4cb447@redhat.com>
References: <20210730160826.63785-1-tiberiu.georgescu@nutanix.com>
On 04.08.21 20:33, Peter Xu wrote:
> Hi, Tiberiu,
> 
> On Fri, Jul 30, 2021 at 04:08:25PM +0000, Tiberiu A Georgescu wrote:
>> This patch follows up on a previous RFC:
>> 20210714152426.216217-1-tiberiu.georgescu@nutanix.com
>>
>> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
>> entry is cleared. In many cases, there is no difference between swapped-out
>> shared pages and newly allocated, non-dirty pages in the pagemap interface.
>>
>> Example pagemap-test code (tested on kernel version 5.14-rc3):
>>     #define NPAGES (256)
>>     /* map 1MiB of shared memory */
>>     size_t pagesize = getpagesize();
>>     char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
>>                    MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>>     /* Dirty the new pages. */
>>     for (i = 0; i < NPAGES; i++)
>>         p[i * pagesize] = i;
>>
>> Run the above program in a small cgroup, which causes swapping:
>>     /* Initialise the cgroup and run the program */
>>     $ echo 512K > foo/memory.limit_in_bytes
>>     $ echo 60 > foo/memory.swappiness
>>     $ cgexec -g memory:foo ./pagemap-test
>>
>> Check the pagemap report. Example of the current expected output:
>>     $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
>>     00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>     *
>>     00000710  e1 6b 06 00 00 00 00 a1  9e eb 06 00 00 00 00 a1  |.k..............|
>>     00000720  6b ee 06 00 00 00 00 a1  a5 a4 05 00 00 00 00 a1  |k...............|
>>     00000730  5c bf 06 00 00 00 00 a1  90 b6 06 00 00 00 00 a1  |\...............|
>>
>> The first pagemap entries are reported as zeroes, indicating that the pages
>> have never been allocated, even though they have actually been swapped out.
>>
>> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
>> make use of the XArray associated with the virtual memory area struct
>> passed as an argument.
>> The XArray contains the location of virtual pages in
>> the page cache, swap cache or on disk. If they are in either of the caches,
>> then the original implementation still works. If not, then the missing
>> information will be retrieved from the XArray.
>>
>> Performance
>> ===========
>> I measured the performance of the patch on a single-socket Xeon E5-2620
>> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
>> steps taken:
>>
>> 1. Run the example pagemap-test code in a cgroup:
>>    a. set up the cgroup with limit_in_bytes=4GiB and swappiness=60;
>>    b. allocate 16GiB (about 4 million pages);
>>    c. dirty 0, 50 or 100% of the pages;
>>    d. do this for both private and shared memory.
>> 2. Run `dd if= ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
>>    for each possible configuration above:
>>    a. 3 times for warm-up;
>>    b. 10 times to measure performance, using `time` or another
>>       performance measuring tool.
>>
>> Results (averaged over 10 iterations):
>>              +--------+------------+------------+
>>              | dirty% | pre patch  | post patch |
>>              +--------+------------+------------+
>> private|anon |     0% |      8.15s |      8.40s |
>>              |    50% |     11.83s |     12.19s |
>>              |   100% |     12.37s |     12.20s |
>>              +--------+------------+------------+
>>  shared|anon |     0% |      8.17s |      8.18s |
>>              |    50% | (*) 10.43s |     37.43s |
>>              |   100% | (*) 10.20s |     38.59s |
>>              +--------+------------+------------+
>>
>> (*): reminder that pre-patch produces incorrect pagemap entries for
>> swapped-out pages.
>>
>> From run to run, the above results are stable (mostly <1% stderr).
>>
>> The amount of time it takes for a full read of the pagemap depends on the
>> granularity used by dd to read the pagemap file. Even though the access is
>> sequential, the script only reads 8 bytes at a time, running pagemap_read()
>> COUNT times (once for each page in a 16GiB area).
>>
>> To reduce overhead, we can use batching for large amounts of sequential
>> access.
>> We can make dd read multiple page entries at a time,
>> allowing the kernel to make optimisations and yield more throughput.
>>
>> Performance in real time (seconds) of
>> `dd if= ibs=$((8 * $BATCH)) skip=$(($VADDR / $PAGESIZE / $BATCH))
>> count=$((4194304 / $BATCH))`:
>>
>> +---------------------------------+  +---------------------------------+
>> |     Shared, Anon, 50% dirty     |  |    Shared, Anon, 100% dirty     |
>> +-------+------------+------------+  +-------+------------+------------+
>> | Batch | Pre-patch  | Post-patch |  | Batch | Pre-patch  | Post-patch |
>> +-------+------------+------------+  +-------+------------+------------+
>> |     1 | (*) 10.43s |     37.43s |  |     1 | (*) 10.20s |     38.59s |
>> |     2 | (*)  5.25s |     18.77s |  |     2 | (*)  5.15s |     19.37s |
>> |     4 | (*)  2.63s |      9.42s |  |     4 | (*)  2.63s |      9.74s |
>> |     8 | (*)  1.38s |      4.80s |  |     8 | (*)  1.35s |      4.94s |
>> |    16 | (*)  0.73s |      2.46s |  |    16 | (*)  0.72s |      2.54s |
>> |    32 | (*)  0.40s |      1.31s |  |    32 | (*)  0.41s |      1.34s |
>> |    64 | (*)  0.25s |      0.72s |  |    64 | (*)  0.24s |      0.74s |
>> |   128 | (*)  0.16s |      0.43s |  |   128 | (*)  0.16s |      0.44s |
>> |   256 | (*)  0.12s |      0.28s |  |   256 | (*)  0.12s |      0.29s |
>> |   512 | (*)  0.10s |      0.21s |  |   512 | (*)  0.10s |      0.22s |
>> |  1024 | (*)  0.10s |      0.20s |  |  1024 | (*)  0.10s |      0.21s |
>> +-------+------------+------------+  +-------+------------+------------+
>>
>> To conclude, in order to make the most of the underlying mechanisms of
>> pagemap and the XArray, one should use batching to achieve better
>> performance.
> 
> So what I'm still a bit worried about is whether it will regress some
> existing users. Note that existing users can try to read pagemap in their
> own way; we can't expect all the userspaces to change their behavior due
> to a kernel change.

Then let's provide a way to enable the new behavior for a process if we
don't find another way to extract that information.
I would actually
prefer finding a different interface for that, because with such things
the "pagemap" no longer expresses which pages are currently mapped.
Shared memory is weird.

> 
> Meanwhile, from the numbers, it seems to show a 4x slowdown due to looking
> up the page cache, no matter the size of ibs=. IOW, I don't see a good way
> to avoid that overhead, so no way to have the userspace run as fast as
> before.
> 
> Also note that it's not only affecting the PM_SWAP users; it potentially
> affects all the /proc/pagemap users as long as there's file-backed memory
> in the read region of pagemap, which is very sane to happen.
> 
> That's why I think if we want to persist it, we should still consider
> starting from the pte marker idea.

TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't
store any state information regarding shared memory in per-process page
tables: it just doesn't make much sense.

And this is similar to the SOFTDIRTY or UFFD_WP bits: this information
actually belongs to the shared file ("did *someone* write to this page",
"is *someone* interested in changes to that page", "is there
something"). I know, that screams for a completely different design in
respect to these features.

I guess we are starting to learn the hard way that shared memory is just
different and requires different interfaces than the per-process page
table interfaces we have (pagemap, userfaultfd).

I didn't have time to explore any alternatives yet, but I wonder whether
tracking such state per actual fd/memfd, and not via process page
tables, is actually the right and clean approach. There are certainly
many issues to solve, but conceptually it feels more natural to me to
have these shared memory features not mangled into process page tables.
> 
> I do plan to move the pte marker idea forward unless that'll be NACKed
> upstream for some other reason, because that seems to be the only way for
> uffd-wp to support file-backed memories, no matter whether with a new swp
> type or with a special swap pte. I am even thinking about whether I should
> propose that with PM_SWAP first, because that seems to be a simpler
> scenario than uffd-wp (which would get the rest of the uffd-wp patches
> involved), then we can have a shared infrastructure. But I haven't thought
> deeper than that.
> 
> Thanks,
> 

-- 
Thanks,

David / dhildenb