Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
From: Nadav Amit
Date: Fri, 28 Oct 2022 16:57:32 -0700
To: Peter Zijlstra
Cc: Jann Horn, John Hubbard, X86 ML, Matthew Wilcox, Andrew Morton,
 kernel list, Linux-MM, Andrea Arcangeli, "Kirill A. Shutemov",
 jroedel@suse.de, ubizjak@gmail.com, Linus Torvalds
References: <20221022111403.531902164@infradead.org>
 <20221022114424.515572025@infradead.org>
 <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com>
 <6C548A9A-3AF3-4EC1-B1E5-47A7FFBEB761@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Shutemov" , jroedel@suse.de, ubizjak@gmail.com, Linus Torvalds Content-Transfer-Encoding: quoted-printable Message-Id: References: <20221022111403.531902164@infradead.org> <20221022114424.515572025@infradead.org> <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com> <6C548A9A-3AF3-4EC1-B1E5-47A7FFBEB761@gmail.com> To: Peter Zijlstra X-Mailer: Apple Mail (2.3696.120.41.1.1) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Oct 27, 2022, at 2:44 PM, Nadav Amit wrote: > Anyhow, admittedly, I need to give it more thought. For instance, in = respect > to the code that you mentioned (in zap_pte_range), after reading it = again, > it seems strange: how is ok to defer the TLB flush after the rmap was > removed, even if it is done while the PTL is held. > folio_clear_dirty_for_io() would not sync on the PTL afterwards, so = the page > might be later re-dirtied using a stale cached PTE. Oh well. Peter, So it appears to be a problem - flushing after removing the rmap. I = attach a PoC that shows the problem. The problem is in the following code of zap_pte_range(): if (!PageAnon(page)) { if (pte_dirty(ptent)) { force_flush =3D 1; set_page_dirty(page); } =E2=80=A6 } page_remove_rmap(page, vma, false); Once we remove the rmap, rmap_walk() would not acquire the page-table = lock anymore. As a result, nothing prevents the kernel from performing = writeback and cleaning the page-struct dirty-bit before the TLB is actually = flushed. As a result, if there is an additional thread that has the dirty-PTE = cached in the TLB, it can keep writing to the page and nothing = (PTE/page-struct) will keep track that the page has been dirtied. In other words, writes to the memory mapped file after munmap()/MADV_DONTNEED started can be lost. This affects both munmap() and MADV_DONTNEED. One might argue that if = you don=E2=80=99t keep writing after munmap()/MADV_DONTNEED it=E2=80=99s = your fault (questionable), but if that=E2=80=99s the case why do we bother with = force_flush at all? If we want it to behave correctly - i.e., writes after = munmap/MADV_DONTNEED to propagate to the file - we need to collect dirty pages and remove = their rmap only after we flush the TLB (and before we release the ptl). = mmu_gather would probably need to be changed for this matter. Thoughts? --- Some details about the PoC (below): We got 3 threads that use 2MB range: 1. Maintains a counter in the first 4KB of the 2MB range and checks it actually updates the memory. 2. Dirties pages in 2MB range (just to make the race more likely). 3. Either (i) maps the same mapping at the first 4KB (to test munmap indirectly); or (ii) runs MADV_DONTNEED on the first 4KB. In addition, a child process runs msync and fdatasync to writeback the = first 4KB. The PoC should be run with a file on a block RAM device. It manages to trigger the issue relatively reliably (within 60 seconds) with munmap() = and slightly less reliably with MADV_DONTNEED. I have no idea if it works in = VM, deep C-states, etc.. 

-- >8 --

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>

#define handle_error(msg) \
	do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE	(4096ul)
#define PAGES_PER_PMD	(512)
#define HPAGE_SIZE	(PAGE_SIZE * PAGES_PER_PMD)

// Comment MUNMAP_TEST for MADV_DONTNEED test
#define MUNMAP_TEST

void *dirtying_thread(void *arg)
{
	int i;

	while (!stop) {
		for (i = 1; i < PAGES_PER_PMD; i++) {
			*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
		}
	}
	return NULL;
}

void *checking_thread(void *arg)
{
	volatile unsigned long *ul_p = (volatile unsigned long *)p;
	unsigned long cnt = 0;

	while (!stop) {
		*ul_p = cnt;
		if (*ul_p != cnt) {
			printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
			kill(flusher_pid, SIGTERM);
			exit(0);
		}
		cnt++;
	}
	return NULL;
}

void *remap_thread(void *arg)
{
	void *ptr;
	struct timespec t = {
		.tv_nsec = 10000,
	};

	while (!stop) {
#ifdef MUNMAP_TEST
		ptr = mmap(p, PAGE_SIZE, PROT_READ|PROT_WRITE,
			   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
		if (ptr == MAP_FAILED)
			handle_error("remap_thread");
#else
		if (madvise(p, PAGE_SIZE, MADV_DONTNEED) < 0)
			handle_error("MADV_DONTNEED");
		nanosleep(&t, NULL);
#endif
	}
	return NULL;
}

void flushing_process(void)
{
	// Remove the pages to speed up rmap_walk and allow to drop caches.
	if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
		handle_error("MADV_DONTNEED");

	while (true) {
		if (msync(p, PAGE_SIZE, MS_SYNC))
			handle_error("msync");
		if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
			handle_error("posix_fadvise");
	}
}

int main(int argc, char *argv[])
{
	void *(*thread_funcs[])(void *) = {
		&dirtying_thread,
		&checking_thread,
		&remap_thread,
	};
	int r, i;
	int rc1, rc2;
	unsigned long addr;
	void *ptr;
	char *page = malloc(PAGE_SIZE);
	int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
	pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s [filename]\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		handle_error("open fd");

	for (i = 0; i < PAGES_PER_PMD; i++) {
		if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
			handle_error("write");
	}

	free(page);

	ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
		   -1, 0);
	if (ptr == MAP_FAILED)
		handle_error("mmap anon");

	addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);

	printf("starting...\n");

	ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
	if (ptr == MAP_FAILED)
		handle_error("mmap file - start");

	p = ptr;

	for (i = 0; i < n_threads; i++) {
		r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
		if (r)
			handle_error("pthread_create");
	}

	// Run the flushing process in a different process, so msync() would
	// not require mmap_lock.
	pid = fork();
	if (pid == 0)
		flushing_process();
	flusher_pid = pid;

	sleep(60);

	stop = true;

	for (i = 0; i < n_threads; i++)
		pthread_join(threads[i], NULL);

	kill(flusher_pid, SIGTERM);
	printf("Finished without an error\n");
	exit(0);
}
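
For reference: the PoC needs nothing special to build (just link with
-pthread) and expects its argument to be a file on a RAM-backed block
device such as brd (modprobe brd, mkfs, mount). When the race hits,
checking_thread prints the "FAILED: expected ..., got ..." line and kills
the flusher process; otherwise it stops after 60 seconds and prints
"Finished without an error".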