Date: Sat, 21 Jul 2012 13:17:00 +0200
Subject: Re: [PATCH 0/2] [RFC] Volatile ranges (v4)
From: Dmitry Adamushko
To: Dmitry Vyukov
Cc: John Stultz, LKML
References: <1331938267-13583-1-git-send-email-john.stultz@linaro.org>

[ cc: lkml ]

>> > There is a property of shadow memory that I would like to exploit
>> > - any region of shadow memory can be reset to zero at any point
>> > w/o any bad consequences (it can lead to missed data
>> > races, but it's better than OOM kill).
>> > I've tried to execute madvise(MADV_DONTNEED) every X
>> > seconds for whole shadow memory. It almost works.
>> > The problem is that madvise() seems to be
>> > not atomic, occasionally I see missed writes, that's not acceptable,
>>
>> Just to be sure, you mean that if you do, say
>>
>>   *ptr = 1; // [1]
>>   ...
>>   value = *ptr; // [2] value is 1 here
>>   ...
>>   *ptr = 2; // possibly from another thread, but we can be certain that
>>             it's after [1], perhaps because we checked the content with [2]
>>   ...
>>   // madvise(..., MADV_DONTNEED); _might_ have been called
>>   ...
>>   value = *ptr;
>>
>> so here we expect 'value' to be either 2 or 0 (zero page iff madvise()
>> did take place), but you get '1' occasionally?
>> Is that what you mean or something else?
>
>
>
> Yes, that's what I mean.
> Basically I observed inconsistent state of memory that must never happen.
> I had no other explanations except that the madvise() call works as a time
> machine. I executed madvise() every 3 seconds, and the inconsistencies
> happened exactly at the same times. When I turned off madvise(), the
> problems disappeared.

Did you try disabling swapping? The "time machine" (if it's not a
problem somewhere else) should take old stuff from somewhere.

mmap's man page indicates "zero-fill-on-demand pages for mappings without
an underlying file" and the comment in madvise_dontneed() says "Be sure
to free swap resources too", but then looking at the code in
zap_pte_range(), there are a few corner cases where swap entries seem
to be left intact.

In any case, given that you can reproduce it easily, it'll be a quick check.

Also, it's a MAP_PRIVATE mapping, isn't it?

>
> I am not sure whether it's a bug or not, because the man page says "For the
> time being, the application is finished with the given range", and we
> obviously are not, since we have concurrent accesses. However, it would be
> great if it is "fixed".
>
>
>
>> > I need either zero pages or
>> > consistent pages.
>> > Your FADV_VOLATILE looks like it may be the solution.
>> > To summarize: I have a huge region of memory that
>> > I would like to mark as "volatile" at program
>> > startup. The region is anonymous (not backed by any file).
>> > The kernel must be able to take away
>> > any pages in the range, and then return zero pages later.
>>
>> I guess that for the use-cases that people have considered so far,
>> users are supposed to mark regions NONVOLATILE before accessing them
>> again. If I understand correctly, that's not what you want to do...
>
>
> No, it's not what I want to do. I can't do any tracking during accesses.
> Ideally I just mark the range during startup, it's also possible to do some
> work on a periodic basis.
>
>
>>
>> does it mean that your 'transactions' are always write-a-single-word?
>> i.e.
>> you don't need to make sure that, say, in
>>
>>   ptr[0] = val_a;
>>   ptr[1] = val_b;
>>   ... // no accesses to ptr [0] and [1]
>>   c = ptr[0];
>>   d = ptr[1];
>>
>> either c == val_a and d == val_b or both are 0?
>
>
> Exactly. Any transaction first issues N independent 8-byte atomic reads, and
> then optionally 1 atomic 8-byte write. A value of 0 specifically means "no
> data here", because, well, I do not want to set up 40TB of memory to some
> other value :)
>
>
>>
>>
>> Also, the current implementation of volatile-ranges will try to 'zap'
>> the whole volatile range... not just some of its pages. Perhaps, it's
>> not something you need either. Of course, this can be addressed one way
>> or another.
>
>
> Well, yes, it's not ideal (but we have lived with MADV_DONTNEED for the
> whole range for some time). Ideally, the kernel just takes pages away as it
> needs them.
>
>
>
>> Basically, in your specific case, the pages of your region should be,
>> kind of, swapped out to /dev/zero :-)) meaning that once the system
>> decides to swap such a page out, no actual swap is required and, once
>> the area is being accessed again, a zero-page is mapped into it.
>
>
> Yes, I believe that's how MADV_DONTNEED currently works (... or
> zero-fill-on-demand pages for mappings without an underlying file).
> Also the pages must be "swapped out" with higher prio.
> I think what I want is somewhat similar to page cache. Kind of best effort
> LRU caching.
>

-- Dmitry
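
For reference, a minimal single-threaded sketch (not taken from any message
in this thread) of the zero-fill behaviour discussed above. It assumes Linux
and a MAP_PRIVATE | MAP_ANONYMOUS mapping; read_after_dontneed() is a
hypothetical helper name. It only shows the expected "2 or 0" outcome in the
simplest case and does not attempt to reproduce the concurrent
inconsistency that was reported.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS and madvise() */
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write to a private anonymous page, drop it with MADV_DONTNEED, and read
 * it back. Returns the value read after madvise(), or -1 on error.
 */
static int64_t read_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    volatile uint64_t *ptr = mmap(NULL, (size_t)page,
                                  PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED)
        return -1;

    ptr[0] = 2;                  /* the page is now populated */

    /* Tell the kernel the contents of this range are no longer needed. */
    if (madvise((void *)ptr, (size_t)page, MADV_DONTNEED) != 0) {
        munmap((void *)ptr, (size_t)page);
        return -1;
    }

    /* The next access faults in a fresh zero-filled page. */
    int64_t value = (int64_t)ptr[0];

    munmap((void *)ptr, (size_t)page);
    return value;
}
```

Calling read_after_dontneed() from main() and printing the result should
show 0 on Linux, i.e. the "zero page" case; the stale value 1 from the
earlier example should never be observable in this single-threaded form.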