Linus Torvalds wrote: > On Mon, 18 Dec 2006, Peter Zijlstra wrote: > >>This should be safe; page_mkclean walks the rmap and flips the pte's >>under the pte lock and records the dirty state while iterating. >>Concurrent faults will either do set_page_dirty() before we get around >>to doing it or vice versa, but dirty state is not lost. > > > Ok, I really liked this patch, but the more I thought about it, the more I > started to doubt the reasons for liking it. Well this implements my suggestion to redirty the page if there were dirty ptes. I think it is a good fix (whether or not it fixes Andrei's bug, it does fix a bug), though maybe _slightly_ suboptimal. > I think we have some core fundamental problem here that this patch is > needed at all. > > So let's think about this: we apparently have two cases of > "clear_page_dirty()": > > - the one that really wants to clear the bit unconditionally (Andrew > calls this the "must_clean_ptes" case, which I personally find to be a > really confusing name, but whatever) > > - the other case. The case that doesn't want to really clear the pte > dirty bits. I don't think this characterises it correctly. Think about how it worked before the page_mkclean went in there. We really _never_ want to just clear pte dirty bits, because that would be a data loss situation[*]. The only reason we clear PG_dirty is because some filesystem may have cleaned each buffer without realising it has cleaned the whole page. But if you have a dirty pte, then all bets are off: a buffer with a clear dirty bit can not be considered clean. Before the dirty page tracking, it was fine to clear PG_dirty here, because we would pick up the pte dirty info later on. After the page dirty tracking, clearing pte dirty is a bug here, and re-accounting the dirty page is arguably the minimal fix. [*] except in the truncate case where we are happy to throw out dirty data, but in that case there would be no ptes anyway. The only thing I would suggest is not applying Andrew's patch at all, and do the special casing in try_to_free_buffers(). I've attached a patch for comments. > and I thought your patch made sense, because it saved away the pte state > in the page dirty state, and that matches my mental model, but the more I > think about it, the less sense that whole "the other case" situation makes > AT ALL. > > Why does "the other case" exist at all? If you want to clear the dirty > page flag, what is _ever_ the reason for not wanting to drop PTE dirty > information? In other words, what possible reason can there ever be for > saying "I want this page to be clean", while at the same time saying "but > if it was dirty in the page tables, don't forget about that state". We never want to drop dirty data! (ignoring the truncate case, which is handled privately by truncate anyway) This whole exercise is not about cleaning or dirtying or fogetting the actual *data* in the page. It is about bringing the pagecache's notion of whether the page is dirty or clean in line with the (more uptodate) filesystem's notion. After dirty write accounting, we also threw in "the virtual memory manager's notion", but got that case slightly wrong. As unlikely as this race is for SMP systems, I think it is easily possible for PREEMPT kernels. And they have featured in all bug reports, AFAIKS. -- SUSE Labs, Novell Inc.