From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262817AbTJJQ0b (ORCPT ); Fri, 10 Oct 2003 12:26:31 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S263002AbTJJQ0b (ORCPT ); Fri, 10 Oct 2003 12:26:31 -0400 Received: from inet-mail4.oracle.com ([148.87.2.204]:41178 "EHLO inet-mail4.oracle.com") by vger.kernel.org with ESMTP id S262817AbTJJQ02 (ORCPT ); Fri, 10 Oct 2003 12:26:28 -0400 Date: Fri, 10 Oct 2003 09:26:07 -0700 From: Joel Becker To: Linus Torvalds Cc: Trond Myklebust , Ulrich Drepper , Linux Kernel Subject: Re: statfs() / statvfs() syscall ballsup... Message-ID: <20031010162606.GB28773@ca-server1.us.oracle.com> Mail-Followup-To: Linus Torvalds , Trond Myklebust , Ulrich Drepper , Linux Kernel References: <20031010152710.GA28773@ca-server1.us.oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Burt-Line: Trees are cool. X-Red-Smith: Ninety feet between bases is perhaps as close as man has ever come to perfection. User-Agent: Mutt/1.5.4i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 10, 2003 at 09:00:23AM -0700, Linus Torvalds wrote: > I'm hoping in-memory databases will just kill off the current crop > totally. That solves all the IO problems - the only thing that goes to > disk is the log and the backups, and both go there totally linearly unless > the designer was crazy. Memory is continuously too small and too expensive. Even if you can buy a machine with 10TB of RAM, the price is going to be prohibitive. And when 10TB of RAM costs better, the database is going to be 100TB. I'm not saying that in-memory is bad. Big databases do everything they can to make the workload look almost like in-memory. It's the only way to go. > But why do you think you need O_DIRECT with very bad semantics to handle > this? I don't need O_DIRECT with bad semantics. I need the semantics I need, I know that other OSes have O_DIRECT to provide those capabilities, and everyone loves portability. That said... > O_DIRECT throws the cache part away, but it throws out the baby with the > bathwater, and breaks the other parts. Which is why O_DIRECT breaks things > like disk scheduling in really subtle ways - think about writing and > reading to the same area on the disk, and re-ordering at all different > levels. Sure, but you don't do that. The breakage in mixing O_DIRECT with pagecache I/O to the same areas of the disk isn't even all that subtle. But you shouldn't be doing that, at least constantly. > And the thing is, uncaching is _trivial_. It's not like it is hard to say > "try to get rid of these pages if they aren't mapped anywhere" and "insert > this user page directly into the page cache". But people are so fixated > with "direct to disk" that they don't even think about it. I'm not fixated. "Use this user page for the page cache entry for this offset into the file", "Change this user page from representing this offset in this file to representing that offset in that file", and "whatever you do, always read/write from backing store for this page" are the semantics needed. For the latter, you'd have to have a way for the app to trigger a read or write out of the cache. You don't want to do it on every page modification or access, that's too often. The application knows the syncronization points, not the kernel. Joel -- "There is a country in Europe where multiple-choice tests are illegal." - Sigfried Hulzer Joel Becker Senior Member of Technical Staff Oracle Corporation E-mail: joel.becker@oracle.com Phone: (650) 506-8127