From: Nicolas Pitre
Subject: Re: [PATCH] don't use mmap() to hash files
Date: Mon, 15 Feb 2010 00:05:37 -0500 (EST)
To: Dmitry Potapov
Cc: Johannes Schindelin, Zygo Blaxell, Ilari Liusvaara, Thomas Rast,
 Jonathan Nieder, git@vger.kernel.org

On Sun, 14 Feb 2010, Dmitry Potapov wrote:

> 1. to introduce a configuration parameter that will define whether to
> use mmap() to hash files or not. It is a trivial change, but the real
> question is what the default value for this option should be (should
> we do some heuristic based on file size vs available memory?)

I don't like that kind of heuristic. Heuristics are almost always
wrong, and any issue they cause is damn hard to reproduce. I tend to
believe that mmap() works better anyway: it lets the OS page memory in
and out as needed, whereas reading the data into allocated memory is
only going to force the system into swap.

> 2. to stream files in chunks. It is better because it is faster,
> especially on large files, as you calculate the SHA-1 and zip the
> data while they are still in the CPU cache. However, it may be more
> difficult to implement, because we have filters that should be
> applied to files that are put into the repository.

So? "More difficult" when it is the right thing to do is no excuse not
to do it and settle for a half solution. Merely replacing mmap() with
read() has drawbacks while the advantages aren't that many. Gaining a
few percent in speed while becoming less robust when memory is tight
isn't such a great compromise to me. BUT if you were to replace mmap()
with read() and make the process chunked, then you do improve both
speed _and_ memory usage.
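To make that concrete, here is a rough sketch of the chunked approach,
as a hypothetical hash_fd_chunked(). Illustration only: it calls
OpenSSL's SHA1_* interface directly whereas git goes through its
git_SHA_CTX wrappers, the deflate step is only hinted at in a comment,
and the 128k chunk size is an arbitrary pick for the example, not a
value git actually uses:

#include <openssl/sha.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

#define CHUNK_SIZE (128 * 1024)

/* Hash a blob in fixed-size chunks so memory use stays bounded. */
static int hash_fd_chunked(int fd, unsigned char sha1[20])
{
	SHA_CTX ctx;
	struct stat st;
	char hdr[32];
	int hdrlen;
	char buf[CHUNK_SIZE];

	if (fstat(fd, &st) < 0)
		return -1;

	SHA1_Init(&ctx);

	/* a git object hash covers a "blob <size>" header plus its NUL */
	hdrlen = snprintf(hdr, sizeof(hdr), "blob %lu",
			  (unsigned long)st.st_size) + 1;
	SHA1_Update(&ctx, hdr, hdrlen);

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;	/* read error */
		}
		if (!n)
			break;		/* EOF */
		/* deflate could consume the same buffer right here,
		   while it is still hot in the CPU cache */
		SHA1_Update(&ctx, buf, n);
	}
	SHA1_Final(sha1, &ctx);
	return 0;
}

The point is that the working set never exceeds one chunk no matter
how big the file is, and hashing (and eventually deflating) touches
each buffer while it is still cache-hot, which is where the speed gain
comes from.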
As to huge files: we have that core.bigFileThreshold variable now, and
anything that crosses it should be considered "stream in / stream out"
without further consideration. That means no diff, no rename
similarity estimates, no delta, no filter, no blame, no fancies. If
you have source code files that big then you already have a bigger
problem anyway. Typical huge files are rarely manipulated, and when
they are, it is pretty unlikely that anyone compares them with other
versions using diff; manipulating them at all also means that you have
the storage capacity and network bandwidth to deal with them. Hence
repository tightness is not your top concern in that case, but
repack/checkout speed most likely is.

So big files should be streamed into a pack of their own at "git add"
time. Repack will then simply "reuse pack data" without any delta
compression attempts, meaning that they will be streamed into a single
huge pack with no issue (this particular case is already supported in
the code).

> 3. to improve Git to support huge files on computers with low memory.

That comes for free with #2.


Nicolas