git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dmitry Potapov <dpotapov@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Zygo Blaxell <zblaxell@esightcorp.com>,
	Ilari Liusvaara <ilari.liusvaara@elisanet.fi>,
	Thomas Rast <trast@student.ethz.ch>,
	Jonathan Nieder <jrnieder@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH] Teach "git add" and friends to be paranoid
Date: Sun, 21 Feb 2010 10:21:42 +0300	[thread overview]
Message-ID: <20100221072142.GA5829@dpotapov.dyndns.org> (raw)
In-Reply-To: <7v8waniue8.fsf@alter.siamese.dyndns.org>

On Sat, Feb 20, 2010 at 11:23:11AM -0800, Junio C Hamano wrote:
> 
> There needs some benchmarking to justify it, I think.
> 
> So I'd split this hunk out when queuing.

I completely argee with your reasoning that it is a separate issue and
it needs some benchmarking to prove its usefulness. So, I have done it
today.

For that, I have created a repository up to 512Mb size and containing up
to 100,000 files (for small files the file number was the limiting
factor, for big files, the total size limit was used). I intentionally
limit the total size to 512Mb to make sure that they are in the FS cache
and no disk related effects.  The content of files have been generated
randomly by /dev/urandom and then used in all tests for one file size.
I have made 5 runs using mmap() (with git 1.7.0) and 5 runs with using
read() (after applying the patch below).

Below is the best result of 5 runs for each size. "Before" marks the
original version, which uses mmap(). "After" marks the modified version
using read(). The command used to measure time was:

cat list | time git hash-object --stdin-paths >/dev/null

So it is just calculating SHA-1 without deflating and writing the
result on the disk, which significantly depends on fsync() speed.

Tested on: Intel(R) Core(TM)2 Quad  CPU   Q9300  @ 2.50GHz

Here are results:

file size = 1Kb; Hashing 100000 files
Before:
0.63user 0.86system 0:01.49elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+100428minor)pagefaults 0swaps
After:
0.54user 0.53system 0:01.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+421minor)pagefaults 0swaps

file size = 2Kb; Hashing 100000 files
Before:
1.04user 0.79system 0:01.82elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+100428minor)pagefaults 0swaps
After:
0.95user 0.48system 0:01.43elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+422minor)pagefaults 0swaps

file size = 4Kb; Hashing 100000 files
Before:
1.73user 0.74system 0:02.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+100428minor)pagefaults 0swaps
After:
1.54user 0.57system 0:02.11elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+424minor)pagefaults 0swaps

file size = 8Kb; Hashing 64000 files
Before:
1.86user 0.63system 0:02.48elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.75user 0.50system 0:02.23elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+424minor)pagefaults 0swaps

file size = 16Kb; Hashing 32000 files
Before:
1.73user 0.41system 0:02.14elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.68user 0.32system 0:02.00elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+425minor)pagefaults 0swaps

file size = 32Kb; Hashing 16000 files
Before:
1.65user 0.32system 0:01.96elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.64user 0.24system 0:01.87elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+431minor)pagefaults 0swaps

file size = 64Kb; Hashing 8000 files
Before:
1.71user 0.17system 0:01.87elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.63user 0.18system 0:01.81elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+438minor)pagefaults 0swaps

file size = 128Kb; Hashing 4000 files
Before:
1.60user 0.20system 0:01.79elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.55user 0.20system 0:01.75elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+454minor)pagefaults 0swaps

file size = 256Kb; Hashing 2000 files
Before:
1.62user 0.15system 0:01.77elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128429minor)pagefaults 0swaps
After:
1.56user 0.16system 0:01.71elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+551minor)pagefaults 0swaps

file size = 512Kb; Hashing 1000 files
Before:
1.59user 0.17system 0:01.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128428minor)pagefaults 0swaps
After:
1.56user 0.15system 0:01.71elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+679minor)pagefaults 0swaps

file size = 1024Kb; Hashing 500 files
Before:
1.64user 0.15system 0:01.78elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+128429minor)pagefaults 0swaps
After:
1.61user 0.15system 0:01.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+934minor)pagefaults 0swaps


As you can see, in all tests the read() version performed better than
mmap() though the difference decreases with increase of the file size.
While for 1Kb files, the speed up is 39% (based on the elapsed time),
it is mere 1% for 1Mb file size.

Note: I do not use strbuf_read(), because it is suboptimal to deal with
this case, because we know the size ahead. (In fact, strbuf_read() is
not so good even for unknown size as it does redundant strbuf_grow()
almost in every use case, which probably needs to be fixed).

-- >8 --
From 6b3f8335dece7c9b9f810b1ab08f1bcb090e4d5e Mon Sep 17 00:00:00 2001
From: Dmitry Potapov <dpotapov@gmail.com>
Date: Sun, 21 Feb 2010 09:32:19 +0300
Subject: [PATCH] hash-object: don't use mmap() for small files

Using read() instead of mmap() can be 39% speed up for 1Kb files and is
1% speed up 1Mb files. For larger files, it is better to use mmap(),
because the difference between is not significant, and when there is not
enough memory, mmap() performs much better, because it avoids swapping.

Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
---
 sha1_file.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 657825e..8a83e56 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2434,6 +2434,8 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 	return ret;
 }
 
+#define SMALL_FILE_SIZE (1024*1024)
+
 int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	     enum object_type type, const char *path)
 {
@@ -2448,6 +2450,14 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 		else
 			ret = -1;
 		strbuf_release(&sbuf);
+	} else if (size <= SMALL_FILE_SIZE) {
+		char *buf = xmalloc(size);
+		if (size == read_in_full(fd, buf, size))
+			ret = index_mem(sha1, buf, size, write_object, type,
+					path);
+		else
+			ret = error("short read %s", strerror(errno));
+		free(buf);
 	} else if (size) {
 		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
 		ret = index_mem(sha1, buf, size, write_object, type, path);
-- 
1.7.0

-- >8 --


Thanks,
Dmitry

  reply	other threads:[~2010-02-21  7:22 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20100211234753.22574.48799.reportbug@gibbs.hungrycats.org>
2010-02-12  0:27 ` Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs Jonathan Nieder
2010-02-12  1:23   ` Zygo Blaxell
2010-02-13 12:12     ` Jonathan Nieder
2010-02-13 13:39       ` Ilari Liusvaara
2010-02-13 14:39         ` Thomas Rast
2010-02-13 16:29           ` Ilari Liusvaara
2010-02-13 22:09             ` Dmitry Potapov
2010-02-13 22:37               ` Zygo Blaxell
2010-02-14  1:18                 ` [PATCH] don't use mmap() to hash files Dmitry Potapov
2010-02-14  1:37                   ` Junio C Hamano
2010-02-14  2:18                     ` Dmitry Potapov
2010-02-14  3:14                       ` Junio C Hamano
2010-02-14 11:14                         ` Thomas Rast
2010-02-14 11:46                           ` Junio C Hamano
2010-02-14  1:53                   ` Johannes Schindelin
2010-02-14  2:00                     ` Junio C Hamano
2010-02-14  2:42                     ` Dmitry Potapov
2010-02-14 11:07                       ` Jakub Narebski
2010-02-14 11:55                       ` Paolo Bonzini
2010-02-14 18:10                       ` Johannes Schindelin
2010-02-14 19:06                         ` Dmitry Potapov
2010-02-14 19:22                           ` Johannes Schindelin
2010-02-14 19:28                             ` Johannes Schindelin
2010-02-14 19:56                               ` Dmitry Potapov
2010-02-14 23:52                                 ` Zygo Blaxell
2010-02-15  5:05                                 ` Nicolas Pitre
2010-02-15 12:23                                   ` Dmitry Potapov
2010-02-15  7:48                                 ` Paolo Bonzini
2010-02-15 12:25                                   ` Dmitry Potapov
2010-02-14 19:55                             ` Dmitry Potapov
2010-02-14 23:13                           ` Avery Pennarun
2010-02-15  4:16                             ` Nicolas Pitre
2010-02-15  5:01                               ` Avery Pennarun
2010-02-15  5:48                                 ` Nicolas Pitre
2010-02-15 19:19                                   ` Avery Pennarun
2010-02-15 19:29                                     ` Nicolas Pitre
2010-02-14  3:05                   ` [PATCH v2] " Dmitry Potapov
2010-02-18  1:16                   ` [PATCH] Teach "git add" and friends to be paranoid Junio C Hamano
2010-02-18  1:20                     ` Junio C Hamano
2010-02-18 15:32                       ` Zygo Blaxell
2010-02-19 17:51                         ` Junio C Hamano
2010-02-18  1:38                     ` Jeff King
2010-02-18  4:55                       ` Nicolas Pitre
2010-02-18  5:36                         ` Junio C Hamano
2010-02-18  7:27                           ` Wincent Colaiuta
2010-02-18 16:18                             ` Zygo Blaxell
2010-02-18 18:12                               ` Jonathan Nieder
2010-02-18 18:35                                 ` Junio C Hamano
2010-02-22 12:59                           ` Paolo Bonzini
2010-02-22 13:33                             ` Dmitry Potapov
2010-02-18 10:14                     ` Thomas Rast
2010-02-18 18:16                       ` Junio C Hamano
2010-02-18 19:58                         ` Nicolas Pitre
2010-02-18 20:11                           ` 16 gig, 350,000 file repository Bill Lear
2010-02-18 20:58                             ` Nicolas Pitre
2010-02-19  9:27                               ` Erik Faye-Lund
2010-02-22 22:20                               ` Bill Lear
2010-02-22 22:31                                 ` Nicolas Pitre
2010-02-18 20:14                           ` [PATCH] Teach "git add" and friends to be paranoid Peter Harris
2010-02-18 20:17                           ` Junio C Hamano
2010-02-18 21:30                             ` Nicolas Pitre
2010-02-19  1:04                               ` Jonathan Nieder
2010-02-19 15:26                                 ` Zygo Blaxell
2010-02-19 17:52                                   ` Junio C Hamano
2010-02-19 19:08                                     ` Zygo Blaxell
2010-02-19  8:28                     ` Dmitry Potapov
2010-02-19 17:52                       ` Junio C Hamano
2010-02-20 19:23                         ` Junio C Hamano
2010-02-21  7:21                           ` Dmitry Potapov [this message]
2010-02-21 19:32                             ` Junio C Hamano
2010-02-22  3:35                               ` Dmitry Potapov
2010-02-22  6:59                                 ` Junio C Hamano
2010-02-22 12:25                                   ` Dmitry Potapov
2010-02-22 15:40                                   ` Nicolas Pitre
2010-02-22 16:01                                     ` Dmitry Potapov
2010-02-22 17:31                                     ` Zygo Blaxell
2010-02-22 18:01                                       ` Nicolas Pitre
2010-02-22 19:56                                         ` Junio C Hamano
2010-02-22 20:52                                           ` Nicolas Pitre
2010-02-22 18:05                                       ` Dmitry Potapov
2010-02-22 18:14                                         ` Nicolas Pitre
2010-02-14  1:36   ` mmap with MAP_PRIVATE is useless (was Re: Bug#569505: git-core: 'git add' corrupts repository if the working directory is modified as it runs) Paolo Bonzini
2010-02-14  1:53     ` mmap with MAP_PRIVATE is useless Junio C Hamano
2010-02-14  2:11       ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100221072142.GA5829@dpotapov.dyndns.org \
    --to=dpotapov@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=ilari.liusvaara@elisanet.fi \
    --cc=jrnieder@gmail.com \
    --cc=trast@student.ethz.ch \
    --cc=zblaxell@esightcorp.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).