* On data structures and parallelism
From: Heikki Orsila @ 2009-05-17 15:23 UTC
To: git; +Cc: Linus Torvalds
There was an interesting discussion at
http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2
that involves DAGs and decompression in Git. The problem is achieving
parallelism. The following comment was made:
"And is it possible to store the block pointers from one object to
another in uncompressed form?"
Is there a case in Git where an "object" could store a SHA1 in
uncompressed form, allowing the next object in the chain to be
prefetched before the current object is uncompressed? Prefetching
could increase parallelism (and speed) in some cases.
A quick glance at Git's source code showed that commit objects are
compressed. Having even a single parent SHA1 in uncompressed format
would allow some prefetching. All but perhaps a few objects contain at
least one parent SHA1 :-)
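As an illustration of why that prefetch is currently impossible: a loose
object is a single zlib stream whose payload is "<type> <size>\0<body>",
so the parent SHA1 only becomes visible after inflating the object. A
minimal Python sketch (the helper names here are made up for illustration,
not git code):

```python
import zlib

def make_loose_commit(tree_hex, parent_hex):
    # Loose objects are "<type> <size>\0<body>", deflated as one stream.
    body = ("tree %s\nparent %s\n\nmsg\n" % (tree_hex, parent_hex)).encode()
    raw = b"commit %d\x00" % len(body) + body
    return zlib.compress(raw)

def parent_of(loose):
    # The parent pointer sits INSIDE the compressed stream: we must
    # inflate the object before we even learn which object to fetch next.
    raw = zlib.decompress(loose)
    body = raw.split(b"\x00", 1)[1]
    for line in body.splitlines():
        if line.startswith(b"parent "):
            return line.split()[1].decode()
    return None

obj = make_loose_commit("a" * 40, "b" * 40)
print(parent_of(obj))  # the parent is only known after decompression
```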
--
Heikki Orsila
heikki.orsila@iki.fi
http://www.iki.fi/shd
* Re: On data structures and parallelism
From: Linus Torvalds @ 2009-05-17 17:06 UTC
To: Heikki Orsila; +Cc: git
On Sun, 17 May 2009, Heikki Orsila wrote:
>
> There was an interesting discussion at
>
> http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2
>
> that involves DAGs and decompression in Git. The problem is achieving
> parallelism. The following comment was made:
>
> "And is it possible to store the block pointers from one object to
> another in uncompressed form?"
For the biggest form of this, we actually already do.
The single biggest win of compression is the delta-compression: the
regular zlib compression is generally about a factor-of-two for unpacked
and "base" delta entries, but much less for already delta-compressed
entries. In comparison, the delta-compression is likely about a factor of
10 or more.
And the delta compression already keeps the SHA1 pointer to the delta
base entry uncompressed, but decompression is serialized by the fact
that we need to uncompress the base entry in order to then apply a
delta on top of it.
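To make that serialization concrete, here is a toy delta model (a list of
copy/insert ops against a base; this is a simplified illustration, not
git's actual pack delta encoding): each step needs the fully materialized
previous result, so a single chain is inherently sequential, and only
separate chains can be resolved in parallel.

```python
def apply_delta(base, ops):
    # A delta is a list of ("copy", offset, length) and ("insert", bytes).
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, n = op
            out += base[off:off + n]
        else:
            out += op[1]
    return bytes(out)

def resolve_chain(base, chain):
    # Each delta needs the FULLY materialized previous result, so one
    # chain cannot be decompressed in parallel; only separate chains can.
    cur = base
    for ops in chain:
        cur = apply_delta(cur, ops)
    return cur

base = b"the quick brown fox"
chain = [
    [("copy", 0, 10), ("insert", b"red fox")],
    [("copy", 0, 4), ("insert", b"lazy "), ("copy", 4, 13)],
]
print(resolve_chain(base, chain))  # b"the lazy quick red fox"
```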
Now, there's no question that we could have higher levels of parallelism
by walking multiple such chains in parallel (we don't always have more
than one chain to walk, but sometimes we do). But as I point out in that
thread, we currently don't have any locking for the core object
datastructures, and adding that locking would likely slow down things more
than it speeds things up for the normal case.
For 'git fsck', we could speed things up a lot by doing parallel work -
and we wouldn't need to have anything else uncompressed, we could just
take advantage of the fact that we could try to uncompress the different
delta chains in parallel. And yes, fsck is really slow, but on the other
hand, it's something that most people never do, and I do about once a
month. The fact that it takes four minutes rather than one is not a big
deal.
There are other forms of compression where the SHA1 pointers are inside
the compressed data (the "regular" commit->{commit,tree} relationships and
tree->{anything} cases).
And yes, we could probably get rid of at least the zlib compression in
some of that. Much of that data doesn't even compress very well (SHA1s
are basically incompressible), and the compression is done largely because
we have one unified interface for everything (so the basic object code
doesn't need to care about different object types or different formats:
it's all just binary data with a magic header to it).
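The incompressibility point is easy to check: zlib gains essentially
nothing on a buffer of hash digests, because good hash output is
statistically random. A quick sketch (a demonstration of the general
property, not a statement about git's on-disk layout):

```python
import hashlib, zlib

# Concatenate 1000 SHA-1 digests -- effectively random bytes.
digests = b"".join(hashlib.sha1(str(i).encode()).digest()
                   for i in range(1000))

compressed = zlib.compress(digests, 9)
# zlib finds no redundancy in hash output; the "compressed" form is
# typically slightly LARGER than the input due to stream framing.
print(len(digests), len(compressed))
```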
But in that case, we'd probably not want to keep a separate uncompressed
tree, we'd just decide that "compression is too expensive to be worth it".
That said, on my laptops, CPU time really _never_ is the issue. Every
single time something is slow, the issue is a slow 4200rpm disk that may
get 25MB/s off it for linear things in the best case, but seeks take
milliseconds and any kind of random access will just kill performance.
So in the big picture, I suspect even the wasted CPU-time is worth it. It
makes some operations slower, but anything that makes the on-disk data
denser is good. Because the case I end up caring most about is always the
uncached case.
Linus
* Re: On data structures and parallelism
From: Linus Torvalds @ 2009-05-17 17:46 UTC
To: Heikki Orsila; +Cc: git
On Sun, 17 May 2009, Linus Torvalds wrote:
>
> That said, on my laptops, CPU time really _never_ is the issue. Every
> single time something is slow, the issue is a slow 4200rpm disk that may
> get 25MB/s off it for linear things in the best case, but seeks take
> milliseconds and any kind of random access will just kill performance.
Side note - I've several times desperately tried to see if IO parallelism
helps. It doesn't. Some drives do better if they get many independent
reads and can just do them concurrently. Sadly, that's pretty rare for
reads on rotational media, and impossible with legacy IDE drives (that
don't have the ability to do tagged queueing).
So when I try to do IO in parallel (which git does support for many
operations), that just makes the whole system come to a screeching halt
because it now seeks around the disk a lot more. The same issues that
often kill parallelism on CPUs (bad cache behavior, and lots of
outstanding memory requests) kill parallelism on disks too - disk
performance simply is much _better_ if you do serial things than if you
try to parallelize the same work.
It would be different if I had a fancy high-end RAID system with tagged
queueing and lots of spare bandwidth that could be used in parallel. But
that's not what the git usage scenario often is. All the people pushing
multi-core seem to always ignore the big issues, and are always working
on nice trivial problems with a small and well-behaved "kernel" that has
no IO and preferably didn't cache well even when single-threaded (ie
"streaming" data).
Linus
* Re: On data structures and parallelism
From: david @ 2009-05-17 19:31 UTC
To: Linus Torvalds; +Cc: Heikki Orsila, git
On Sun, 17 May 2009, Linus Torvalds wrote:
> On Sun, 17 May 2009, Linus Torvalds wrote:
>>
>> That said, on my laptops, CPU time really _never_ is the issue. Every
>> single time something is slow, the issue is a slow 4200rpm disk that may
>> get 25MB/s off it for linear things in the best case, but seeks take
>> milliseconds and any kind of random access will just kill performance.
>
> Side note - I've several times desperately tried to see if IO parallelism
> helps. It doesn't. Some drives do better if they get many independent
> reads and can just do them concurrently. Sadly, that's pretty rare for
> reads on rotational media, and impossible with legacy IDE drives (that
> don't have the ability to do tagged queueing).
>
> So when I try to do IO in parallel (which git does support for many
> operations), that just makes the whole system come to a screeching halt
> because it now seeks around the disk a lot more. The same issues that
> often kill parallelism on CPUs (bad cache behavior, and lots of
> outstanding memory requests) kill parallelism on disks too - disk
> performance simply is much _better_ if you do serial things than if you
> try to parallelize the same work.
>
> It would be different if I had a fancy high-end RAID system with tagged
> queueing and lots of spare bandwidth that could be used in parallel. But
> that's not what the git usage scenario often is. All the people pushing
> multi-core seem to always ignore the big issues, and are always working
> on nice trivial problems with a small and well-behaved "kernel" that has
> no IO and preferably didn't cache well even when single-threaded (ie
> "streaming" data).
Do things change with SSDs? I've heard that even (especially??) with the
Intel SSDs you want to have several operations going in parallel to get
the best out of them.
David Lang
* Re: On data structures and parallelism
From: Linus Torvalds @ 2009-05-17 20:35 UTC
To: david; +Cc: Heikki Orsila, git
On Sun, 17 May 2009, david@lang.hm wrote:
>
> do things change with SSDs? I've heard that even (especially??) with the Intel
> SSDs you want to have several operations going in parallel to get the best
> out of them.
There's a slight, but noticeable, improvement.
This is: "echo 3 > /proc/sys/vm/drop_caches; time git diff" run in a loop.
With 'core.preloadindex = true':
real 0m1.138s
real 0m1.116s
real 0m1.132s
real 0m1.120s
real 0m1.106s
real 0m1.132s
and with it set to 'false':
real 0m1.256s
real 0m1.258s
real 0m1.242s
real 0m1.240s
real 0m1.244s
real 0m1.242s
so it's about a 10% improvement. Which is pretty good, considering
that
(a) those disks are fast enough that even for that totally cache-cold
case, I get about 35% CPU utilization for the single-threaded case.
And that's despite this being a 3.2GHz Nehalem box, so 35% CPU is
really quite remarkably good. On my (much slower) laptop with a
1.2GHz Core 2, I get 2-3% CPU-time (and the whole operation takes 20
seconds).
(b) Not all the IO ends up being parallelized, since there is a
per-directory mutex that means that even though we start 20 threads,
it probably gets a much smaller amount of real parallelism due to
locking.
In general, the IO parallelization obviously helps most when the IO is
slow _and_ overlaps perfectly. Perfect overlap doesn't end up happening
due to the per-directory lookup semaphore (think of it like a bank
conflict in trying to parallelize memory accesses), but with a slow NFS
connection you should get reasonably close to that optimal situation.
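What the index preloading does can be sketched roughly like this:
partition the index paths into slices and lstat() each slice from its
own thread, so the stat calls overlap. This is an illustrative Python
sketch with made-up helper names, not git's actual C code, and it omits
the per-directory contention described above:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def preload_stats(paths, threads=20):
    # Partition the paths into one slice per thread and lstat each
    # slice concurrently; os.lstat releases the GIL during the syscall,
    # so the IO genuinely overlaps.
    def stat_slice(slice_):
        return [os.lstat(p) for p in slice_]
    slices = [paths[i::threads] for i in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(stat_slice, slices)
    return sum(len(r) for r in results)

# Example: stat everything under the current directory.
paths = [os.path.join(d, f) for d, _, fs in os.walk(".") for f in fs]
print(preload_stats(paths), "entries preloaded")
```

On a cold cache this mainly helps when the per-stat latency is high
(the slow-NFS case above); on a single local spindle the seeks still
serialize underneath.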
But with a single spindle, and rotating media, there really is sadly very
little room for optimization. I suspect a SATA disk with TCQ might be able
to do _somewhat_ better than my old PATA-only laptop (discounting the fact
that my PATA laptop hard disk is extra slow due to being just 4200rpm: any
desktop disk will be much faster), but I doubt the index preloading is
really all that noticeable.
In fact, I just tested on another machine, and saw no difference
what-so-ever. If anything, it was slightly slower. I suspect TCQ is a
bigger win with writes.
Linus