* On data structures and parallelism
From: Heikki Orsila @ 2009-05-17 15:23 UTC
  To: git; +Cc: Linus Torvalds

There was an interesting discussion at

http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2

that involves DAGs and decompression in Git. The problem is achieving 
parallelism. The following comment was made:

"And is it possible to store the block pointers from one object to 
another in uncompressed form?"

Is there a case in Git where an "object" could store a SHA1 in 
uncompressed form, allowing the next object in a chain to be prefetched 
before the current object is uncompressed? Prefetching could increase 
parallelism (and speed) in some cases.

A quick glance at Git's source code showed that commit objects are 
compressed. Having even a single parent SHA1 in uncompressed form 
would allow some prefetching. All but a few commit objects (the root 
commits) contain at least one parent SHA1 :-)
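
To make the idea concrete, here is a minimal sketch of the prefetch 
step, assuming the parent SHA1 could be read without inflating the 
commit. The path layout is git's loose-object scheme; the function 
itself is hypothetical, and posix_fadvise() is plain POSIX:

    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical: given the hex SHA1 of a parent commit, ask the
     * kernel to start paging the corresponding loose object in, so
     * that the read overlaps with inflating the current object. */
    static void prefetch_loose_object(const char *gitdir, const char *hex)
    {
            char path[PATH_MAX];
            int fd;

            /* loose objects live at objects/<2 hex chars>/<38 more> */
            snprintf(path, sizeof(path), "%s/objects/%.2s/%s",
                     gitdir, hex, hex + 2);
            fd = open(path, O_RDONLY);
            if (fd < 0)
                    return;
            posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED); /* readahead hint */
            close(fd);
    }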

-- 
Heikki Orsila
heikki.orsila@iki.fi
http://www.iki.fi/shd


* Re: On data structures and parallelism
From: Linus Torvalds @ 2009-05-17 17:06 UTC
  To: Heikki Orsila; +Cc: git



On Sun, 17 May 2009, Heikki Orsila wrote:
>
> There was an interesting discussion at
> 
> http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2
> 
> that involves DAGs and decompression in Git. The problem is achieving 
> parallelism. The following comment was made:
> 
> "And is it possible to store the block pointers from one object to 
> another in uncompressed form?"

For the biggest form of this, we actually already do.

The single biggest win is the delta compression: the regular zlib 
compression generally buys about a factor of two for unpacked and 
"base" delta entries, and much less for already delta-compressed 
entries. The delta compression, in comparison, is likely a factor of 
ten or more.

And the delta compression already keeps the SHA1 pointer to the delta 
base entry uncompressed - but decompression is serialized by the fact 
that we need to uncompress the base entry before we can apply a delta 
on top of it.
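
Schematically, the serialization looks like this (a sketch with 
illustrative types and helpers, not git's actual pack code):

    #include <stdlib.h>

    struct entry {
            struct entry *base;  /* delta base, or NULL for a full object */
            void *zdata;         /* zlib-compressed payload */
    };

    /* hypothetical stand-ins for the real zlib and delta code */
    void *inflate_entry(void *zdata, unsigned long *size);
    void *apply_delta(void *base, unsigned long base_size,
                      void *delta, unsigned long *size);

    static void *materialize(struct entry *e, unsigned long *size)
    {
            void *buf = inflate_entry(e->zdata, size);
            if (e->base) {
                    unsigned long base_size;
                    /* the base must be fully materialized first ... */
                    void *base = materialize(e->base, &base_size);
                    /* ... only then can the delta be applied to it */
                    void *result = apply_delta(base, base_size, buf, size);
                    free(base);
                    free(buf);
                    buf = result;
            }
            return buf;
    }

Each step needs the complete output of the previous one, so a single 
chain cannot be parallelized internally; only separate chains can.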

Now, there's no question that we could get higher levels of parallelism 
by walking multiple such chains in parallel (we don't always have more 
than one chain to walk, but sometimes we do). But as I point out in that 
thread, we currently don't have any locking for the core object data 
structures, and adding that locking would likely slow things down more 
than it speeds them up in the normal case.

For 'git fsck', we could speed things up a lot by doing parallel work - 
and we wouldn't need anything else to be uncompressed; we could just 
take advantage of the fact that the different delta chains can be 
uncompressed in parallel. And yes, fsck is really slow, but on the 
other hand, it's something that most people never do and that I do 
about once a month. The fact that it takes four minutes rather than 
one is not a big deal.
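
A sketch of what that parallel fsck work could look like - worker 
threads pulling independent delta chains off a shared list, with the 
only lock on the list itself (verify_chain() and the work list are 
hypothetical; the locking pattern is the point):

    #include <pthread.h>
    #include <stddef.h>

    struct chain {
            struct chain *next;
            /* ... chain tip, pack offsets, etc ... */
    };

    void verify_chain(struct chain *c);  /* hypothetical: serial per chain */

    static struct chain *work_list;      /* independent delta chains */
    static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *fsck_worker(void *unused)
    {
            for (;;) {
                    struct chain *c;

                    pthread_mutex_lock(&work_lock);
                    c = work_list;
                    if (c)
                            work_list = c->next;
                    pthread_mutex_unlock(&work_lock);

                    if (!c)
                            return NULL;
                    verify_chain(c); /* inflate + apply deltas serially */
            }
    }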

There are other forms of compression where the SHA1 pointers are inside 
the compressed data (the "regular" commit->{commit,tree} relationships and 
tree->{anything} cases).
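
For concreteness, the payload of a commit object looks like this 
(placeholder values, message shortened), and all of it - the tree and 
parent pointers included - sits inside the zlib stream:

    tree <40-hex-sha1>
    parent <40-hex-sha1>
    author A U Thor <author@example.com> 1242573805 +0300
    committer A U Thor <author@example.com> 1242573805 +0300

    commit message text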

And yes, we could probably get rid of at least the zlib compression in 
some of that. Much of that data doesn't even compress very well (SHA1s 
are basically incompressible), and the compression is done largely 
because we have one unified interface for everything (so the basic 
object code doesn't need to care about different object types or 
different formats: it's all just binary data with a magic header).
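
Since that unified interface is the reason even incompressible data 
goes through zlib, here is a sketch of the write side. The 
"<type> <size>\0<payload>" layout is git's actual loose-object format; 
the function itself is illustrative and skips error checking:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* A loose object is "<type> <size>\0" followed by the payload,
     * deflated as one zlib stream, so the object layer never needs
     * type-specific code. */
    static unsigned long write_loose(const char *type, void *payload,
                                     unsigned long len,
                                     unsigned char *out, unsigned long outlen)
    {
            char hdr[32];
            int hdrlen = snprintf(hdr, sizeof(hdr), "%s %lu", type, len) + 1;
            z_stream s;

            memset(&s, 0, sizeof(s));
            deflateInit(&s, Z_DEFAULT_COMPRESSION);
            s.next_out = out;
            s.avail_out = outlen;

            s.next_in = (unsigned char *)hdr;  /* header first, NUL included */
            s.avail_in = hdrlen;
            deflate(&s, Z_NO_FLUSH);

            s.next_in = payload;               /* then the payload */
            s.avail_in = len;
            deflate(&s, Z_FINISH);

            deflateEnd(&s);
            return s.total_out;                /* compressed size */
    }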

But in that case, we'd probably not want to keep a separate uncompressed 
tree, we'd just decide that "compression is too expensive to be worth it".

That said, on my laptops, CPU time really _never_ is the issue. Every 
single time something is slow, the issue is a slow 4200rpm disk that may 
get 25MB/s off it for linear things in the best case, but seeks take 
milliseconds and any kind of random access will just kill performance.

So in the big picture, I suspect even the wasted CPU time is worth it. It 
makes some operations slower, but anything that makes the on-disk data 
denser is good. Because the case I end up caring most about is always the 
uncached case.

			Linus


* Re: On data structures and parallelism
From: Linus Torvalds @ 2009-05-17 17:46 UTC
  To: Heikki Orsila; +Cc: git



On Sun, 17 May 2009, Linus Torvalds wrote:
> 
> That said, on my laptops, CPU time really _never_ is the issue. Every 
> single time something is slow, the issue is a slow 4200rpm disk that may 
> get 25MB/s off it for linear things in the best case, but seeks take 
> milliseconds and any kind of random access will just kill performance.

Side note - I've several times desperately tried to see if IO parallelism 
helps. It doesn't. Some drives do better if they get many independent 
reads and can just do them concurrently. Sadly, that's pretty rare for 
reads on rotational media, and impossible with legacy IDE drives (that 
don't have the ability to do tagged queueing). 

So when I try to do IO in parallel (which git does support for many 
operations), that just makes the whole system come to a screeching halt 
because it now seeks around the disk a lot more. The same kind of issue 
that often kills parallelism on CPUs (bad cache behavior, and lots of 
outstanding memory requests) kills parallelism on disks too - disk 
performance simply is much _better_ if you do serial things than if you 
try to parallelize the same work.

It would be different if I had a fancy high-end RAID system with tagged 
queueing and lots of spare bandwidth that could be used in parallel. But 
that's not what the git usage scenario usually is. All the people pushing 
multi-core seem to always ignore the big issues, and always work on 
nice trivial problems with a small and well-behaved "kernel" that has no 
IO and preferably doesn't even cache well when single-threaded (ie 
"streaming" data).

			Linus


* Re: On data structures and parallelism
From: david @ 2009-05-17 19:31 UTC
  To: Linus Torvalds; +Cc: Heikki Orsila, git

On Sun, 17 May 2009, Linus Torvalds wrote:

> On Sun, 17 May 2009, Linus Torvalds wrote:
>>
>> That said, on my laptops, CPU time really _never_ is the issue. Every
>> single time something is slow, the issue is a slow 4200rpm disk that may
>> get 25MB/s off it for linear things in the best case, but seeks take
>> milliseconds and any kind of random access will just kill performance.
>
> Side note - I've several times desperately tried to see if IO parallelism
> helps. It doesn't. Some drives do better if they get many independent
> reads and can just do them concurrently. Sadly, that's pretty rare for
> reads on rotational media, and impossible with legacy IDE drives (that
> don't have the ability to do tagged queueing).
>
> So when I try to do IO in parallel (which git does support for many
> operations), that just makes the whole system come to a screeching halt
> because it now seeks around the disk a lot more. The same kind of issue
> that often kills parallelism on CPUs (bad cache behavior, and lots of
> outstanding memory requests) kills parallelism on disks too - disk
> performance simply is much _better_ if you do serial things than if you
> try to parallelize the same work.
>
> It would be different if I had a fancy high-end RAID system with tagged
> queueing and lots of spare bandwidth that could be used in parallel. But
> that's not what the git usage scenario usually is. All the people pushing
> multi-core seem to always ignore the big issues, and always work on
> nice trivial problems with a small and well-behaved "kernel" that has no
> IO and preferably doesn't even cache well when single-threaded (ie
> "streaming" data).

Do things change with SSDs? I've heard that even (especially??) with the 
Intel SSDs you want to have several operations going in parallel to get 
the best out of them.

David Lang


* Re: On data structures and parallelism
From: Linus Torvalds @ 2009-05-17 20:35 UTC
  To: david; +Cc: Heikki Orsila, git



On Sun, 17 May 2009, david@lang.hm wrote:
> 
> Do things change with SSDs? I've heard that even (especially??) with the Intel
> SSDs you want to have several operations going in parallel to get the best
> out of them.

There's a slight, but noticeable, improvement.

This is: "echo 3 > /proc/sys/vm/drop_caches; time git diff" run in a loop. 

With 'core.preloadindex = true':

	real	0m1.138s
	real	0m1.116s
	real	0m1.132s
	real	0m1.120s
	real	0m1.106s
	real	0m1.132s

and with it set to 'false':

	real	0m1.256s
	real	0m1.258s
	real	0m1.242s
	real	0m1.240s
	real	0m1.244s
	real	0m1.242s

so it's about a 10% improvement. Which is pretty good, considering 
that

 (a) those disks are fast enough that even for that totally cache-cold 
     case, I get about 35% CPU utilization for the single-threaded case.

     And that's despite this being a 3.2GHz Nehalem box, so 35% CPU is 
     really quite remarkably good. On my (much slower) laptop with a 
     1.2GHz Core 2, I get 2-3% CPU time (and the whole operation takes 
     20 seconds).

 (b) Not all the IO ends up being parallelized: there is a per-directory 
     mutex, so even though we start 20 threads, we get much less real 
     parallelism than that due to the locking (see the sketch below).
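
Roughly the shape of what core.preloadindex does, much simplified (the 
real code walks index entries, not bare path lists, and the 
per-directory serialization happens in the kernel's lookup path, not 
in git):

    #include <pthread.h>
    #include <sys/stat.h>

    struct slice {
            const char **paths;  /* this thread's share of the index */
            int nr;
    };

    /* Each thread lstat()s its slice so the later single-threaded
     * pass finds the dentry/inode caches hot.  Threads touching files
     * in the same directory still contend on that directory's mutex
     * in the kernel, which caps the real parallelism. */
    static void *preload_slice(void *arg)
    {
            struct slice *s = arg;
            struct stat st;
            int i;

            for (i = 0; i < s->nr; i++)
                    lstat(s->paths[i], &st);
            return NULL;
    }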

In general, the IO parallelization obviously helps most when the IO is 
slow _and_ overlaps perfectly. Perfect overlap doesn't end up happening 
due to the per-directory lookup semaphore (think of it like a bank 
conflict in trying to parallelize memory accesses), but with a slow NFS 
connection you should get reasonably close to that optimal situation.

But with a single spindle and rotating media, there really is sadly very 
little room for optimization. I suspect a SATA disk with TCQ might be 
able to do _somewhat_ better than my old PATA-only laptop (discounting 
the fact that my PATA laptop hard disk is extra slow due to being just 
4200rpm: any desktop disk will be much faster), but I doubt the index 
preloading is really all that noticeable.

In fact, I just tested on another machine, and saw no difference 
whatsoever. If anything, it was slightly slower. I suspect TCQ is a 
bigger win with writes.

			Linus

