All of lore.kernel.org
 help / color / mirror / Atom feed
* Make the git codebase thread-safe
@ 2014-02-12  1:54 Stefan Zager
  2014-02-12  2:02 ` Robin H. Johnson
                   ` (4 more replies)
  0 siblings, 5 replies; 55+ messages in thread
From: Stefan Zager @ 2014-02-12  1:54 UTC (permalink / raw)
  To: git

We in the chromium project have a keen interest in adding threading to
git in the pursuit of performance for lengthy operations (checkout,
status, blame, ...).  Our motivation comes from hitting some
performance walls when working with repositories the size of chromium
and blink:

https://chromium.googlesource.com/chromium/src
https://chromium.googlesource.com/chromium/blink

We are particularly concerned with the performance of msysgit, and we
have already chalked up a significant performance gain by turning on
the threading code in pack-objects (which was already enabled for
posix platforms, but not on msysgit, owing to the lack of a correct
pread implementation).

To this end, I'd like to start submitting patches that make the code
base generally more thread-safe and thread-friendly.  Right after this
email, I'm going to send the first such patch, which makes the global
list of pack files (packed_git) internal to sha1_file.c.

I realize this may be a contentious topic, and I'd like to get
feedback on the general effort to add more threading to git.  I'd
appreciate any feedback you'd like to give up front.

Thanks!

Stefan Zager

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  1:54 Make the git codebase thread-safe Stefan Zager
@ 2014-02-12  2:02 ` Robin H. Johnson
  2014-02-12  3:43   ` Duy Nguyen
  2014-02-12  2:11 ` Duy Nguyen
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Robin H. Johnson @ 2014-02-12  2:02 UTC (permalink / raw)
  To: Stefan Zager, Git Mailing List

On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
> We in the chromium project have a keen interest in adding threading to
> git in the pursuit of performance for lengthy operations (checkout,
> status, blame, ...).  Our motivation comes from hitting some
> performance walls when working with repositories the size of chromium
> and blink:
+1 from Gentoo on performance improvements for large repos.

The main repository in the ongoing Git migration project looks to be in
the 1.5GB range (and for those that want to propose splitting it up, we
have explored that option and found it lacking), with very deep history
(but no branches of note, and very few tags).

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  1:54 Make the git codebase thread-safe Stefan Zager
  2014-02-12  2:02 ` Robin H. Johnson
@ 2014-02-12  2:11 ` Duy Nguyen
  2014-02-12 18:12   ` Stefan Zager
  2014-02-12 11:59 ` Erik Faye-Lund
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Duy Nguyen @ 2014-02-12  2:11 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Git Mailing List

On Wed, Feb 12, 2014 at 8:54 AM, Stefan Zager <szager@chromium.org> wrote:
> We in the chromium project have a keen interest in adding threading to
> git in the pursuit of performance for lengthy operations (checkout,
> status, blame, ...).  Our motivation comes from hitting some
> performance walls when working with repositories the size of chromium
> and blink:
>
> https://chromium.googlesource.com/chromium/src
> https://chromium.googlesource.com/chromium/blink

I have no comments about thread safety improvements (well, not yet).
If you have investigated about git performance on chromium
repositories, could you please sum it up? Threading may be an option
to improve performance, but it's probably not the only option.
-- 
Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  2:02 ` Robin H. Johnson
@ 2014-02-12  3:43   ` Duy Nguyen
  2014-02-12 11:00     ` Karsten Blees
  2014-02-12 18:15     ` Stefan Zager
  0 siblings, 2 replies; 55+ messages in thread
From: Duy Nguyen @ 2014-02-12  3:43 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Stefan Zager, Git Mailing List

On Wed, Feb 12, 2014 at 9:02 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
>> We in the chromium project have a keen interest in adding threading to
>> git in the pursuit of performance for lengthy operations (checkout,
>> status, blame, ...).  Our motivation comes from hitting some
>> performance walls when working with repositories the size of chromium
>> and blink:
> +1 from Gentoo on performance improvements for large repos.
>
> The main repository in the ongoing Git migration project looks to be in
> the 1.5GB range (and for those that want to propose splitting it up, we
> have explored that option and found it lacking), with very deep history
> (but no branches of note, and very few tags).

>From v1.9 shallow clone should work for all push/pull/clone... so
history depth does not matter (on the client side). As for
gentoo-x86's large worktree, using index v4 and avoid full-tree
operations (e.g. "status .", not "status"..) should make all
operations reasonably fast. I plan to make "status" fast even without
path limiting with the help of inotify, but that's not going to be
finished soon. Did I miss anything else?
-- 
Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  3:43   ` Duy Nguyen
@ 2014-02-12 11:00     ` Karsten Blees
  2014-02-12 23:03       ` Mike Hommey
  2014-02-12 18:15     ` Stefan Zager
  1 sibling, 1 reply; 55+ messages in thread
From: Karsten Blees @ 2014-02-12 11:00 UTC (permalink / raw)
  To: Duy Nguyen, Robin H. Johnson, Stefan Zager; +Cc: Git Mailing List

Am 12.02.2014 04:43, schrieb Duy Nguyen:
> On Wed, Feb 12, 2014 at 9:02 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
>> On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
>>> We in the chromium project have a keen interest in adding threading to
>>> git in the pursuit of performance for lengthy operations (checkout,
>>> status, blame, ...).  Our motivation comes from hitting some
>>> performance walls when working with repositories the size of chromium
>>> and blink:
>> +1 from Gentoo on performance improvements for large repos.
>>
>> The main repository in the ongoing Git migration project looks to be in
>> the 1.5GB range (and for those that want to propose splitting it up, we
>> have explored that option and found it lacking), with very deep history
>> (but no branches of note, and very few tags).
> 
> From v1.9 shallow clone should work for all push/pull/clone... so
> history depth does not matter (on the client side). As for
> gentoo-x86's large worktree, using index v4 and avoid full-tree
> operations (e.g. "status .", not "status"..) should make all
> operations reasonably fast. I plan to make "status" fast even without
> path limiting with the help of inotify, but that's not going to be
> finished soon. Did I miss anything else?
> 

Regarding git-status on msysgit, enable core.preloadindex and core.fscache (as of 1.8.5.2).

There's no inotify on Windows, and I gave up using ReadDirectoryChangesW to keep fscache up to date, as it _may_ report DOS file names (e.g. C:\PROGRA~1 instead of C:\Program Files).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  1:54 Make the git codebase thread-safe Stefan Zager
  2014-02-12  2:02 ` Robin H. Johnson
  2014-02-12  2:11 ` Duy Nguyen
@ 2014-02-12 11:59 ` Erik Faye-Lund
  2014-02-12 18:20   ` Stefan Zager
  2014-02-13  1:42 ` brian m. carlson
  2019-04-02  0:52 ` Matheus Tavares
  4 siblings, 1 reply; 55+ messages in thread
From: Erik Faye-Lund @ 2014-02-12 11:59 UTC (permalink / raw)
  To: Stefan Zager; +Cc: GIT Mailing-list

On Wed, Feb 12, 2014 at 2:54 AM, Stefan Zager <szager@chromium.org> wrote:
> We in the chromium project have a keen interest in adding threading to
> git in the pursuit of performance for lengthy operations (checkout,
> status, blame, ...).  Our motivation comes from hitting some
> performance walls when working with repositories the size of chromium
> and blink:
>
> https://chromium.googlesource.com/chromium/src
> https://chromium.googlesource.com/chromium/blink
>
> We are particularly concerned with the performance of msysgit, and we
> have already chalked up a significant performance gain by turning on
> the threading code in pack-objects (which was already enabled for
> posix platforms, but not on msysgit, owing to the lack of a correct
> pread implementation).

How did you manage to do this? I'm not aware of any way to implement
pread on Windows (without going down the insanity-path of wrapping and
potentially locking inside every IO operation)...

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  2:11 ` Duy Nguyen
@ 2014-02-12 18:12   ` Stefan Zager
  2014-02-12 18:33     ` Matthieu Moy
                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 18:12 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Stefan Zager, Git Mailing List

On Tue, Feb 11, 2014 at 6:11 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>
> I have no comments about thread safety improvements (well, not yet).
> If you have investigated about git performance on chromium
> repositories, could you please sum it up? Threading may be an option
> to improve performance, but it's probably not the only option.

Well, the painful operations that we use frequently are pack-objects,
checkout, status, and blame.  Anything on Windows that touches a lot
of files is miserable due to the usual file system slowness on
Windows, and luafv.sys (the UAC file virtualization driver) seems to
make it much worse.

With threading turned on, pack-objects on Windows now takes about
twice as long as on Linux, which is still more than a 2x improvement
over the non-threaded operation.

Checkout is really bad on Windows.  The blink repository is ~200K
files, and a full clean checkout from the index takes about 10 seconds
on Linux, and about 3:30 on Windows.  I used the Very Sleepy profiler
to see where all the time was spent on Windows: 55% of the time was
spent in OpenFile, and 25% in CloseFile (both in win32).  My immediate
goal is to add threading to checkout, so those file system calls can
be done in parallel.

Enabling the fscache speeds up status quite a bit.  I'm optimistic
that parallelizing the stat calls will yield a further improvement.
Beyond that, it may not be possible to do much more without using a
file system watcher daemon, like facebook does with mercurial.
(https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/)

Blame is something that chromium and blink developers use heavily, and
it is not unusual for a blame invocation on the blink repository to
run for 30 seconds.  It seems like it should be possible to
parallelize blame, but it requires pack file operations to be
thread-safe.

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  3:43   ` Duy Nguyen
  2014-02-12 11:00     ` Karsten Blees
@ 2014-02-12 18:15     ` Stefan Zager
  1 sibling, 0 replies; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 18:15 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Robin H. Johnson, Stefan Zager, Git Mailing List

On Tue, Feb 11, 2014 at 7:43 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>
> From v1.9 shallow clone should work for all push/pull/clone... so
> history depth does not matter (on the client side). As for
> gentoo-x86's large worktree, using index v4 and avoid full-tree
> operations (e.g. "status .", not "status"..) should make all
> operations reasonably fast. I plan to make "status" fast even without
> path limiting with the help of inotify, but that's not going to be
> finished soon. Did I miss anything else?

Chromium developers frequently want to run status over their entire
checkout, and a lot of them run 'git commit -a'.  We want to do
everything possible to speed this up.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 11:59 ` Erik Faye-Lund
@ 2014-02-12 18:20   ` Stefan Zager
  2014-02-12 18:27     ` Erik Faye-Lund
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 18:20 UTC (permalink / raw)
  To: kusmabite; +Cc: Stefan Zager, GIT Mailing-list

On Wed, Feb 12, 2014 at 3:59 AM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
> On Wed, Feb 12, 2014 at 2:54 AM, Stefan Zager <szager@chromium.org> wrote:
>>
>> We are particularly concerned with the performance of msysgit, and we
>> have already chalked up a significant performance gain by turning on
>> the threading code in pack-objects (which was already enabled for
>> posix platforms, but not on msysgit, owing to the lack of a correct
>> pread implementation).
>
> How did you manage to do this? I'm not aware of any way to implement
> pread on Windows (without going down the insanity-path of wrapping and
> potentially locking inside every IO operation)...

I don't want to steal the thunder of my coworker, who wrote the
implementation.  He plans to submit it upstream soon-ish.  It relies
on using the lpOverlapped argument to ReadFile(), with some additional
tomfoolery to make sure that the implicit position pointer for the
file descriptor doesn't get modified.

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:20   ` Stefan Zager
@ 2014-02-12 18:27     ` Erik Faye-Lund
  2014-02-12 18:34       ` Stefan Zager
  0 siblings, 1 reply; 55+ messages in thread
From: Erik Faye-Lund @ 2014-02-12 18:27 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Stefan Zager, GIT Mailing-list

On Wed, Feb 12, 2014 at 7:20 PM, Stefan Zager <szager@google.com> wrote:
> On Wed, Feb 12, 2014 at 3:59 AM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>> On Wed, Feb 12, 2014 at 2:54 AM, Stefan Zager <szager@chromium.org> wrote:
>>>
>>> We are particularly concerned with the performance of msysgit, and we
>>> have already chalked up a significant performance gain by turning on
>>> the threading code in pack-objects (which was already enabled for
>>> posix platforms, but not on msysgit, owing to the lack of a correct
>>> pread implementation).
>>
>> How did you manage to do this? I'm not aware of any way to implement
>> pread on Windows (without going down the insanity-path of wrapping and
>> potentially locking inside every IO operation)...
>
> I don't want to steal the thunder of my coworker, who wrote the
> implementation.  He plans to submit it upstream soon-ish.  It relies
> on using the lpOverlapped argument to ReadFile(), with some additional
> tomfoolery to make sure that the implicit position pointer for the
> file descriptor doesn't get modified.

Is the code available somewhere? I'm especially interested in the
"additional tomfoolery to make sure that the implicit position pointer
for the file descriptor doesn't get modified"-part, as this was what I
ended up butting my head into when trying to do this myself.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:12   ` Stefan Zager
@ 2014-02-12 18:33     ` Matthieu Moy
  2014-02-12 18:39       ` Stefan Zager
  2014-02-12 18:50     ` David Kastrup
  2014-02-12 20:06     ` Junio C Hamano
  2 siblings, 1 reply; 55+ messages in thread
From: Matthieu Moy @ 2014-02-12 18:33 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Duy Nguyen, Git Mailing List

Stefan Zager <szager@chromium.org> writes:

> I'm optimistic that parallelizing the stat calls will yield a further
> improvement.

It has already been mentionned in the thread, but in case you overlooked
it: did you look at core.preloadindex? It seems at least very close to
what you want.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:27     ` Erik Faye-Lund
@ 2014-02-12 18:34       ` Stefan Zager
  2014-02-12 18:37         ` Erik Faye-Lund
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 18:34 UTC (permalink / raw)
  To: kusmabite; +Cc: Stefan Zager, GIT Mailing-list

On Wed, Feb 12, 2014 at 10:27 AM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
> On Wed, Feb 12, 2014 at 7:20 PM, Stefan Zager <szager@google.com> wrote:
>>
>> I don't want to steal the thunder of my coworker, who wrote the
>> implementation.  He plans to submit it upstream soon-ish.  It relies
>> on using the lpOverlapped argument to ReadFile(), with some additional
>> tomfoolery to make sure that the implicit position pointer for the
>> file descriptor doesn't get modified.
>
> Is the code available somewhere? I'm especially interested in the
> "additional tomfoolery to make sure that the implicit position pointer
> for the file descriptor doesn't get modified"-part, as this was what I
> ended up butting my head into when trying to do this myself.

https://chromium-review.googlesource.com/#/c/186104/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:34       ` Stefan Zager
@ 2014-02-12 18:37         ` Erik Faye-Lund
  2014-02-12 19:22           ` Karsten Blees
  0 siblings, 1 reply; 55+ messages in thread
From: Erik Faye-Lund @ 2014-02-12 18:37 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Stefan Zager, GIT Mailing-list

On Wed, Feb 12, 2014 at 7:34 PM, Stefan Zager <szager@google.com> wrote:
> On Wed, Feb 12, 2014 at 10:27 AM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>> On Wed, Feb 12, 2014 at 7:20 PM, Stefan Zager <szager@google.com> wrote:
>>>
>>> I don't want to steal the thunder of my coworker, who wrote the
>>> implementation.  He plans to submit it upstream soon-ish.  It relies
>>> on using the lpOverlapped argument to ReadFile(), with some additional
>>> tomfoolery to make sure that the implicit position pointer for the
>>> file descriptor doesn't get modified.
>>
>> Is the code available somewhere? I'm especially interested in the
>> "additional tomfoolery to make sure that the implicit position pointer
>> for the file descriptor doesn't get modified"-part, as this was what I
>> ended up butting my head into when trying to do this myself.
>
> https://chromium-review.googlesource.com/#/c/186104/

ReOpenFile, that's fantastic. Thanks a lot!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:33     ` Matthieu Moy
@ 2014-02-12 18:39       ` Stefan Zager
  0 siblings, 0 replies; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 18:39 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Stefan Zager, Duy Nguyen, Git Mailing List

On Wed, Feb 12, 2014 at 10:33 AM, Matthieu Moy
<Matthieu.Moy@grenoble-inp.fr> wrote:
> Stefan Zager <szager@chromium.org> writes:
>
>> I'm optimistic that parallelizing the stat calls will yield a further
>> improvement.
>
> It has already been mentionned in the thread, but in case you overlooked
> it: did you look at core.preloadindex? It seems at least very close to
> what you want.

Ah yes, sorry, I overlooked that.  We have indeed turned on
core.preloadindex, and it does indeed speed up status.  That speedup
is reflected in my previous comments about our observations working
with chromium and blink.

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:12   ` Stefan Zager
  2014-02-12 18:33     ` Matthieu Moy
@ 2014-02-12 18:50     ` David Kastrup
  2014-02-12 19:02       ` Stefan Zager
  2014-02-12 20:06     ` Junio C Hamano
  2 siblings, 1 reply; 55+ messages in thread
From: David Kastrup @ 2014-02-12 18:50 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Duy Nguyen, Git Mailing List

Stefan Zager <szager@chromium.org> writes:

> On Tue, Feb 11, 2014 at 6:11 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>>
>> I have no comments about thread safety improvements (well, not yet).
>> If you have investigated about git performance on chromium
>> repositories, could you please sum it up? Threading may be an option
>> to improve performance, but it's probably not the only option.
>
> Well, the painful operations that we use frequently are pack-objects,
> checkout, status, and blame.

Have you checked the patch in
<URL:http://thread.gmane.org/gmane.comp.version-control.git/241448> and
followups,
Message-ID: <1391454849-26558-1-git-send-email-dak@gnu.org>?

While this does not yet support -M and -C options, it's conceivable that
you don't use them in your server/scripts.

> Anything on Windows that touches a lot of files is miserable due to
> the usual file system slowness on Windows, and luafv.sys (the UAC file
> virtualization driver) seems to make it much worse.

There is an obvious solution here...  Dedicated hardware is not that
expensive.  Virtualization will always have a price.

> Blame is something that chromium and blink developers use heavily, and
> it is not unusual for a blame invocation on the blink repository to
> run for 30 seconds.  It seems like it should be possible to
> parallelize blame, but it requires pack file operations to be
> thread-safe.

Really, give the above patch a try.  I am taking longer to finish it
than anticipated (with a lot due to procrastination but that is,
unfortunately, a large part of my workflow), and it's cutting into my
"paychecks" (voluntary donations which to a good degree depend on timely
and nontrivial progress reports for my freely available work on GNU
LilyPond).

Note that it looks like the majority of the remaining time on GNU/Linux
tends to be spent in system time: I/O time, memory management.  And I
have an SSD drive.  When using packed repositories of considerable size,
decompression comes into play as well.  I don't think that you can hope
to get noticeably higher I/O throughput by multithreading, so really,
really, really consider dedicated hardware running on a native Linux
file system.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:50     ` David Kastrup
@ 2014-02-12 19:02       ` Stefan Zager
  2014-02-12 19:15         ` David Kastrup
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 19:02 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stefan Zager, Duy Nguyen, Git Mailing List

On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <dak@gnu.org> wrote:
> Stefan Zager <szager@chromium.org> writes:
>
>> Anything on Windows that touches a lot of files is miserable due to
>> the usual file system slowness on Windows, and luafv.sys (the UAC file
>> virtualization driver) seems to make it much worse.
>
> There is an obvious solution here...  Dedicated hardware is not that
> expensive.  Virtualization will always have a price.

Not sure I follow you.  We need to support people developing,
building, and testing on natively Windows machines.  And we need to
support users with reasonable hardware, including spinning disks.  If
we were only interested in optimizing for Google employees, each of
whom has one or more small nuclear reactors under their desk, this
would be easy.

>> Blame is something that chromium and blink developers use heavily, and
>> it is not unusual for a blame invocation on the blink repository to
>> run for 30 seconds.  It seems like it should be possible to
>> parallelize blame, but it requires pack file operations to be
>> thread-safe.
>
> Really, give the above patch a try.  I am taking longer to finish it
> than anticipated (with a lot due to procrastination but that is,
> unfortunately, a large part of my workflow), and it's cutting into my
> "paychecks" (voluntary donations which to a good degree depend on timely
> and nontrivial progress reports for my freely available work on GNU
> LilyPond).

I will give that a try.  How much of a performance improvement have you clocked?

> Note that it looks like the majority of the remaining time on GNU/Linux
> tends to be spent in system time: I/O time, memory management.  And I
> have an SSD drive.  When using packed repositories of considerable size,
> decompression comes into play as well.  I don't think that you can hope
> to get noticeably higher I/O throughput by multithreading, so really,
> really, really consider dedicated hardware running on a native Linux
> file system.

I have a background in hardware, and I have much more faith in modern
disk schedulers :)

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 19:02       ` Stefan Zager
@ 2014-02-12 19:15         ` David Kastrup
  2014-02-12 23:09           ` Mike Hommey
  2014-02-13  8:30           ` David Kastrup
  0 siblings, 2 replies; 55+ messages in thread
From: David Kastrup @ 2014-02-12 19:15 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Duy Nguyen, Git Mailing List

Stefan Zager <szager@chromium.org> writes:

> On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <dak@gnu.org> wrote:
>
>> Really, give the above patch a try.  I am taking longer to finish it
>> than anticipated (with a lot due to procrastination but that is,
>> unfortunately, a large part of my workflow), and it's cutting into my
>> "paychecks" (voluntary donations which to a good degree depend on timely
>> and nontrivial progress reports for my freely available work on GNU
>> LilyPond).
>
> I will give that a try.  How much of a performance improvement have
> you clocked?

Depends on file type and size.  With large files with lots of small
changes, performance improvements get more impressive.

Some ugly real-world examples are the Emacs repository, src/xdisp.c
(performance improvement about a factor of 3), a large file in the style
of /usr/share/dict/words clocking in at a factor of about 5.

Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
are no improvements in system time (I/O) except for patch 4 of the
series which helps perhaps 20% or so.

So the benefits of the patch will come into play mostly for big, bad
files on Windows: other than that, the I/O time is likely to be the
dominant player anyway.

If you have benchmarked the stuff, for annoying cases expect I/O time to
go down maybe 10-20%, and user time to drop by a factor of 4.  Under
GNU/Linux, that makes for a significant overall improvement.  On
Windows, the payback is likely quite less because of the worse I/O
performance.  Pity.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:37         ` Erik Faye-Lund
@ 2014-02-12 19:22           ` Karsten Blees
  2014-02-12 19:30             ` Stefan Zager
  2014-02-13 18:38             ` Zachary Turner
  0 siblings, 2 replies; 55+ messages in thread
From: Karsten Blees @ 2014-02-12 19:22 UTC (permalink / raw)
  To: kusmabite, Stefan Zager; +Cc: Stefan Zager, GIT Mailing-list

Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
> On Wed, Feb 12, 2014 at 7:34 PM, Stefan Zager <szager@google.com> wrote:
>> On Wed, Feb 12, 2014 at 10:27 AM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>>> On Wed, Feb 12, 2014 at 7:20 PM, Stefan Zager <szager@google.com> wrote:
>>>>
>>>> I don't want to steal the thunder of my coworker, who wrote the
>>>> implementation.  He plans to submit it upstream soon-ish.  It relies
>>>> on using the lpOverlapped argument to ReadFile(), with some additional
>>>> tomfoolery to make sure that the implicit position pointer for the
>>>> file descriptor doesn't get modified.
>>>
>>> Is the code available somewhere? I'm especially interested in the
>>> "additional tomfoolery to make sure that the implicit position pointer
>>> for the file descriptor doesn't get modified"-part, as this was what I
>>> ended up butting my head into when trying to do this myself.
>>
>> https://chromium-review.googlesource.com/#/c/186104/
> 
> ReOpenFile, that's fantastic. Thanks a lot!

...but should be loaded dynamically via GetProcAddress, or are we ready to drop XP support?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 19:22           ` Karsten Blees
@ 2014-02-12 19:30             ` Stefan Zager
  2014-02-13  8:27               ` Johannes Sixt
  2014-02-13 18:38             ` Zachary Turner
  1 sibling, 1 reply; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 19:30 UTC (permalink / raw)
  To: Karsten Blees; +Cc: kusmabite, Stefan Zager, GIT Mailing-list

On Wed, Feb 12, 2014 at 11:22 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
>> On Wed, Feb 12, 2014 at 7:34 PM, Stefan Zager <szager@google.com> wrote:
>>> On Wed, Feb 12, 2014 at 10:27 AM, Erik Faye-Lund <kusmabite@gmail.com> wrote:
>>>> On Wed, Feb 12, 2014 at 7:20 PM, Stefan Zager <szager@google.com> wrote:
>>>>>
>>>>> I don't want to steal the thunder of my coworker, who wrote the
>>>>> implementation.  He plans to submit it upstream soon-ish.  It relies
>>>>> on using the lpOverlapped argument to ReadFile(), with some additional
>>>>> tomfoolery to make sure that the implicit position pointer for the
>>>>> file descriptor doesn't get modified.
>>>>
>>>> Is the code available somewhere? I'm especially interested in the
>>>> "additional tomfoolery to make sure that the implicit position pointer
>>>> for the file descriptor doesn't get modified"-part, as this was what I
>>>> ended up butting my head into when trying to do this myself.
>>>
>>> https://chromium-review.googlesource.com/#/c/186104/
>>
>> ReOpenFile, that's fantastic. Thanks a lot!
>
> ...but should be loaded dynamically via GetProcAddress, or are we ready to drop XP support?

Right, that is an issue.  From our perspective, it's well past time to
drop XP support.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 18:12   ` Stefan Zager
  2014-02-12 18:33     ` Matthieu Moy
  2014-02-12 18:50     ` David Kastrup
@ 2014-02-12 20:06     ` Junio C Hamano
  2014-02-12 20:27       ` Stefan Zager
  2 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2014-02-12 20:06 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Duy Nguyen, Git Mailing List

Stefan Zager <szager@chromium.org> writes:

> ...  I used the Very Sleepy profiler
> to see where all the time was spent on Windows: 55% of the time was
> spent in OpenFile, and 25% in CloseFile (both in win32).

This is somewhat interesting.

When we check things out, checkout_paths() has a list of paths to be
checked out, and iterates over them and call checkout_entry().

I wonder if you can:

 - introduce a version of checkout_entry() that takes file
   descriptors to write to;

 - have an asynchronous helper threads that pre-open the paths to be
   written out and feed <ce, file descriptor to be written> to a
   queue;

 - restructure that loop so that it reads the <ce, file descriptor
   to be written> from the queue, performs the actual writing out,
   and then feeds <file descriptor to be closed> to another queue; and

 - have another asynchronous helper threads that reads <file
   descriptor to be closed> from the queue and close them.

Calls to write (and preparation of data to be written) will then
remain single-threaded, but it sounds like that codepath is not the
bottleneck in your measurement, so....

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 20:06     ` Junio C Hamano
@ 2014-02-12 20:27       ` Stefan Zager
  2014-02-12 23:05         ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Zager @ 2014-02-12 20:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Stefan Zager, Duy Nguyen, Git Mailing List

On Wed, Feb 12, 2014 at 12:06 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Zager <szager@chromium.org> writes:
>
>> ...  I used the Very Sleepy profiler
>> to see where all the time was spent on Windows: 55% of the time was
>> spent in OpenFile, and 25% in CloseFile (both in win32).
>
> This is somewhat interesting.
>
> When we check things out, checkout_paths() has a list of paths to be
> checked out, and iterates over them and call checkout_entry().
>
> I wonder if you can:
>
>  - introduce a version of checkout_entry() that takes file
>    descriptors to write to;
>
>  - have an asynchronous helper threads that pre-open the paths to be
>    written out and feed <ce, file descriptor to be written> to a
>    queue;
>
>  - restructure that loop so that it reads the <ce, file descriptor
>    to be written> from the queue, performs the actual writing out,
>    and then feeds <file descriptor to be closed> to another queue; and
>
>  - have another asynchronous helper threads that reads <file
>    descriptor to be closed> from the queue and close them.
>
> Calls to write (and preparation of data to be written) will then
> remain single-threaded, but it sounds like that codepath is not the
> bottleneck in your measurement, so....

Yes, I considered that as well.  At a minimum, that would still
require attr.c to implement thread locking, since attribute files must
be parsed to look for stream filters.  I have already done that work.

But I'm not sure it's the best long-term approach to add convoluted
custom threading solutions to each git operation as it appears on the
performance radar.  I'm hoping to make the entire code base more
thread-friendly, so that threading can be added in a more natural and
idiomatic (and less painful) way.

For example, the most natural way to add threading to checkout would
be in the loops over the index in check_updates() in unpack-trees.c.
If attr.c and sha1_file.c were thread-safe, then it would be possible
to thread checkout entirely in check_updates(), with a pretty compact
code change.  I have already done the work in attr.c; sha1_file.c is
hairier, but do-able.

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 11:00     ` Karsten Blees
@ 2014-02-12 23:03       ` Mike Hommey
  2014-02-13  0:06         ` Karsten Blees
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Hommey @ 2014-02-12 23:03 UTC (permalink / raw)
  To: Karsten Blees
  Cc: Duy Nguyen, Robin H. Johnson, Stefan Zager, Git Mailing List

On Wed, Feb 12, 2014 at 12:00:19PM +0100, Karsten Blees wrote:
> Am 12.02.2014 04:43, schrieb Duy Nguyen:
> > On Wed, Feb 12, 2014 at 9:02 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> >> On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
> >>> We in the chromium project have a keen interest in adding threading to
> >>> git in the pursuit of performance for lengthy operations (checkout,
> >>> status, blame, ...).  Our motivation comes from hitting some
> >>> performance walls when working with repositories the size of chromium
> >>> and blink:
> >> +1 from Gentoo on performance improvements for large repos.
> >>
> >> The main repository in the ongoing Git migration project looks to be in
> >> the 1.5GB range (and for those that want to propose splitting it up, we
> >> have explored that option and found it lacking), with very deep history
> >> (but no branches of note, and very few tags).
> > 
> > From v1.9 shallow clone should work for all push/pull/clone... so
> > history depth does not matter (on the client side). As for
> > gentoo-x86's large worktree, using index v4 and avoid full-tree
> > operations (e.g. "status .", not "status"..) should make all
> > operations reasonably fast. I plan to make "status" fast even without
> > path limiting with the help of inotify, but that's not going to be
> > finished soon. Did I miss anything else?
> > 
> 
> Regarding git-status on msysgit, enable core.preloadindex and core.fscache (as of 1.8.5.2).
> 
> There's no inotify on Windows, and I gave up using ReadDirectoryChangesW to
> keep fscache up to date, as it _may_ report DOS file names (e.g. C:\PROGRA~1
> instead of C:\Program Files).

You can use GetLongPathNameW to get the latter from the former.

Mike

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 20:27       ` Stefan Zager
@ 2014-02-12 23:05         ` Junio C Hamano
  0 siblings, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2014-02-12 23:05 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Duy Nguyen, Git Mailing List

On Wed, Feb 12, 2014 at 12:27 PM, Stefan Zager <szager@chromium.org> wrote:
> On Wed, Feb 12, 2014 at 12:06 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Stefan Zager <szager@chromium.org> writes:
>>
>> Calls to write (and preparation of data to be written) will then
>> remain single-threaded, but it sounds like that codepath is not the
>> bottleneck in your measurement, so....
>
> Yes, I considered that as well.  At a minimum, that would still
> require attr.c to implement thread locking, since attribute files must
> be parsed to look for stream filters.  I have already done that work.

I would have imagined that use of the attribute system belongs to "write and
preparation of data to be written" category, i.e. the single threaded
part of the
kludge I outlined.

> But I'm not sure it's the best long-term approach to add convoluted
> custom threading solutions to each git operation as it appears on the
> performance radar.

Yeah, it depends on how clean and non-intrusive an abstraction we can make.
The kludge I outlined is certainly not very pretty.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 19:15         ` David Kastrup
@ 2014-02-12 23:09           ` Mike Hommey
  2014-02-13  6:04             ` David Kastrup
  2014-02-13  8:30           ` David Kastrup
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Hommey @ 2014-02-12 23:09 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stefan Zager, Duy Nguyen, Git Mailing List

On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> Stefan Zager <szager@chromium.org> writes:
> 
> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <dak@gnu.org> wrote:
> >
> >> Really, give the above patch a try.  I am taking longer to finish it
> >> than anticipated (with a lot due to procrastination but that is,
> >> unfortunately, a large part of my workflow), and it's cutting into my
> >> "paychecks" (voluntary donations which to a good degree depend on timely
> >> and nontrivial progress reports for my freely available work on GNU
> >> LilyPond).
> >
> > I will give that a try.  How much of a performance improvement have
> > you clocked?
> 
> Depends on file type and size.  With large files with lots of small
> changes, performance improvements get more impressive.
> 
> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> (performance improvement about a factor of 3), a large file in the style
> of /usr/share/dict/words clocking in at a factor of about 5.
> 
> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> are no improvements in system time (I/O) except for patch 4 of the
> series which helps perhaps 20% or so.
> 
> So the benefits of the patch will come into play mostly for big, bad
> files on Windows: other than that, the I/O time is likely to be the
> dominant player anyway.

How much fragmentation does that add to the files, though?

Mike

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 23:03       ` Mike Hommey
@ 2014-02-13  0:06         ` Karsten Blees
  0 siblings, 0 replies; 55+ messages in thread
From: Karsten Blees @ 2014-02-13  0:06 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Duy Nguyen, Robin H. Johnson, Stefan Zager, Git Mailing List

Am 13.02.2014 00:03, schrieb Mike Hommey:
> On Wed, Feb 12, 2014 at 12:00:19PM +0100, Karsten Blees wrote:
>> Am 12.02.2014 04:43, schrieb Duy Nguyen:
>>> On Wed, Feb 12, 2014 at 9:02 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
>>>> On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
>>>>> We in the chromium project have a keen interest in adding threading to
>>>>> git in the pursuit of performance for lengthy operations (checkout,
>>>>> status, blame, ...).  Our motivation comes from hitting some
>>>>> performance walls when working with repositories the size of chromium
>>>>> and blink:
>>>> +1 from Gentoo on performance improvements for large repos.
>>>>
>>>> The main repository in the ongoing Git migration project looks to be in
>>>> the 1.5GB range (and for those that want to propose splitting it up, we
>>>> have explored that option and found it lacking), with very deep history
>>>> (but no branches of note, and very few tags).
>>>
>>> From v1.9 shallow clone should work for all push/pull/clone... so
>>> history depth does not matter (on the client side). As for
>>> gentoo-x86's large worktree, using index v4 and avoid full-tree
>>> operations (e.g. "status .", not "status"..) should make all
>>> operations reasonably fast. I plan to make "status" fast even without
>>> path limiting with the help of inotify, but that's not going to be
>>> finished soon. Did I miss anything else?
>>>
>>
>> Regarding git-status on msysgit, enable core.preloadindex and core.fscache (as of 1.8.5.2).
>>
>> There's no inotify on Windows, and I gave up using ReadDirectoryChangesW to
>> keep fscache up to date, as it _may_ report DOS file names (e.g. C:\PROGRA~1
>> instead of C:\Program Files).
> 
> You can use GetLongPathNameW to get the latter from the former.
> 
> Mike
> 

Except if its a delete or rename notification...my final ReadDirectoryChangesW version cached the files by their long _and_ short names, but was so complex that it slowed most commands down rather than speeding them up :-)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  1:54 Make the git codebase thread-safe Stefan Zager
                   ` (2 preceding siblings ...)
  2014-02-12 11:59 ` Erik Faye-Lund
@ 2014-02-13  1:42 ` brian m. carlson
  2019-04-02  0:52 ` Matheus Tavares
  4 siblings, 0 replies; 55+ messages in thread
From: brian m. carlson @ 2014-02-13  1:42 UTC (permalink / raw)
  To: Stefan Zager; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 828 bytes --]

On Tue, Feb 11, 2014 at 05:54:51PM -0800, Stefan Zager wrote:
> To this end, I'd like to start submitting patches that make the code
> base generally more thread-safe and thread-friendly.  Right after this
> email, I'm going to send the first such patch, which makes the global
> list of pack files (packed_git) internal to sha1_file.c.

I'm definitely interested in this if it also works on POSIX systems.  At
work, we have a 7.6 GiB repo (packed)[0], so while performance is not
bad, I certainly wouldn't object if it were better.

[0] Using du -sh.  For comparison, the Linux kernel repo is 1.4 GiB.

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 23:09           ` Mike Hommey
@ 2014-02-13  6:04             ` David Kastrup
  2014-02-13  9:34               ` Mike Hommey
  0 siblings, 1 reply; 55+ messages in thread
From: David Kastrup @ 2014-02-13  6:04 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Stefan Zager, Duy Nguyen, Git Mailing List

Mike Hommey <mh@glandium.org> writes:

> On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
>> Stefan Zager <szager@chromium.org> writes:
>> 
>> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <dak@gnu.org> wrote:
>> >
>> >> Really, give the above patch a try.  I am taking longer to finish it
>> >> than anticipated (with a lot due to procrastination but that is,
>> >> unfortunately, a large part of my workflow), and it's cutting into my
>> >> "paychecks" (voluntary donations which to a good degree depend on timely
>> >> and nontrivial progress reports for my freely available work on GNU
>> >> LilyPond).
>> >
>> > I will give that a try.  How much of a performance improvement have
>> > you clocked?
>> 
>> Depends on file type and size.  With large files with lots of small
>> changes, performance improvements get more impressive.
>> 
>> Some ugly real-world examples are the Emacs repository, src/xdisp.c
>> (performance improvement about a factor of 3), a large file in the style
>> of /usr/share/dict/words clocking in at a factor of about 5.
>> 
>> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
>> are no improvements in system time (I/O) except for patch 4 of the
>> series which helps perhaps 20% or so.
>> 
>> So the benefits of the patch will come into play mostly for big, bad
>> files on Windows: other than that, the I/O time is likely to be the
>> dominant player anyway.
>
> How much fragmentation does that add to the files, though?

Uh, git-blame is a read-only operation.  It does not add fragmentation
to any file.  The patch will add a diff of probably a few dozen hunks to
builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
that I expect even

    git blame builtin/blame.c

to be faster than before.  But that interpretation of your question
probably tries to make too much sense out of what is just nonsense in
the given context.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 19:30             ` Stefan Zager
@ 2014-02-13  8:27               ` Johannes Sixt
  2014-02-13  8:38                 ` David Kastrup
  2014-02-13 18:40                 ` Stefan Zager
  0 siblings, 2 replies; 55+ messages in thread
From: Johannes Sixt @ 2014-02-13  8:27 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Karsten Blees, kusmabite, Stefan Zager, GIT Mailing-list

Am 2/12/2014 20:30, schrieb Stefan Zager:
> On Wed, Feb 12, 2014 at 11:22 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
>> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
>>> ReOpenFile, that's fantastic. Thanks a lot!
>>
>> ...but should be loaded dynamically via GetProcAddress, or are we ready to drop XP support?
> 
> Right, that is an issue.  From our perspective, it's well past time to
> drop XP support.

Not from mine.

-- Hannes

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 19:15         ` David Kastrup
  2014-02-12 23:09           ` Mike Hommey
@ 2014-02-13  8:30           ` David Kastrup
  1 sibling, 0 replies; 55+ messages in thread
From: David Kastrup @ 2014-02-13  8:30 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Duy Nguyen, Git Mailing List

David Kastrup <dak@gnu.org> writes:

> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> are no improvements in system time (I/O) except for patch 4 of the
> series which helps perhaps 20% or so.
>
> So the benefits of the patch will come into play mostly for big, bad
> files on Windows: other than that, the I/O time is likely to be the
> dominant player anyway.
>
> If you have benchmarked the stuff, for annoying cases expect I/O time
> to go down maybe 10-20%, and user time to drop by a factor of 4.
> Under GNU/Linux, that makes for a significant overall improvement.  On
> Windows, the payback is likely quite less because of the worse I/O
> performance.  Pity.

But of course, you can significantly reduce the relevant file
open/close/search times by running

    git gc --aggressive

While this does not actually help with performance in GNU/Linux (though
with file space), dealing with few but compressed files under Windows is
likely a reasonably big win since the uncompression happens in user
space and cannot be bungled by Microsoft (apart from bad memory
management strategies).

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13  8:27               ` Johannes Sixt
@ 2014-02-13  8:38                 ` David Kastrup
  2014-02-13 18:40                 ` Stefan Zager
  1 sibling, 0 replies; 55+ messages in thread
From: David Kastrup @ 2014-02-13  8:38 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Stefan Zager, Karsten Blees, kusmabite, Stefan Zager, GIT Mailing-list

Johannes Sixt <j.sixt@viscovery.net> writes:

> Am 2/12/2014 20:30, schrieb Stefan Zager:
>> On Wed, Feb 12, 2014 at 11:22 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
>>> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
>>>> ReOpenFile, that's fantastic. Thanks a lot!
>>>
>>> ...but should be loaded dynamically via GetProcAddress, or are we
>>> ready to drop XP support?
>> 
>> Right, that is an issue.  From our perspective, it's well past time to
>> drop XP support.
>
> Not from mine.

XP has not even reached end of life yet.  As a point of comparison,
there are tensions on the Emacs developer list several times a decade
because some people suggest it might be time to drop the MSDOS port
and/or the associated restriction of having filenames be unique in the
8+3 naming scheme.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13  6:04             ` David Kastrup
@ 2014-02-13  9:34               ` Mike Hommey
  2014-02-13  9:48                 ` Mike Hommey
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Hommey @ 2014-02-13  9:34 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stefan Zager, Duy Nguyen, Git Mailing List

On Thu, Feb 13, 2014 at 07:04:02AM +0100, David Kastrup wrote:
> Mike Hommey <mh@glandium.org> writes:
> 
> > On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> >> Stefan Zager <szager@chromium.org> writes:
> >> 
> >> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <dak@gnu.org> wrote:
> >> >
> >> >> Really, give the above patch a try.  I am taking longer to finish it
> >> >> than anticipated (with a lot due to procrastination but that is,
> >> >> unfortunately, a large part of my workflow), and it's cutting into my
> >> >> "paychecks" (voluntary donations which to a good degree depend on timely
> >> >> and nontrivial progress reports for my freely available work on GNU
> >> >> LilyPond).
> >> >
> >> > I will give that a try.  How much of a performance improvement have
> >> > you clocked?
> >> 
> >> Depends on file type and size.  With large files with lots of small
> >> changes, performance improvements get more impressive.
> >> 
> >> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> >> (performance improvement about a factor of 3), a large file in the style
> >> of /usr/share/dict/words clocking in at a factor of about 5.
> >> 
> >> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> >> are no improvements in system time (I/O) except for patch 4 of the
> >> series which helps perhaps 20% or so.
> >> 
> >> So the benefits of the patch will come into play mostly for big, bad
> >> files on Windows: other than that, the I/O time is likely to be the
> >> dominant player anyway.
> >
> > How much fragmentation does that add to the files, though?
> 
> Uh, git-blame is a read-only operation.  It does not add fragmentation
> to any file.  The patch will add a diff of probably a few dozen hunks to
> builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
> that I expect even
> 
>     git blame builtin/blame.c
> 
> to be faster than before.  But that interpretation of your question
> probably tries to make too much sense out of what is just nonsense in
> the given context.

Sorry, I thought you were talking about write operations, not reads.

Mike

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13  9:34               ` Mike Hommey
@ 2014-02-13  9:48                 ` Mike Hommey
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Hommey @ 2014-02-13  9:48 UTC (permalink / raw)
  To: David Kastrup; +Cc: Stefan Zager, Duy Nguyen, Git Mailing List

On Thu, Feb 13, 2014 at 06:34:39PM +0900, Mike Hommey wrote:
> On Thu, Feb 13, 2014 at 07:04:02AM +0100, David Kastrup wrote:
> > Mike Hommey <mh@glandium.org> writes:
> > 
> > > On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> > >> Stefan Zager <szager@chromium.org> writes:
> > >> 
> > >> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <dak@gnu.org> wrote:
> > >> >
> > >> >> Really, give the above patch a try.  I am taking longer to finish it
> > >> >> than anticipated (with a lot due to procrastination but that is,
> > >> >> unfortunately, a large part of my workflow), and it's cutting into my
> > >> >> "paychecks" (voluntary donations which to a good degree depend on timely
> > >> >> and nontrivial progress reports for my freely available work on GNU
> > >> >> LilyPond).
> > >> >
> > >> > I will give that a try.  How much of a performance improvement have
> > >> > you clocked?
> > >> 
> > >> Depends on file type and size.  With large files with lots of small
> > >> changes, performance improvements get more impressive.
> > >> 
> > >> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> > >> (performance improvement about a factor of 3), a large file in the style
> > >> of /usr/share/dict/words clocking in at a factor of about 5.
> > >> 
> > >> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> > >> are no improvements in system time (I/O) except for patch 4 of the
> > >> series which helps perhaps 20% or so.
> > >> 
> > >> So the benefits of the patch will come into play mostly for big, bad
> > >> files on Windows: other than that, the I/O time is likely to be the
> > >> dominant player anyway.
> > >
> > > How much fragmentation does that add to the files, though?
> > 
> > Uh, git-blame is a read-only operation.  It does not add fragmentation
> > to any file.  The patch will add a diff of probably a few dozen hunks to
> > builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
> > that I expect even
> > 
> >     git blame builtin/blame.c
> > 
> > to be faster than before.  But that interpretation of your question
> > probably tries to make too much sense out of what is just nonsense in
> > the given context.
> 
> Sorry, I thought you were talking about write operations, not reads.

Specifically, I thought you were talking about git checkout.

Mike

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12 19:22           ` Karsten Blees
  2014-02-12 19:30             ` Stefan Zager
@ 2014-02-13 18:38             ` Zachary Turner
  2014-02-13 22:51               ` Karsten Blees
  1 sibling, 1 reply; 55+ messages in thread
From: Zachary Turner @ 2014-02-13 18:38 UTC (permalink / raw)
  To: git

Karsten Blees <karsten.blees <at> gmail.com> writes:

> 
> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
> > On Wed, Feb 12, 2014 at 7:34 PM, Stefan Zager <szager <at> google.com> 
wrote:
> >> On Wed, Feb 12, 2014 at 10:27 AM, Erik Faye-Lund <kusmabite <at> 
gmail.com> wrote:
> >>> On Wed, Feb 12, 2014 at 7:20 PM, Stefan Zager <szager <at> google.com> 
wrote:
> >>>>
> >>>> I don't want to steal the thunder of my coworker, who wrote the
> >>>> implementation.  He plans to submit it upstream soon-ish.  It relies
> >>>> on using the lpOverlapped argument to ReadFile(), with some 
additional
> >>>> tomfoolery to make sure that the implicit position pointer for the
> >>>> file descriptor doesn't get modified.
> >>>
> >>> Is the code available somewhere? I'm especially interested in the
> >>> "additional tomfoolery to make sure that the implicit position pointer
> >>> for the file descriptor doesn't get modified"-part, as this was what I
> >>> ended up butting my head into when trying to do this myself.
> >>
> >> https://chromium-review.googlesource.com/#/c/186104/
> > 
> > ReOpenFile, that's fantastic. Thanks a lot!
> 
> ...but should be loaded dynamically via GetProcAddress, or are we ready to 
drop XP support?
> 
> 

Original patch author here.  In trying to prepare this patch to use 
GetProcAddress to load dynamically, I've run into a bit of a snag.  
NO_THREAD_SAFE_PREAD is a compile-time flag, which will be incompatible with 
any attempt to make this a runtime decision a la LoadLibrary / 
GetProcAddress.  On XP, we would need to fallback to the single-threaded 
path, and on Vista+ we would use the thread-able path, and obviously this 
decision could not be made until runtime.

If MinGW were the only configuration using NO_THREAD_SAFE_PREAD, I would 
just remove it entirely, but it appears Cygwin configuration uses it also.

Suggestions?  

One possibility is to disallow (by convention, perhaps), the use of pread() 
and read() against the same fd.  The only reason ReOpenFile is necessary at 
all is because some code somewhere is mixing read-styles against the same 
fd.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13  8:27               ` Johannes Sixt
  2014-02-13  8:38                 ` David Kastrup
@ 2014-02-13 18:40                 ` Stefan Zager
  1 sibling, 0 replies; 55+ messages in thread
From: Stefan Zager @ 2014-02-13 18:40 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Karsten Blees, Erik Faye-Lund, Stefan Zager, GIT Mailing-list

On Thu, Feb 13, 2014 at 12:27 AM, Johannes Sixt <j.sixt@viscovery.net> wrote:
> Am 2/12/2014 20:30, schrieb Stefan Zager:
>> On Wed, Feb 12, 2014 at 11:22 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
>>> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
>>>> ReOpenFile, that's fantastic. Thanks a lot!
>>>
>>> ...but should be loaded dynamically via GetProcAddress, or are we ready to drop XP support?
>>
>> Right, that is an issue.  From our perspective, it's well past time to
>> drop XP support.
>
> Not from mine.

All this really means is that the build config will test WIN_VER, and
there will need to be an additional binary distribution of msysgit for
newer Windows.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13 18:38             ` Zachary Turner
@ 2014-02-13 22:51               ` Karsten Blees
  2014-02-13 22:53                 ` Stefan Zager
  0 siblings, 1 reply; 55+ messages in thread
From: Karsten Blees @ 2014-02-13 22:51 UTC (permalink / raw)
  To: Zachary Turner, git

Am 13.02.2014 19:38, schrieb Zachary Turner:

> The only reason ReOpenFile is necessary at 
> all is because some code somewhere is mixing read-styles against the same 
> fd.
> 

I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).

I tried without ReOpenFile and it seems to work like a charm, or am I missing something?

----8<----
ssize_t mingw_pread(int fd, void *buf, size_t count, off64_t offset)
{
	DWORD bytes_read;
	OVERLAPPED overlapped;
	memset(&overlapped, 0, sizeof(overlapped));
	overlapped.Offset = (DWORD) offset;
	overlapped.OffsetHigh = (DWORD) (offset >> 32);

	if (!ReadFile((HANDLE) _get_osfhandle(fd), buf, count, &bytes_read,
	    &overlapped)) {
		errno = err_win_to_posix(GetLastError());
		return -1;
	}
	return (ssize_t) bytes_read;
}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13 22:51               ` Karsten Blees
@ 2014-02-13 22:53                 ` Stefan Zager
  2014-02-13 23:09                   ` Zachary Turner
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Zager @ 2014-02-13 22:53 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Zachary Turner, Git Mailing List

On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
> Am 13.02.2014 19:38, schrieb Zachary Turner:
>
>> The only reason ReOpenFile is necessary at
>> all is because some code somewhere is mixing read-styles against the same
>> fd.
>>
>
> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).

That is, apparently, a bald-faced lie in the ReadFile API doc.  First
implementation didn't use ReOpenFile, and it crashed all over the
place.  ReOpenFile fixed it.

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13 22:53                 ` Stefan Zager
@ 2014-02-13 23:09                   ` Zachary Turner
  2014-02-14 19:04                     ` Karsten Blees
  0 siblings, 1 reply; 55+ messages in thread
From: Zachary Turner @ 2014-02-13 23:09 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Karsten Blees, Git Mailing List

To elaborate a little bit more, you can verify with a sample program
that ReadFile with OVERLAPPED does in fact modify the HANDLE's file
position.  The documentation doesn't actually state one way or
another.   My original attempt at a patch didn't have the ReOpenFile,
and we experienced regular read corruption.  We scratched our heads
over it for a bit, and then hypothesized that someone must be mixing
read styles, which led to this ReOpenFile workaround, which
incidentally also solved the corruption problems.  We wrote a similar
sample program to verify that when using ReOpenHandle, and changing
the file pointer of the duplicated handle, that the file pointer of
the original handle is not modified.

We did not actually try to identify the source of the mixed read
styles, but it seems like the only possible explanation.

On Thu, Feb 13, 2014 at 2:53 PM, Stefan Zager <szager@google.com> wrote:
> On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
>> Am 13.02.2014 19:38, schrieb Zachary Turner:
>>
>>> The only reason ReOpenFile is necessary at
>>> all is because some code somewhere is mixing read-styles against the same
>>> fd.
>>>
>>
>> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).
>
> That is, apparently, a bald-faced lie in the ReadFile API doc.  First
> implementation didn't use ReOpenFile, and it crashed all over the
> place.  ReOpenFile fixed it.
>
> Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-13 23:09                   ` Zachary Turner
@ 2014-02-14 19:04                     ` Karsten Blees
       [not found]                       ` <CAAErz9g7ND1htfk=yxRJJLbSEgBi4EV_AHC9uDRptugGWFWcXw@mail.gmail.com>
  2014-02-14 21:49                       ` Stefan Zager
  0 siblings, 2 replies; 55+ messages in thread
From: Karsten Blees @ 2014-02-14 19:04 UTC (permalink / raw)
  To: Zachary Turner, Stefan Zager; +Cc: Git Mailing List

Am 14.02.2014 00:09, schrieb Zachary Turner:
> To elaborate a little bit more, you can verify with a sample program
> that ReadFile with OVERLAPPED does in fact modify the HANDLE's file
> position.  The documentation doesn't actually state one way or
> another.   My original attempt at a patch didn't have the ReOpenFile,
> and we experienced regular read corruption.  We scratched our heads
> over it for a bit, and then hypothesized that someone must be mixing
> read styles, which led to this ReOpenFile workaround, which
> incidentally also solved the corruption problems.  We wrote a similar
> sample program to verify that when using ReOpenHandle, and changing
> the file pointer of the duplicated handle, that the file pointer of
> the original handle is not modified.
> 
> We did not actually try to identify the source of the mixed read
> styles, but it seems like the only possible explanation.
> 
> On Thu, Feb 13, 2014 at 2:53 PM, Stefan Zager <szager@google.com> wrote:
>> On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
>>> Am 13.02.2014 19:38, schrieb Zachary Turner:
>>>
>>>> The only reason ReOpenFile is necessary at
>>>> all is because some code somewhere is mixing read-styles against the same
>>>> fd.
>>>>
>>>
>>> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).
>>
>> That is, apparently, a bald-faced lie in the ReadFile API doc.  First
>> implementation didn't use ReOpenFile, and it crashed all over the
>> place.  ReOpenFile fixed it.
>>
>> Stefan

Damn...you're right, multi-threaded git-index-pack works fine, but some tests fail badly. Mixed reads would have to be from git_mmap, which is the only other caller of pread().

A simple alternative to ReOpenHandle is to reset the file pointer to its original position, as in compat/pread.c::git_pread. Thus single-theaded code can mix read()/pread() at will, but multi-threaded code has to use pread() exclusively (which is usually the case anyway). A main thread using read() and background threads using pread() (which is technically allowed by POSIX) will fail with this solution.

This version passes the test suite on msysgit:

----8<----
ssize_t mingw_pread(int fd, void *buf, size_t count, off64_t offset)
{
	DWORD bytes_read;
	OVERLAPPED overlapped;
	off64_t current;
	memset(&overlapped, 0, sizeof(overlapped));
	overlapped.Offset = (DWORD) offset;
	overlapped.OffsetHigh = (DWORD) (offset >> 32);

	current = lseek64(fd, 0, SEEK_CUR);

	if (!ReadFile((HANDLE)_get_osfhandle(fd), buf, count, &bytes_read, &overlapped)) {
		errno = err_win_to_posix(GetLastError());
		return -1;
	}

	lseek64(fd, current, SEEK_SET);

	return (ssize_t) bytes_read;
}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
       [not found]                       ` <CAAErz9g7ND1htfk=yxRJJLbSEgBi4EV_AHC9uDRptugGWFWcXw@mail.gmail.com>
@ 2014-02-14 19:16                         ` Zachary Turner
  2014-02-14 23:10                           ` Karsten Blees
  2014-02-15  0:45                           ` Duy Nguyen
  2014-02-14 19:52                         ` Stefan Zager
  1 sibling, 2 replies; 55+ messages in thread
From: Zachary Turner @ 2014-02-14 19:16 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Stefan Zager, Git Mailing List

(Gah, sorry if you're receiving multiple emails to your personal
addresses, I need to get used to manually setting Plain-text mode
every time I send a message).

For the mixed read, we wouldn't be looking for another caller of
pread() (since it doesn't care what the file pointer is), but instead
a caller of read() or lseek() (since those do depend on the current
file pointer).  In index-pack.c, I see two possible culprits:

1) A call to xread() from inside fill()
2) A call to lseek in parse_pack_objects()

Do you think these could be related?  If so, maybe that opens up some
other solutions?

BTW, the version you posted isn't thread safe.  Suppose thread A and
thread B execute this function at the same time.  A executes through
the ReadFile(), but does not yet reset the second lseek64.  B then
executes the first lseek64(), storing off the modified file pointer.
Then A finishes, then B finishes.  At the end, the file pointer is
still modified.

On Fri, Feb 14, 2014 at 11:15 AM, Zachary Turner <zturner@chromium.org> wrote:
> For the mixed read, we wouldn't be looking for another caller of pread()
> (since it doesn't care what the file pointer is), but instead a caller of
> read() or lseek().  In index-pack.c, I see two possible culprits:
>
> 1) A call to xread() from inside fill()
> 2) A call to lseek in parse_pack_objects()
>
> Do you think these could be related?  If so, maybe that opens up some other
> solutions?
>
> BTW, the version you posted isn't thread safe.  Suppose thread A and thread
> B execute this function at the same time.  A executes through the
> ReadFile(), but does not yet reset the second lseek64.  B then executes the
> first lseek64(), storing off the modified file pointer.  Then A finishes,
> then B finishes.  At the end, the file pointer is still modified.
>
>
>
> On Fri, Feb 14, 2014 at 11:04 AM, Karsten Blees <karsten.blees@gmail.com>
> wrote:
>>
>> Am 14.02.2014 00:09, schrieb Zachary Turner:
>> > To elaborate a little bit more, you can verify with a sample program
>> > that ReadFile with OVERLAPPED does in fact modify the HANDLE's file
>> > position.  The documentation doesn't actually state one way or
>> > another.   My original attempt at a patch didn't have the ReOpenFile,
>> > and we experienced regular read corruption.  We scratched our heads
>> > over it for a bit, and then hypothesized that someone must be mixing
>> > read styles, which led to this ReOpenFile workaround, which
>> > incidentally also solved the corruption problems.  We wrote a similar
>> > sample program to verify that when using ReOpenHandle, and changing
>> > the file pointer of the duplicated handle, that the file pointer of
>> > the original handle is not modified.
>> >
>> > We did not actually try to identify the source of the mixed read
>> > styles, but it seems like the only possible explanation.
>> >
>> > On Thu, Feb 13, 2014 at 2:53 PM, Stefan Zager <szager@google.com> wrote:
>> >> On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees
>> >> <karsten.blees@gmail.com> wrote:
>> >>> Am 13.02.2014 19:38, schrieb Zachary Turner:
>> >>>
>> >>>> The only reason ReOpenFile is necessary at
>> >>>> all is because some code somewhere is mixing read-styles against the
>> >>>> same
>> >>>> fd.
>> >>>>
>> >>>
>> >>> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify
>> >>> the HANDLE's file position, so you should be able to mix read()/pread()
>> >>> however you like (as long as read() is only called from one thread).
>> >>
>> >> That is, apparently, a bald-faced lie in the ReadFile API doc.  First
>> >> implementation didn't use ReOpenFile, and it crashed all over the
>> >> place.  ReOpenFile fixed it.
>> >>
>> >> Stefan
>>
>> Damn...you're right, multi-threaded git-index-pack works fine, but some
>> tests fail badly. Mixed reads would have to be from git_mmap, which is the
>> only other caller of pread().
>>
>> A simple alternative to ReOpenHandle is to reset the file pointer to its
>> original position, as in compat/pread.c::git_pread. Thus single-theaded code
>> can mix read()/pread() at will, but multi-threaded code has to use pread()
>> exclusively (which is usually the case anyway). A main thread using read()
>> and background threads using pread() (which is technically allowed by POSIX)
>> will fail with this solution.
>>
>> This version passes the test suite on msysgit:
>>
>> ----8<----
>> ssize_t mingw_pread(int fd, void *buf, size_t count, off64_t offset)
>> {
>>         DWORD bytes_read;
>>         OVERLAPPED overlapped;
>>         off64_t current;
>>         memset(&overlapped, 0, sizeof(overlapped));
>>         overlapped.Offset = (DWORD) offset;
>>         overlapped.OffsetHigh = (DWORD) (offset >> 32);
>>
>>         current = lseek64(fd, 0, SEEK_CUR);
>>
>>         if (!ReadFile((HANDLE)_get_osfhandle(fd), buf, count, &bytes_read,
>> &overlapped)) {
>>                 errno = err_win_to_posix(GetLastError());
>>                 return -1;
>>         }
>>
>>         lseek64(fd, current, SEEK_SET);
>>
>>         return (ssize_t) bytes_read;
>> }
>>
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
       [not found]                       ` <CAAErz9g7ND1htfk=yxRJJLbSEgBi4EV_AHC9uDRptugGWFWcXw@mail.gmail.com>
  2014-02-14 19:16                         ` Zachary Turner
@ 2014-02-14 19:52                         ` Stefan Zager
  1 sibling, 0 replies; 55+ messages in thread
From: Stefan Zager @ 2014-02-14 19:52 UTC (permalink / raw)
  To: Zachary Turner; +Cc: Karsten Blees, Git Mailing List

On Fri, Feb 14, 2014 at 11:15 AM, Zachary Turner <zturner@chromium.org> wrote:
> For the mixed read, we wouldn't be looking for another caller of pread()
> (since it doesn't care what the file pointer is), but instead a caller of
> read() or lseek().  In index-pack.c, I see two possible culprits:
>
> 1) A call to xread() from inside fill()
> 2) A call to lseek in parse_pack_objects()
>
> Do you think these could be related?  If so, maybe that opens up some other
> solutions?

>From my observations, it's not that simple.  As you pointed out to me
before, fill() is only called before the threading part of the code,
and lseek is only called after the threading part; and the lseek() is
lseek(fd, 0, SEEK_CUR), so it's purely advisory.

Also, here is the error output we got before you added ReOpenFile:

remote: Total 2514467 (delta 1997300), reused 2513040 (delta 1997113)
Checking connectivity... error: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
does not match index
warning: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
cannot be accessed
error: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
does not match index
warning: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
cannot be accessed
error: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
does not match index
warning: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
cannot be accessed
fatal: bad object e0f9f23f765a45e6d80863a8f881ee735c9347fe


The 'Checking connectivity...' message comes from builtin/clone.c,
which runs in a separate process from builtin/index-pack.c.  What this
suggests to me is that file descriptors for the loose object files are
not being flushed or closed properly before index-pack finishes.


> BTW, the version you posted isn't thread safe.  Suppose thread A and thread
> B execute this function at the same time.  A executes through the
> ReadFile(), but does not yet reset the second lseek64.  B then executes the
> first lseek64(), storing off the modified file pointer.  Then A finishes,
> then B finishes.  At the end, the file pointer is still modified.

Yes, that.  I would also point out that in our experiments, ReOpenFile
is not nearly as expensive as its name might suggest.  Since the
solution using ReOpenFile is pretty solidly thread-safe (at least as
far as we can tell), I'm in favor of using it unless or until we
properly root-case the failure.


Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-14 19:04                     ` Karsten Blees
       [not found]                       ` <CAAErz9g7ND1htfk=yxRJJLbSEgBi4EV_AHC9uDRptugGWFWcXw@mail.gmail.com>
@ 2014-02-14 21:49                       ` Stefan Zager
  1 sibling, 0 replies; 55+ messages in thread
From: Stefan Zager @ 2014-02-14 21:49 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Zachary Turner, Git Mailing List

On Fri, Feb 14, 2014 at 11:04 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
>
> Damn...you're right, multi-threaded git-index-pack works fine, but some tests fail badly. Mixed reads would have to be from git_mmap, which is the only other caller of pread().

msysgit used git_mmap() as defined in compat/win32mmap.c, which does
not use pread.

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-14 19:16                         ` Zachary Turner
@ 2014-02-14 23:10                           ` Karsten Blees
  2014-02-15  0:45                           ` Duy Nguyen
  1 sibling, 0 replies; 55+ messages in thread
From: Karsten Blees @ 2014-02-14 23:10 UTC (permalink / raw)
  To: Zachary Turner; +Cc: Stefan Zager, Git Mailing List

Am 14.02.2014 20:16, schrieb Zachary Turner:
> For the mixed read, we wouldn't be looking for another caller of
> pread() (since it doesn't care what the file pointer is), but instead
> a caller of read() or lseek() (since those do depend on the current
> file pointer).  In index-pack.c, I see two possible culprits:
> 
> 1) A call to xread() from inside fill()
> 2) A call to lseek in parse_pack_objects()
> 
> Do you think these could be related?  If so, maybe that opens up some
> other solutions?
> 

Yeah, I think that's it. The problem is that the single-threaded part (parse_pack_objects/parse_pack_header) _also_ calls pread (via sha1_object -> get_data_from_pack -> unpack_data). So a pread() that modifies the file position would naturally be bad in this single-threaded scenario. Incidentally, that's exactly what the lstat64 in the version below fixes (similar to git_pread).

> BTW, the version you posted isn't thread safe.

It is true that, in a multi-threaded scenario, my version modifies the file position in some indeterministic way. However, as you noted above, the file position is irrelevant to pread(), so that's perfectly thread-safe, as long as all threads use pread() exclusively.

Using [x]read() in one of the threads would _not_ be thread-safe, but we're not doing that here. Both fill()/xread() and parse_pack_objects()/lseek() are unreachable from threaded_second_pass(), and the main thread just waits for the background threads to complete...

>>> A simple alternative to ReOpenHandle is to reset the file pointer to its
>>> original position, as in compat/pread.c::git_pread. Thus single-theaded code
>>> can mix read()/pread() at will, but multi-threaded code has to use pread()
>>> exclusively (which is usually the case anyway). A main thread using read()
>>> and background threads using pread() (which is technically allowed by POSIX)
>>> will fail with this solution.
>>>
>>> This version passes the test suite on msysgit:
>>>
>>> ----8<----
>>> ssize_t mingw_pread(int fd, void *buf, size_t count, off64_t offset)
>>> {
>>>         DWORD bytes_read;
>>>         OVERLAPPED overlapped;
>>>         off64_t current;
>>>         memset(&overlapped, 0, sizeof(overlapped));
>>>         overlapped.Offset = (DWORD) offset;
>>>         overlapped.OffsetHigh = (DWORD) (offset >> 32);
>>>
>>>         current = lseek64(fd, 0, SEEK_CUR);
>>>
>>>         if (!ReadFile((HANDLE)_get_osfhandle(fd), buf, count, &bytes_read,
>>> &overlapped)) {
>>>                 errno = err_win_to_posix(GetLastError());
>>>                 return -1;
>>>         }
>>>
>>>         lseek64(fd, current, SEEK_SET);
>>>
>>>         return (ssize_t) bytes_read;
>>> }
>>>
>>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-14 19:16                         ` Zachary Turner
  2014-02-14 23:10                           ` Karsten Blees
@ 2014-02-15  0:45                           ` Duy Nguyen
  2014-02-15  0:50                             ` Stefan Zager
  1 sibling, 1 reply; 55+ messages in thread
From: Duy Nguyen @ 2014-02-15  0:45 UTC (permalink / raw)
  To: Zachary Turner; +Cc: Karsten Blees, Stefan Zager, Git Mailing List

On Sat, Feb 15, 2014 at 2:16 AM, Zachary Turner <zturner@chromium.org> wrote:
> (Gah, sorry if you're receiving multiple emails to your personal
> addresses, I need to get used to manually setting Plain-text mode
> every time I send a message).
>
> For the mixed read, we wouldn't be looking for another caller of
> pread() (since it doesn't care what the file pointer is), but instead
> a caller of read() or lseek() (since those do depend on the current
> file pointer).  In index-pack.c, I see two possible culprits:
>
> 1) A call to xread() from inside fill()
> 2) A call to lseek in parse_pack_objects()
>
> Do you think these could be related?  If so, maybe that opens up some
> other solutions?

For index-pack alone, what's wrong with open one file handle per thread?
-- 
Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-15  0:45                           ` Duy Nguyen
@ 2014-02-15  0:50                             ` Stefan Zager
  2014-02-15  0:56                               ` Duy Nguyen
  0 siblings, 1 reply; 55+ messages in thread
From: Stefan Zager @ 2014-02-15  0:50 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Zachary Turner, Karsten Blees, Git Mailing List

On Fri, Feb 14, 2014 at 4:45 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sat, Feb 15, 2014 at 2:16 AM, Zachary Turner <zturner@chromium.org> wrote:
>> (Gah, sorry if you're receiving multiple emails to your personal
>> addresses, I need to get used to manually setting Plain-text mode
>> every time I send a message).
>>
>> For the mixed read, we wouldn't be looking for another caller of
>> pread() (since it doesn't care what the file pointer is), but instead
>> a caller of read() or lseek() (since those do depend on the current
>> file pointer).  In index-pack.c, I see two possible culprits:
>>
>> 1) A call to xread() from inside fill()
>> 2) A call to lseek in parse_pack_objects()
>>
>> Do you think these could be related?  If so, maybe that opens up some
>> other solutions?
>
> For index-pack alone, what's wrong with open one file handle per thread?

Nothing wrong with that, except that it would mean either using
thread-local storage (which the code doesn't currently use); or
plumbing pack_fd through the call stack, which doesn't sound very fun.

Stefan

> --
> Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-15  0:50                             ` Stefan Zager
@ 2014-02-15  0:56                               ` Duy Nguyen
  2014-02-15  1:15                                 ` Zachary Turner
  0 siblings, 1 reply; 55+ messages in thread
From: Duy Nguyen @ 2014-02-15  0:56 UTC (permalink / raw)
  To: Stefan Zager; +Cc: Zachary Turner, Karsten Blees, Git Mailing List

On Sat, Feb 15, 2014 at 7:50 AM, Stefan Zager <szager@google.com> wrote:
> On Fri, Feb 14, 2014 at 4:45 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>> On Sat, Feb 15, 2014 at 2:16 AM, Zachary Turner <zturner@chromium.org> wrote:
>>> (Gah, sorry if you're receiving multiple emails to your personal
>>> addresses, I need to get used to manually setting Plain-text mode
>>> every time I send a message).
>>>
>>> For the mixed read, we wouldn't be looking for another caller of
>>> pread() (since it doesn't care what the file pointer is), but instead
>>> a caller of read() or lseek() (since those do depend on the current
>>> file pointer).  In index-pack.c, I see two possible culprits:
>>>
>>> 1) A call to xread() from inside fill()
>>> 2) A call to lseek in parse_pack_objects()
>>>
>>> Do you think these could be related?  If so, maybe that opens up some
>>> other solutions?
>>
>> For index-pack alone, what's wrong with open one file handle per thread?
>
> Nothing wrong with that, except that it would mean either using
> thread-local storage (which the code doesn't currently use); or
> plumbing pack_fd through the call stack, which doesn't sound very fun.

Current code does use thread-local storage (struct thread_local and
get_thread_data). Adding a new file handle when NO_THREAD_SAFE_PREAD
is defined is simpler imo.
-- 
Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-15  0:56                               ` Duy Nguyen
@ 2014-02-15  1:15                                 ` Zachary Turner
  2014-02-15  1:39                                   ` Duy Nguyen
  0 siblings, 1 reply; 55+ messages in thread
From: Zachary Turner @ 2014-02-15  1:15 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Stefan Zager, Karsten Blees, Git Mailing List

Even if we make that change to use TLS for this case, the
implementation of pread() will still change in such a way that the
semantics of pread() are different on Windows.  Is that ok?

Just to summarize, here's the viable approaches I've seen discussed so far:

1) Use _WINVER at compile time to select either a thread-safe or
non-thread-safe implementation of pread.  This is the easiest possible
code change, but would necessitate 2 binary distributions of git for
windows.
2) Use TLS as you suggest and have one fd per pack thread.  Probably
the most complicated code change (at least for me, being a first-time
contributor)
3) Use Karsten's suggested implementation from earlier in the thread.
Seems to work, but it's a little confusing from a readability
standpoint since the implementation is not-thread safe except in this
specific usage context.

On Fri, Feb 14, 2014 at 4:56 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sat, Feb 15, 2014 at 7:50 AM, Stefan Zager <szager@google.com> wrote:
>> On Fri, Feb 14, 2014 at 4:45 PM, Duy Nguyen <pclouds@gmail.com> wrote:
>>> On Sat, Feb 15, 2014 at 2:16 AM, Zachary Turner <zturner@chromium.org> wrote:
>>>> (Gah, sorry if you're receiving multiple emails to your personal
>>>> addresses, I need to get used to manually setting Plain-text mode
>>>> every time I send a message).
>>>>
>>>> For the mixed read, we wouldn't be looking for another caller of
>>>> pread() (since it doesn't care what the file pointer is), but instead
>>>> a caller of read() or lseek() (since those do depend on the current
>>>> file pointer).  In index-pack.c, I see two possible culprits:
>>>>
>>>> 1) A call to xread() from inside fill()
>>>> 2) A call to lseek in parse_pack_objects()
>>>>
>>>> Do you think these could be related?  If so, maybe that opens up some
>>>> other solutions?
>>>
>>> For index-pack alone, what's wrong with open one file handle per thread?
>>
>> Nothing wrong with that, except that it would mean either using
>> thread-local storage (which the code doesn't currently use); or
>> plumbing pack_fd through the call stack, which doesn't sound very fun.
>
> Current code does use thread-local storage (struct thread_local and
> get_thread_data). Adding a new file handle when NO_THREAD_SAFE_PREAD
> is defined is simpler imo.
> --
> Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-15  1:15                                 ` Zachary Turner
@ 2014-02-15  1:39                                   ` Duy Nguyen
  2014-02-18 17:55                                     ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Duy Nguyen @ 2014-02-15  1:39 UTC (permalink / raw)
  To: Zachary Turner; +Cc: Stefan Zager, Karsten Blees, Git Mailing List

On Sat, Feb 15, 2014 at 8:15 AM, Zachary Turner <zturner@chromium.org> wrote:
> Even if we make that change to use TLS for this case, the
> implementation of pread() will still change in such a way that the
> semantics of pread() are different on Windows.  Is that ok?
>
> Just to summarize, here's the viable approaches I've seen discussed so far:
>
> 1) Use _WINVER at compile time to select either a thread-safe or
> non-thread-safe implementation of pread.  This is the easiest possible
> code change, but would necessitate 2 binary distributions of git for
> windows.
> 2) Use TLS as you suggest and have one fd per pack thread.  Probably
> the most complicated code change (at least for me, being a first-time
> contributor)

It's not so complicated. I suggested a patch [1] before (surprise!).

> 3) Use Karsten's suggested implementation from earlier in the thread.
> Seems to work, but it's a little confusing from a readability
> standpoint since the implementation is not-thread safe except in this
> specific usage contex

[1] http://article.gmane.org/gmane.comp.version-control.git/196042
-- 
Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-15  1:39                                   ` Duy Nguyen
@ 2014-02-18 17:55                                     ` Junio C Hamano
  2014-02-18 18:14                                       ` Zachary Turner
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2014-02-18 17:55 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Zachary Turner, Stefan Zager, Karsten Blees, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Feb 15, 2014 at 8:15 AM, Zachary Turner <zturner@chromium.org> wrote:
> ...
>> 2) Use TLS as you suggest and have one fd per pack thread.  Probably
>> the most complicated code change (at least for me, being a first-time
>> contributor)
>
> It's not so complicated. I suggested a patch [1] before (surprise!).
> ...
> [1] http://article.gmane.org/gmane.comp.version-control.git/196042

That message is at the tail end of the discussion. I wonder why
nothing came out of it back then.

While I do not see anything glaringly wrong with the change from a
quick glance over it, it would be nice to hear how well it performs
on their platform from Windows folks.

Thanks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-18 17:55                                     ` Junio C Hamano
@ 2014-02-18 18:14                                       ` Zachary Turner
  0 siblings, 0 replies; 55+ messages in thread
From: Zachary Turner @ 2014-02-18 18:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Stefan Zager, Karsten Blees, Git Mailing List

It shouldn't be hard for us to run some tests with this patch applied.
 Will report back in a day or two.

On Tue, Feb 18, 2014 at 9:55 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
>> On Sat, Feb 15, 2014 at 8:15 AM, Zachary Turner <zturner@chromium.org> wrote:
>> ...
>>> 2) Use TLS as you suggest and have one fd per pack thread.  Probably
>>> the most complicated code change (at least for me, being a first-time
>>> contributor)
>>
>> It's not so complicated. I suggested a patch [1] before (surprise!).
>> ...
>> [1] http://article.gmane.org/gmane.comp.version-control.git/196042
>
> That message is at the tail end of the discussion. I wonder why
> nothing came out of it back then.
>
> While I do not see anything glaringly wrong with the change from a
> quick glance over it, it would be nice to hear how well it performs
> on their platform from Windows folks.
>
> Thanks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2014-02-12  1:54 Make the git codebase thread-safe Stefan Zager
                   ` (3 preceding siblings ...)
  2014-02-13  1:42 ` brian m. carlson
@ 2019-04-02  0:52 ` Matheus Tavares
  2019-04-02  1:07   ` Duy Nguyen
  4 siblings, 1 reply; 55+ messages in thread
From: Matheus Tavares @ 2019-04-02  0:52 UTC (permalink / raw)
  To: git
  Cc: Stefan Zager, Stefan Zager, Zachary Turner, Robin H . Johnson,
	brian m . carlson, David Kastrup, Duy Nguyen, Alfredo Goldman,
	Thomas Gummerer,
	Оля
	Тележная,
	Elijah Newren, Christian Couder

Hi,

I am planning to work on making pack access thread-safe as my GSoC
project, and after that, parallelize git blame or checkout. Or even use
the thread-safe pack access to improve the already parallel grep or
pack-objects.

With this in mind, I would like to know if the problem discussed in this
thread[1] is still an issue on the repos you folks work with (gentoo,
chromium, etc.). And also, could you please let me know which git
commands did you find to me more problematic in them, nowadays?

I downloaded chromium to give it a try and got (on a machine with i7 and
SSD, running Manjaro Linux):

- 17s on blame for a file with long history[2]
- 2m on blame for a huge file[3]
- 15s on log for both [2] and [3]
- 1s for git status

It seems quite a lot, especially with SSD, IMO.

[1] https://public-inbox.org/git/CA+TurHgyUK5sfCKrK+3xY8AeOg0t66vEvFxX=JiA9wXww7eZXQ@mail.gmail.com/
[2] ./chrome/browser/about_flags.cc (same with ./DEPS)
[3] third_party/sqlite/amalgamation/sqlite3.c (7.5M)

Best,
Matheus Tavares

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2019-04-02  0:52 ` Matheus Tavares
@ 2019-04-02  1:07   ` Duy Nguyen
  2019-04-02 10:30     ` David Kastrup
  2019-04-02 19:06     ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 55+ messages in thread
From: Duy Nguyen @ 2019-04-02  1:07 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Stefan Zager, Stefan Zager, Zachary Turner,
	Robin H . Johnson, brian m . carlson, David Kastrup,
	Alfredo Goldman, Thomas Gummerer,
	Оля
	Тележная,
	Elijah Newren, Christian Couder

On Tue, Apr 2, 2019 at 7:52 AM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
> I downloaded chromium to give it a try and got (on a machine with i7 and
> SSD, running Manjaro Linux):
>
> - 17s on blame for a file with long history[2]
> - 2m on blame for a huge file[3]
> - 15s on log for both [2] and [3]
> - 1s for git status
>
> It seems quite a lot, especially with SSD, IMO.

There have been a couple of optimizations that are probably still not
enabled by default because they only benefit large repos. So you may
want to check and turn them on before measuring anything:
commit-graph, pack bitmap, untracked cache or fsmonitor. All these
should be mentioned in 'git help config' (as starting point). Also
search "threads" in that man page because some commands may have multi
threads support but disabled by default for the same reason.

From your command list though, I think you might get the same results
(maybe with a bit faster 'git status') even with all optimizations on.
-- 
Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2019-04-02  1:07   ` Duy Nguyen
@ 2019-04-02 10:30     ` David Kastrup
  2019-04-02 11:35       ` Duy Nguyen
  2019-04-02 19:06     ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 55+ messages in thread
From: David Kastrup @ 2019-04-02 10:30 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Matheus Tavares, Git Mailing List, Stefan Zager, Stefan Zager,
	Zachary Turner, Robin H . Johnson, brian m . carlson,
	Alfredo Goldman, Thomas Gummerer,
	Оля
	Тележная,
	Elijah Newren, Christian Couder

[-- Attachment #1: Type: text/plain, Size: 929 bytes --]

Duy Nguyen <pclouds@gmail.com> writes:

> On Tue, Apr 2, 2019 at 7:52 AM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>> I downloaded chromium to give it a try and got (on a machine with i7 and
>> SSD, running Manjaro Linux):
>>
>> - 17s on blame for a file with long history[2]
>> - 2m on blame for a huge file[3]
>> - 15s on log for both [2] and [3]
>> - 1s for git status
>>
>> It seems quite a lot, especially with SSD, IMO.
>
> There have been a couple of optimizations that are probably still not
> enabled by default because they only benefit large repos.

I've proposed a trivial change in 2014 that could have cut down typical
blame times significantly but nobody was interested in testing and
committing it, and it is conceivable that in limited-memory situations
it might warrant some accounting/mitigation for weird histories (not
that there isn't other code like that).

Rebased/appended.

-- 
David Kastrup

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-blame.c-don-t-drop-origin-blobs-as-eagerly.patch --]
[-- Type: text/x-diff, Size: 1095 bytes --]

From a076daf13d144cb74ae5fd7250445bbfb4669a05 Mon Sep 17 00:00:00 2001
From: David Kastrup <dak@gnu.org>
Date: Sun, 2 Feb 2014 18:33:35 +0100
Subject: [PATCH] blame.c: don't drop origin blobs as eagerly

When a parent blob already has chunks queued up for blaming, dropping
the blob at the end of one blame step will cause it to get reloaded
right away, doubling the amount of I/O and unpacking when processing a
linear history.

Keeping such parent blobs in memory seems like a reasonable optimization
that should incur additional memory pressure mostly when processing the
merges from old branches.
---
 blame.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/blame.c b/blame.c
index 5c07dec190..c11c516921 100644
--- a/blame.c
+++ b/blame.c
@@ -1562,7 +1562,8 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
 	}
 	for (i = 0; i < num_sg; i++) {
 		if (sg_origin[i]) {
-			drop_origin_blob(sg_origin[i]);
+			if (!sg_origin[i]->suspects)
+				drop_origin_blob(sg_origin[i]);
 			blame_origin_decref(sg_origin[i]);
 		}
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2019-04-02 10:30     ` David Kastrup
@ 2019-04-02 11:35       ` Duy Nguyen
  2019-04-02 11:52         ` David Kastrup
  0 siblings, 1 reply; 55+ messages in thread
From: Duy Nguyen @ 2019-04-02 11:35 UTC (permalink / raw)
  To: David Kastrup
  Cc: Matheus Tavares, Git Mailing List, Stefan Zager, Stefan Zager,
	Zachary Turner, Robin H . Johnson, brian m . carlson,
	Alfredo Goldman, Thomas Gummerer,
	Оля
	Тележная,
	Elijah Newren, Christian Couder

On Tue, Apr 2, 2019 at 5:30 PM David Kastrup <dak@gnu.org> wrote:
>
> Duy Nguyen <pclouds@gmail.com> writes:
>
> > On Tue, Apr 2, 2019 at 7:52 AM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >> I downloaded chromium to give it a try and got (on a machine with i7 and
> >> SSD, running Manjaro Linux):
> >>
> >> - 17s on blame for a file with long history[2]
> >> - 2m on blame for a huge file[3]
> >> - 15s on log for both [2] and [3]
> >> - 1s for git status
> >>
> >> It seems quite a lot, especially with SSD, IMO.
> >
> > There have been a couple of optimizations that are probably still not
> > enabled by default because they only benefit large repos.
>
> I've proposed a trivial change in 2014 that could have cut down typical
> blame times significantly but nobody was interested in testing and
> committing it, and it is conceivable that in limited-memory situations
> it might warrant some accounting/mitigation for weird histories (not
> that there isn't other code like that).

I didn't really read the patch (I don't know much about blame.c to
really contribute anything there). But a quick "git blame --show-stats
unpack-trees.c" shows this

Without the patch:

num read blob: 767
num get patch: 425
num commits: 343

With the patch:

num read blob: 419
num get patch: 425
num commits: 343

That's a nice reduction of blob reading. On a typical small file, the
actual time saving might be not much. But it could really help when
you blame a large file.

Perhaps you could resubmit it again for inclusion? (at least a
sign-off-by is missing then)

> Rebased/appended.
>
> --
> David Kastrup
-- 
Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2019-04-02 11:35       ` Duy Nguyen
@ 2019-04-02 11:52         ` David Kastrup
  0 siblings, 0 replies; 55+ messages in thread
From: David Kastrup @ 2019-04-02 11:52 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Matheus Tavares, Git Mailing List, Stefan Zager, Stefan Zager,
	Zachary Turner, Robin H . Johnson, brian m . carlson,
	Alfredo Goldman, Thomas Gummerer,
	Оля
	Тележная,
	Elijah Newren, Christian Couder

Duy Nguyen <pclouds@gmail.com> writes:

> On Tue, Apr 2, 2019 at 5:30 PM David Kastrup <dak@gnu.org> wrote:
>>
>> Duy Nguyen <pclouds@gmail.com> writes:
>>
>> > On Tue, Apr 2, 2019 at 7:52 AM Matheus Tavares
>> > <matheus.bernardino@usp.br> wrote:
>> >> I downloaded chromium to give it a try and got (on a machine with i7 and
>> >> SSD, running Manjaro Linux):
>> >>
>> >> - 17s on blame for a file with long history[2]
>> >> - 2m on blame for a huge file[3]
>> >> - 15s on log for both [2] and [3]
>> >> - 1s for git status
>> >>
>> >> It seems quite a lot, especially with SSD, IMO.
>> >
>> > There have been a couple of optimizations that are probably still not
>> > enabled by default because they only benefit large repos.
>>
>> I've proposed a trivial change in 2014 that could have cut down typical
>> blame times significantly but nobody was interested in testing and
>> committing it, and it is conceivable that in limited-memory situations
>> it might warrant some accounting/mitigation for weird histories (not
>> that there isn't other code like that).
>
> I didn't really read the patch (I don't know much about blame.c to
> really contribute anything there). But a quick "git blame --show-stats
> unpack-trees.c" shows this
>
> Without the patch:
>
> num read blob: 767
> num get patch: 425
> num commits: 343
>
> With the patch:
>
> num read blob: 419
> num get patch: 425
> num commits: 343
>
> That's a nice reduction of blob reading. On a typical small file, the
> actual time saving might be not much. But it could really help when
> you blame a large file.
>
> Perhaps you could resubmit it again for inclusion? (at least a
> sign-off-by is missing then)

I don't expect it to go anywhere but will do.  Feel free to herd it.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Make the git codebase thread-safe
  2019-04-02  1:07   ` Duy Nguyen
  2019-04-02 10:30     ` David Kastrup
@ 2019-04-02 19:06     ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 55+ messages in thread
From: Matheus Tavares Bernardino @ 2019-04-02 19:06 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Stefan Zager, Stefan Zager, Zachary Turner,
	Robin H . Johnson, brian m . carlson, David Kastrup,
	Alfredo Goldman, Thomas Gummerer,
	Оля
	Тележная,
	Elijah Newren, Christian Couder

On Mon, Apr 1, 2019 at 10:07 PM Duy Nguyen <pclouds@gmail.com> wrote:
>
> On Tue, Apr 2, 2019 at 7:52 AM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> > I downloaded chromium to give it a try and got (on a machine with i7 and
> > SSD, running Manjaro Linux):
> >
> > - 17s on blame for a file with long history[2]
> > - 2m on blame for a huge file[3]
> > - 15s on log for both [2] and [3]
> > - 1s for git status
> >
> > It seems quite a lot, especially with SSD, IMO.
>
> There have been a couple of optimizations that are probably still not
> enabled by default because they only benefit large repos. So you may
> want to check and turn them on before measuring anything:
> commit-graph, pack bitmap, untracked cache or fsmonitor. All these
> should be mentioned in 'git help config' (as starting point). Also
> search "threads" in that man page because some commands may have multi
> threads support but disabled by default for the same reason.

Nice, thanks for the suggestions!

> From your command list though, I think you might get the same results
> (maybe with a bit faster 'git status') even with all optimizations on.

Yes, you were right. With the optimizations on, I got the following
times on those same files:

- 17~18s on blame for about_flags.cc
- 1m50s~2m on blame for sqlite3.c
- 15s on log for both
- 0.3~0.5s on git status

> --
> Duy

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2019-04-02 19:07 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-12  1:54 Make the git codebase thread-safe Stefan Zager
2014-02-12  2:02 ` Robin H. Johnson
2014-02-12  3:43   ` Duy Nguyen
2014-02-12 11:00     ` Karsten Blees
2014-02-12 23:03       ` Mike Hommey
2014-02-13  0:06         ` Karsten Blees
2014-02-12 18:15     ` Stefan Zager
2014-02-12  2:11 ` Duy Nguyen
2014-02-12 18:12   ` Stefan Zager
2014-02-12 18:33     ` Matthieu Moy
2014-02-12 18:39       ` Stefan Zager
2014-02-12 18:50     ` David Kastrup
2014-02-12 19:02       ` Stefan Zager
2014-02-12 19:15         ` David Kastrup
2014-02-12 23:09           ` Mike Hommey
2014-02-13  6:04             ` David Kastrup
2014-02-13  9:34               ` Mike Hommey
2014-02-13  9:48                 ` Mike Hommey
2014-02-13  8:30           ` David Kastrup
2014-02-12 20:06     ` Junio C Hamano
2014-02-12 20:27       ` Stefan Zager
2014-02-12 23:05         ` Junio C Hamano
2014-02-12 11:59 ` Erik Faye-Lund
2014-02-12 18:20   ` Stefan Zager
2014-02-12 18:27     ` Erik Faye-Lund
2014-02-12 18:34       ` Stefan Zager
2014-02-12 18:37         ` Erik Faye-Lund
2014-02-12 19:22           ` Karsten Blees
2014-02-12 19:30             ` Stefan Zager
2014-02-13  8:27               ` Johannes Sixt
2014-02-13  8:38                 ` David Kastrup
2014-02-13 18:40                 ` Stefan Zager
2014-02-13 18:38             ` Zachary Turner
2014-02-13 22:51               ` Karsten Blees
2014-02-13 22:53                 ` Stefan Zager
2014-02-13 23:09                   ` Zachary Turner
2014-02-14 19:04                     ` Karsten Blees
     [not found]                       ` <CAAErz9g7ND1htfk=yxRJJLbSEgBi4EV_AHC9uDRptugGWFWcXw@mail.gmail.com>
2014-02-14 19:16                         ` Zachary Turner
2014-02-14 23:10                           ` Karsten Blees
2014-02-15  0:45                           ` Duy Nguyen
2014-02-15  0:50                             ` Stefan Zager
2014-02-15  0:56                               ` Duy Nguyen
2014-02-15  1:15                                 ` Zachary Turner
2014-02-15  1:39                                   ` Duy Nguyen
2014-02-18 17:55                                     ` Junio C Hamano
2014-02-18 18:14                                       ` Zachary Turner
2014-02-14 19:52                         ` Stefan Zager
2014-02-14 21:49                       ` Stefan Zager
2014-02-13  1:42 ` brian m. carlson
2019-04-02  0:52 ` Matheus Tavares
2019-04-02  1:07   ` Duy Nguyen
2019-04-02 10:30     ` David Kastrup
2019-04-02 11:35       ` Duy Nguyen
2019-04-02 11:52         ` David Kastrup
2019-04-02 19:06     ` Matheus Tavares Bernardino

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.