* Organizing (large) test data in git
@ 2007-02-27 17:58 Bill Lear
  2007-02-27 19:52 ` Johannes Schindelin
  0 siblings, 1 reply; 9+ messages in thread
From: Bill Lear @ 2007-02-27 17:58 UTC (permalink / raw)
  To: git

In my company we generate test data that we want coupled with test
code, and despite the size, we have historically kept our test data
with our code base.

This is becoming a problem.

95% of the size of our 500 meg "code" base is actually test data, and
the size of the test data is likely to increase, perhaps radically.
We are contemplating files on the order of 500 megabytes apiece.

Many of our developers have multiple copies of our code base checked
out, duplicating the test data, so we would like to come up with a
solution to this that minimizes the amount of data we have to check
out.

Personally, I dislike having separate test data and code repos.
Keeping the two synchronized seems like a real pain.  I like to be
able to do things like:

cd component_x
[muck muck muck on part "y"]
mkdir testsuite/component_x.part_y
cd testsuite/component_x.part_y
[muck muck muck]
git commit -a -m "Finished mucking with part y of component x"

Where the directory structure is, essentially:

      component_x/
          testsuite/component_x.part_y

If we separate out the test data, for the above I would have to do
two commits in two repos, switching directories, etc.  And then, there
is the issue of ensuring that checkouts of code also get the associated
data needed.  I can see this being a potential nightmare.

Have others on the list grappled with this and come up with good
solutions with git?

I know there was some talk of sub-modules, but I am not sure whether
that is working yet, or even a viable option here.


Bill


* Re: Organizing (large) test data in git
  2007-02-27 17:58 Organizing (large) test data in git Bill Lear
@ 2007-02-27 19:52 ` Johannes Schindelin
  2007-02-27 20:00   ` Bill Lear
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Schindelin @ 2007-02-27 19:52 UTC (permalink / raw)
  To: Bill Lear; +Cc: git

Hi,

On Tue, 27 Feb 2007, Bill Lear wrote:

> We are contemplating files on the order of 500 megabytes apiece.

I recommend splitting the files so that no file is that large (but the sum 
of them can be). But I think that you really wanted to say that.

I think the problem of large packs is tackled right now by Troy, Shawn and 
Nico. Troy had exactly the same problem AFAIU, and Nico and Shawn are 
working on a new pack file format, which would lift the 4GB limit on packs 
while at it.

This should solve your problems.

Ciao,
Dscho


* Re: Organizing (large) test data in git
  2007-02-27 19:52 ` Johannes Schindelin
@ 2007-02-27 20:00   ` Bill Lear
  2007-02-27 20:14     ` Johannes Schindelin
  0 siblings, 1 reply; 9+ messages in thread
From: Bill Lear @ 2007-02-27 20:00 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On Tuesday, February 27, 2007 at 20:52:38 (+0100) Johannes Schindelin writes:
>Hi,
>
>On Tue, 27 Feb 2007, Bill Lear wrote:
>
>> We are contemplating files on the order of 500 megabytes apiece.
>
>I recommend splitting the files so that no file is that large (but the sum 
>of them can be). But I think that you really wanted to say that.
>
>I think the problem of large packs is tackled right now by Troy, Shawn and 
>Nico. Troy had exactly the same problem AFAIU, and Nico and Shawn are 
>working on a new pack file format, which would lift the 4GB limit on packs 
>while at it.
>
>This should solve your problems.

Welll... it's not really a matter of capacity, though I do agree that
lifting that limit will help.  We are more concerned with time to
clone the repos over the (often very slow) corporate network, for
example.  With future ratios of about 1% code to 99% test data, we
really would like to have a light-weight code repo that we can throw
hither and yon with little care, and a monster data repo that is
(somehow) sanely managed with git as well.  I was just curious if
others had run into the management problems that I mentioned with
separating test data from code and what they may have done to surmount
them.


Bill


* Re: Organizing (large) test data in git
  2007-02-27 20:00   ` Bill Lear
@ 2007-02-27 20:14     ` Johannes Schindelin
  2007-02-27 20:18       ` Bill Lear
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Schindelin @ 2007-02-27 20:14 UTC (permalink / raw)
  To: Bill Lear; +Cc: git

Hi,

On Tue, 27 Feb 2007, Bill Lear wrote:

> On Tuesday, February 27, 2007 at 20:52:38 (+0100) Johannes Schindelin writes:
> >Hi,
> >
> >On Tue, 27 Feb 2007, Bill Lear wrote:
> >
> >> We are contemplating files on the order of 500 megabytes apiece.
> >
> >I recommend splitting the files so that no file is that large (but the sum 
> >of them can be). But I think that you really wanted to say that.
> >
> >I think the problem of large packs is tackled right now by Troy, Shawn and 
> >Nico. Troy had exactly the same problem AFAIU, and Nico and Shawn are 
> >working on a new pack file format, which would lift the 4GB limit on packs 
> >while at it.
> >
> >This should solve your problems.
> 
> Welll... it's not really a matter of capacity, though I do agree that
> lifting that limit will help.  We are more concerned with time to
> clone the repos over the (often very slow) corporate network, for
> example.  With future ratios of about 1% code to 99% test data, we
> really would like to have a light-weight code repo that we can throw
> hither and yon with little care, and a monster data repo that is
> (somehow) sanely managed with git as well.  I was just curious if
> others had run into the management problems that I mentioned with
> separating test data from code and what they may have done to surmount
> them.

Okay I misunderstood, then.

Do shallow clones help you?

Ciao,
Dscho


* Re: Organizing (large) test data in git
  2007-02-27 20:14     ` Johannes Schindelin
@ 2007-02-27 20:18       ` Bill Lear
  2007-02-27 20:22         ` Johannes Schindelin
  0 siblings, 1 reply; 9+ messages in thread
From: Bill Lear @ 2007-02-27 20:18 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On Tuesday, February 27, 2007 at 21:14:07 (+0100) Johannes Schindelin writes:
>> 
>> Welll... it's not really a matter of capacity, though I do agree that
>> lifting that limit will help.  We are more concerned with time to
>> clone the repos over the (often very slow) corporate network, for
>> example.  With future ratios of about 1% code to 99% test data, we
>> really would like to have a light-weight code repo that we can throw
>> hither and yon with little care, and a monster data repo that is
>> (somehow) sanely managed with git as well.  I was just curious if
>> others had run into the management problems that I mentioned with
>> separating test data from code and what they may have done to surmount
>> them.
>
>Okay I misunderstood, then.
>
>Do shallow clones help you?

Hmm, can't answer now, since I don't know what these are.  I shall
investigate and see if they do.


Bill


* Re: Organizing (large) test data in git
  2007-02-27 20:18       ` Bill Lear
@ 2007-02-27 20:22         ` Johannes Schindelin
  2007-02-27 20:41           ` Bill Lear
  2007-02-27 20:49           ` Junio C Hamano
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Schindelin @ 2007-02-27 20:22 UTC (permalink / raw)
  To: Bill Lear; +Cc: git

Hi,

On Tue, 27 Feb 2007, Bill Lear wrote:

> On Tuesday, February 27, 2007 at 21:14:07 (+0100) Johannes Schindelin writes:
> >> 
> >> Welll... it's not really a matter of capacity, though I do agree that
> >> lifting that limit will help.  We are more concerned with time to
> >> clone the repos over the (often very slow) corporate network, for
> >> example.  With future ratios of about 1% code to 99% test data, we
> >> really would like to have a light-weight code repo that we can throw
> >> hither and yon with little care, and a monster data repo that is
> >> (somehow) sanely managed with git as well.  I was just curious if
> >> others had run into the management problems that I mentioned with
> >> separating test data from code and what they may have done to surmount
> >> them.
> >
> >Okay I misunderstood, then.
> >
> >Do shallow clones help you?
> 
> Hmm, can't answer now, since I don't know what these are.  I shall
> investigate and see if they do.

Basically, shallow clones cut off branches at some point, even if those 
commits have references to their parents.

For example, if you have a linear branch HEAD~1000..HEAD, and you want to 
get just the latest two commits, a shallow clone will give you just 
HEAD~1..HEAD, pretending that HEAD~1 is a root commit (a commit without 
parents), when it is not.

But if you do not care for history, instead just for being up-to-date, it 
should help you.
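
The truncation described above can be pictured with a toy model. This is
only a sketch in Python, not how git implements shallow clones: the
Commit type, the commit names c0..c1000 and the helper shallow_view() are
all made up for illustration, and the "keep" argument is deliberately not
tied to any particular "--depth" value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    name: str
    parent: Optional[str]  # name of the parent commit, None for a root commit


def linear_history(length: int) -> List[Commit]:
    """Build c0 <- c1 <- ... ; the last entry plays the role of HEAD."""
    return [
        Commit(f"c{i}", f"c{i - 1}" if i > 0 else None)
        for i in range(length)
    ]


def shallow_view(history: List[Commit], keep: int) -> List[Commit]:
    """Keep the newest `keep` commits and present the oldest kept one as a root.

    The real commit still has a parent; only the local view pretends otherwise.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    kept = history[-keep:]
    cut = kept[0]
    return [Commit(cut.name, None)] + kept[1:]


full = linear_history(1001)      # c0 .. c1000, i.e. HEAD~1000 .. HEAD
view = shallow_view(full, 2)     # only HEAD~1 and HEAD survive
for commit in view:
    print(commit.name, "parent:", commit.parent)
```

In the full history c999 still points at c998; in the shallow view it
appears to have no parent at all, which is the "pretending" part.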

Ciao,
Dscho

P.S.: And no, I don't remember if you need to say "--depth 2" or "--depth 
1" for that... Sorry


* Re: Organizing (large) test data in git
  2007-02-27 20:22         ` Johannes Schindelin
@ 2007-02-27 20:41           ` Bill Lear
  2007-02-27 20:51             ` Johannes Schindelin
  2007-02-27 20:49           ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Bill Lear @ 2007-02-27 20:41 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On Tuesday, February 27, 2007 at 21:22:31 (+0100) Johannes Schindelin writes:
>...
>Basically, shallow clones cut off branches at some point, even if those 
>commits have references to their parents.

Ah, so a sort of temporal surgery.

I don't think this will help, and I don't think this is a unique
git issue, either.  It happens with any system, I would think.

Let's say I have 6 code repos on my system and one data repo.  If I
make changes in one of my code repos that requires a test data
change, I have to move to my test data repo, make the change
there, and commit there.  Then, back in my code repo, I commit also.

Now, instead of one tidy package (a commit) that holds code and test
together in a coherent package, I have two separate commits in two
repos that now have to be coordinated.  Imagine I do more changes in
similar fashion, and others do as well.  Now our lead of the QA
department is pulling his hair out, trying to figure out which commits
in the data directory match those in the code directory so he can do
regressions properly.

As I said, I don't think this is a git-specific issue, but more one
of organizational techniques.  Perhaps there is no good answer...


Bill


* Re: Organizing (large) test data in git
  2007-02-27 20:22         ` Johannes Schindelin
  2007-02-27 20:41           ` Bill Lear
@ 2007-02-27 20:49           ` Junio C Hamano
  1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2007-02-27 20:49 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Bill Lear, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> But if you do not care for history, instead just for being up-to-date, it 
> should help you.

I had a distinct impression that Bill was talking about a narrow
checkout, not a shallow clone, but it probably is just me.

> P.S.: And no, I don't remember if you need to say "--depth 2" or "--depth 
> 1" for that... Sorry

Depth 1 AFAICR meant "the tip plus one parent".


* Re: Organizing (large) test data in git
  2007-02-27 20:41           ` Bill Lear
@ 2007-02-27 20:51             ` Johannes Schindelin
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Schindelin @ 2007-02-27 20:51 UTC (permalink / raw)
  To: Bill Lear; +Cc: git

Hi,

On Tue, 27 Feb 2007, Bill Lear wrote:

> On Tuesday, February 27, 2007 at 21:22:31 (+0100) Johannes Schindelin writes:
>
> > Basically, shallow clones cut off branches at some point, even if 
> > those commits have references to their parents.
> 
> Ah, so a sort of temporal surgery.
> 
> I don't think this will help, and I don't think this is a unique git 
> issue, either.  It happens with any system, I would think.
> 
> Let's say I have 6 code repos on my system and one data repo.  If I make 
> changes in one of my code repos that requires a test data change, I have 
> to move to my test data repo, make the change there, and commit there.  
> Then, back in my code repo, I commit also.
> 
> Now, instead of one tidy package (a commit) that holds code and test 
> together in a coherent package, I have two separate commits in two repos 
> that now have to be coordinated.  Imagine I do more changes in similar 
> fashion, and others do as well.  Now our lead of the QA department is 
> pulling his hair out, trying to figure out which commits in the data 
> directory match those in the code directory so he can do regressions 
> properly.

So, it is _not_ a transport question.

Why not record the commit name (the SHA1) of the code commit that changes 
behaviour in the commit message of the matching test repo commit? Like: if 
abcdef0123 changes a certain output format, and this is expected, fix the 
test, and include a line "Reference-Commit: abcdef0123" in the commit 
message in the test repo.

This could even be done automatically by a simple script to commit 
changes after fixing an issue both in the source _and_ test repo...
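
Such a script might look roughly like the following. It is a minimal
sketch in Python under several assumptions: the function names, the two
repository paths and the exact message layout are hypothetical, and the
only git commands used are "git rev-parse HEAD" and "git commit -a -m",
run through "git -C <path>". It assumes the code change has already been
committed in the code repo.

```python
import re
import subprocess
from typing import List

TRAILER = "Reference-Commit"  # the line name suggested above


def with_reference(message: str, code_sha: str) -> str:
    """Append a Reference-Commit line to a test-repo commit message."""
    return f"{message.rstrip()}\n\n{TRAILER}: {code_sha}\n"


def referenced_commits(message: str) -> List[str]:
    """Return every code commit name recorded in a test-repo commit message."""
    pattern = re.compile(rf"^{TRAILER}:\s*([0-9a-f]+)\s*$", re.MULTILINE)
    return pattern.findall(message)


def commit_test_change(code_repo: str, data_repo: str, message: str) -> str:
    """Commit pending changes in data_repo, referencing code_repo's HEAD.

    Both paths are hypothetical placeholders for local checkouts.
    """
    sha = subprocess.run(
        ["git", "-C", code_repo, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(
        ["git", "-C", data_repo, "commit", "-a", "-m",
         with_reference(message, sha)],
        check=True,
    )
    return sha
```

With something like this, whoever runs regressions could read the test
repo's log, call referenced_commits() on each message, and check out the
matching code commit instead of guessing.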

Ciao,
Dscho
