git.vger.kernel.org archive mirror
* Should I store large text files on Git LFS?
@ 2017-07-24  2:01 Farshid Zavareh
  2017-07-24  2:29 ` Andrew Ardill
  0 siblings, 1 reply; 14+ messages in thread
From: Farshid Zavareh @ 2017-07-24  2:01 UTC (permalink / raw)
  To: git

Hey all. 

I've been handed a project that uses Git LFS for storing large CSV files.

My understanding is that the main benefit of using Git LFS is to keep the repository small for binary files, where Git can't keep track of the changes and ends up storing a whole copy of the file for each revision. For a text file that problem doesn't exist to begin with, since Git can store only the changes. At the same time, LFS is going to make checkouts unnecessarily slow, not to mention the financial cost of storing the whole file for each revision.

Is there something I'm missing here?

Thanks


* Re: Should I store large text files on Git LFS?
  2017-07-24  2:01 Should I store large text files on Git LFS? Farshid Zavareh
@ 2017-07-24  2:29 ` Andrew Ardill
  2017-07-24  3:46   ` Farshid Zavareh
       [not found]   ` <CANENsPr271w=a4YNOYdrp9UM4L_eA1VZMRP_UrH+NZ+2PWM_qg@mail.gmail.com>
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Ardill @ 2017-07-24  2:29 UTC (permalink / raw)
  To: Farshid Zavareh; +Cc: git

Hi Farshid,

On 24 July 2017 at 12:01, Farshid Zavareh <fhzavareh@gmail.com> wrote:
> I've been handed a project that uses Git LFS for storing large CSV files.
>
> My understanding is that the main benefit of using Git LFS is to keep the repository small for binary files, where Git can't keep track of the changes and ends up storing a whole copy of the file for each revision. For a text file that problem doesn't exist to begin with, since Git can store only the changes. At the same time, LFS is going to make checkouts unnecessarily slow, not to mention the financial cost of storing the whole file for each revision.
>
> Is there something I'm missing here?

Git LFS gives benefits when working on *large* files, not just large
*binary* files.

I can imagine a few reasons for using LFS for some CSV files
(especially the kinds of files I deal with sometimes!).

The main one is that many users don't need or want to download the
large files, or all versions of them. Moreover, you probably don't
care about the changes between versions of those files, or there
would be so many changes that using the git machinery to compare
them would be cumbersome and ineffective.

For me, if I were storing any CSV file over a couple of hundred
megabytes I would consider using something like LFS. An example would
be a large Dun & Bradstreet data file, which I do an analysis on
every quarter. I want to include the file in the repository, so that
the analysis can be replicated later on, but I don't want to add 4GB
of data to the repo every single time the dataset gets updated (also
every quarter). Storing that in LFS would be a good solution then.
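
For what it's worth, the setup for that is only a couple of commands
(this assumes the git-lfs extension is installed; the path and file
name here are just for illustration):

$ git lfs install                 # sets up the LFS filter hooks, once per user
$ git lfs track "data/*.csv"      # writes a tracking rule to .gitattributes
$ git add .gitattributes data/quarterly.csv
$ git commit -m "Track quarterly data via LFS"

The repository itself then stores only a small pointer file per
version; the actual content lives on the LFS server and is fetched at
checkout.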

Regards,

Andrew Ardill


* Re: Should I store large text files on Git LFS?
  2017-07-24  2:29 ` Andrew Ardill
@ 2017-07-24  3:46   ` Farshid Zavareh
  2017-07-24  4:13     ` David Lang
       [not found]   ` <CANENsPr271w=a4YNOYdrp9UM4L_eA1VZMRP_UrH+NZ+2PWM_qg@mail.gmail.com>
  1 sibling, 1 reply; 14+ messages in thread
From: Farshid Zavareh @ 2017-07-24  3:46 UTC (permalink / raw)
  To: Andrew Ardill; +Cc: git

Hi Andrew.

Thanks for your reply.

I'll probably test this myself, but would modifying and committing a 4GB text file actually add 4GB to the repository's size? I anticipate that it won't, since Git keeps track of the changes only, instead of storing a copy of the whole file (whereas this is not the case with binary files, hence the need for LFS).

Kind regards,
Farshid


* Re: Should I store large text files on Git LFS?
  2017-07-24  3:46   ` Farshid Zavareh
@ 2017-07-24  4:13     ` David Lang
  2017-07-24  4:18       ` Farshid Zavareh
       [not found]       ` <CANENsPpdQzBqStGjq4jUsAB0-7U8_SQq+=kjmJe6pJtiXxnYFg@mail.gmail.com>
  0 siblings, 2 replies; 14+ messages in thread
From: David Lang @ 2017-07-24  4:13 UTC (permalink / raw)
  To: Farshid Zavareh; +Cc: Andrew Ardill, git

On Mon, 24 Jul 2017, Farshid Zavareh wrote:

> I'll probably test this myself, but would modifying and committing a 4GB text 
> file actually add 4GB to the repository's size? I anticipate that it won't, 
> since Git keeps track of the changes only, instead of storing a copy of the 
> whole file (whereas this is not the case with binary files, hence the need for 
> LFS).

Well, it wouldn't be 4GB, because text compresses well; but if the file
changes drastically from version to version (say a quarterly report),
the diff won't help.

David Lang


* Re: Should I store large text files on Git LFS?
  2017-07-24  4:13     ` David Lang
@ 2017-07-24  4:18       ` Farshid Zavareh
       [not found]       ` <CANENsPpdQzBqStGjq4jUsAB0-7U8_SQq+=kjmJe6pJtiXxnYFg@mail.gmail.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Farshid Zavareh @ 2017-07-24  4:18 UTC (permalink / raw)
  To: David Lang; +Cc: Andrew Ardill, git

I see your point. So I guess it really comes down to how the file is anticipated to change. If only one or two lines are going to change every now and then, then LFS is not really necessary. But, as you mentioned, text files that change drastically will affect the repository in the same way that binaries do.



* Re: Should I store large text files on Git LFS?
       [not found]       ` <CANENsPpdQzBqStGjq4jUsAB0-7U8_SQq+=kjmJe6pJtiXxnYFg@mail.gmail.com>
@ 2017-07-24  4:19         ` David Lang
  0 siblings, 0 replies; 14+ messages in thread
From: David Lang @ 2017-07-24  4:19 UTC (permalink / raw)
  To: Farshid Zavareh; +Cc: Andrew Ardill, git

On Mon, 24 Jul 2017, Farshid Zavareh wrote:

> I see your point. So I guess it really comes down to how the file is
> anticipated to change. If only one or two lines are going to change every
> now and then, then LFS is not really necessary. But, as you mentioned, text
> files that change drastically will affect the repository in the same way
> that binaries do.

Not quite the same way that binaries do, because text files compress
well. But close.

David Lang


* Re: Should I store large text files on Git LFS?
       [not found]   ` <CANENsPr271w=a4YNOYdrp9UM4L_eA1VZMRP_UrH+NZ+2PWM_qg@mail.gmail.com>
@ 2017-07-24  4:58     ` Andrew Ardill
  2017-07-24 18:11       ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Ardill @ 2017-07-24  4:58 UTC (permalink / raw)
  To: Farshid Zavareh; +Cc: git

Hi Farshid,

On 24 July 2017 at 13:45, Farshid Zavareh <fhzavareh@gmail.com> wrote:
> I'll probably test this myself, but would modifying and committing a 4GB
> text file actually add 4GB to the repository's size? I anticipate that it
> won't, since Git keeps track of the changes only, instead of storing a copy
> of the whole file (whereas this is not the case with binary files, hence the
> need for LFS).

I decided to do a little test myself. I added three versions of the same
data set (sometimes slightly different cuts of the parent data set,
which I don't have), each between 2 and 4GB in size.
Each time I added a new version it added ~500MB to the repository, and
operations on the repository took 35-45 seconds to complete.
Running `git gc` compressed the objects fairly well, saving ~400MB of
space. I would imagine that even more space would be saved
(proportionally) if there were a lot more similar files in the repo.
The time to check out different commits didn't change much; I presume
that most of the time is spent copying the large file into the working
directory, but I didn't test that. I did test adding some other small
files, and sometimes it was slow (when the cache was cold, I think?)
and other times fast.

Overall, I think that as long as the files change rarely and the
repository remains responsive, having these large files in the
repository is ok. They're still big, though, and if most people will
never use them it will be annoying for them to clone and check out
updated versions of the files. If you have a lot of these files, or
they update often, or most people don't need all of them, using
something like LFS will help a lot.

$ git version  # running on my windows machine at work
git version 2.6.3.windows.1

$ git init git-csv-test && cd git-csv-test
$ du -h --max-depth=2  # including here to compare after large data files are added
35K     ./.git/hooks
1.0K    ./.git/info
0       ./.git/objects
0       ./.git/refs
43K     ./.git
43K     .

$ git add data.txt  # first version of the data file, 3.2 GB
$ git commit
$ du -h --max-depth=2  # the data gets compressed down to ~580M of objects in the git store
35K     ./.git/hooks
1.0K    ./.git/info
2.0K    ./.git/logs
580M    ./.git/objects
1.0K    ./.git/refs
581M    ./.git
3.7G    .


$ git add data.txt  # second version of the data file, 3.6 GB
$ git commit
$ du -h --max-depth=1  # an extra ~520M of objects added
1.2G    ./.git
4.7G    .


$ time git add data.txt  # 42.344s - third version of the data file, 2.2 GB
$ git commit  # takes about 30 seconds to load editor
$ du -h --max-depth=1
1.7G    ./.git
3.9G    .

$ time git checkout HEAD^  # 36.509s
$ time git checkout HEAD^  # 44.658s
$ time git checkout master  # 38.267s

$ git gc
$ du -h --max-depth=1
1.3G    ./.git
3.4G    .

$ time git checkout HEAD^  # 34.743s
$ time git checkout HEAD^  # 41.226s

Regards,

Andrew Ardill


* Re: Should I store large text files on Git LFS?
  2017-07-24  4:58     ` Andrew Ardill
@ 2017-07-24 18:11       ` Jeff King
  2017-07-24 19:41         ` Junio C Hamano
  2017-07-25  8:06         ` Andrew Ardill
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff King @ 2017-07-24 18:11 UTC (permalink / raw)
  To: Andrew Ardill; +Cc: Farshid Zavareh, git

On Mon, Jul 24, 2017 at 02:58:38PM +1000, Andrew Ardill wrote:

> On 24 July 2017 at 13:45, Farshid Zavareh <fhzavareh@gmail.com> wrote:
> > I'll probably test this myself, but would modifying and committing a 4GB
> > text file actually add 4GB to the repository's size? I anticipate that it
> > won't, since Git keeps track of the changes only, instead of storing a copy
> > of the whole file (whereas this is not the case with binary files, hence the
> > need for LFS).
> 
> I decided to do a little test myself. I added three versions of the same
> data set (sometimes slightly different cuts of the parent data set,
> which I don't have), each between 2 and 4GB in size.
> Each time I added a new version it added ~500MB to the repository, and
> operations on the repository took 35-45 seconds to complete.
> Running `git gc` compressed the objects fairly well, saving ~400MB of
> space. I would imagine that even more space would be saved
> (proportionally) if there were a lot more similar files in the repo.

Did you tweak core.bigfilethreshold? Git won't actually try to find
deltas on files larger than that (500MB by default). So you might be
seeing just the effects of zlib compression, and not deltas.
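(Running "git config core.bigfilethreshold" prints any configured
value; no output means the 500MB default is in effect.)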

You can always check the delta status after a gc by running:

  git rev-list --objects --all |
  git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'

That should give you a sense of how much you're saving due to zlib (by
comparing the first two numbers for a copy that isn't a delta; i.e.,
with an all-zeros delta base) and how much due to deltas (how much
smaller the first number is for an entry that _is_ a delta).

-Peff


* Re: Should I store large text files on Git LFS?
  2017-07-24 18:11       ` Jeff King
@ 2017-07-24 19:41         ` Junio C Hamano
  2017-07-25  8:06         ` Andrew Ardill
  1 sibling, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2017-07-24 19:41 UTC (permalink / raw)
  To: Jeff King; +Cc: Andrew Ardill, Farshid Zavareh, git

Jeff King <peff@peff.net> writes:

> On Mon, Jul 24, 2017 at 02:58:38PM +1000, Andrew Ardill wrote:
>
>> On 24 July 2017 at 13:45, Farshid Zavareh <fhzavareh@gmail.com> wrote:
>> > I'll probably test this myself, but would modifying and committing a 4GB
>> > text file actually add 4GB to the repository's size? I anticipate that it
>> > won't, since Git keeps track of the changes only, instead of storing a copy
>> > of the whole file (whereas this is not the case with binary files, hence the
>> > need for LFS).
>> 
>> I decided to do a little test myself. I added three versions of the same
>> data set (sometimes slightly different cuts of the parent data set,
>> which I don't have), each between 2 and 4GB in size.
>> Each time I added a new version it added ~500MB to the repository, and
>> operations on the repository took 35-45 seconds to complete.
>> Running `git gc` compressed the objects fairly well, saving ~400MB of
>> space. I would imagine that even more space would be saved
>> (proportionally) if there were a lot more similar files in the repo.
>
> Did you tweak core.bigfilethreshold? Git won't actually try to find
> deltas on files larger than that (500MB by default). So you might be
> seeing just the effects of zlib compression, and not deltas.
>
> You can always check the delta status after a gc by running:
>
>   git rev-list --objects --all |
>   git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'
>
> That should give you a sense of how much you're saving due to zlib (by
> comparing the first two numbers for a copy that isn't a delta; i.e.,
> with an all-zeros delta base) and how much due to deltas (how much
> smaller the first number is for an entry that _is_ a delta).

In addition to that, people need to take into account that "binary
vs text" is a secondary criteria when considering how effective our
deltifying algorithm works on their data.

We use the same xdelta algorithm, which is oblivious to line breaks.
So given two pairs of input files (T1, T2) and (B1, B2), where T1
and B1 are of comparable sizes and T2 and B2 are of comparable sizes,
and where the change made to T1 to produce T2 (e.g. copy byte range
X-Y of T1 to the byte range starting at offset O of T2, insert this
literal byte string of length L, etc.) and the change made to B1 to
produce B2 are of comparable sizes (i.e. the X-Ys and Os are similar),
then when the T's are text and the B's are binary, you should get a
similarly sized delta representing T2 as a delta against T1 and B2
as a delta against B1.

The reason why a typical "binary" file does not delta well is not
inherent to its "binary"-ness but lies elsewhere. It is because tools
that produce "binary" files tend not to care much about preserving
the original bytes, and rewrite more than the limited part that
logically changed. That is what makes their data delta poorly across
versions.

Exceptions are editing exif data without changing the actual image
bits in jpeg files, or editing id3 data without changing the actual
sound bits in mp3 files. Binary files subjected to these kinds of
operations delta very well with Git, because the "edit" is not done
by completely rewriting everything but is confined to a small area
of the file.
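
One way to see this for yourself (the file name is hypothetical, and
this assumes you have exiftool available) is to make a metadata-only
edit to a committed jpeg and then inspect the pack with the cat-file
probe shown earlier in the thread:

  exiftool -Comment='second version' photo.jpg   # metadata only
  git add photo.jpg && git commit -m 'update exif comment'
  git repack -adf
  git rev-list --objects --all |
  git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)' |
  grep photo.jpg

One of the two copies of the file should then show a non-zero delta
base and a much smaller on-disk size than the other.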




* Re: Should I store large text files on Git LFS?
  2017-07-24 18:11       ` Jeff King
  2017-07-24 19:41         ` Junio C Hamano
@ 2017-07-25  8:06         ` Andrew Ardill
  2017-07-25 19:13           ` Jeff King
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Ardill @ 2017-07-25  8:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Farshid Zavareh, git

On 25 July 2017 at 04:11, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 24, 2017 at 02:58:38PM +1000, Andrew Ardill wrote:
>
>> On 24 July 2017 at 13:45, Farshid Zavareh <fhzavareh@gmail.com> wrote:
>> > I'll probably test this myself, but would modifying and committing a 4GB
>> > text file actually add 4GB to the repository's size? I anticipate that it
>> > won't, since Git keeps track of the changes only, instead of storing a copy
>> > of the whole file (whereas this is not the case with binary files, hence the
>> > need for LFS).
>>
>> I decided to do a little test myself. I added three versions of the same
>> data set (sometimes slightly different cuts of the parent data set,
>> which I don't have), each between 2 and 4GB in size.
>> Each time I added a new version it added ~500MB to the repository, and
>> operations on the repository took 35-45 seconds to complete.
>> Running `git gc` compressed the objects fairly well, saving ~400MB of
>> space. I would imagine that even more space would be saved
>> (proportionally) if there were a lot more similar files in the repo.
>
> Did you tweak core.bigfilethreshold? Git won't actually try to find
> deltas on files larger than that (500MB by default). So you might be
> seeing just the effects of zlib compression, and not deltas.

I tweaked nothing!

I assumed the space saving was pretty much just zlib compression; I
wasn't sure how much delta compression we could actually get, or how
long that might take to run.

> You can always check the delta status after a gc by running:
>
>   git rev-list --objects --all |
>   git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'
>
> That should give you a sense of how much you're saving due to zlib (by
> comparing the first two numbers for a copy that isn't a delta; i.e.,
> with an all-zeros delta base) and how much due to deltas (how much
> smaller the first number is for an entry that _is_ a delta).

Let's have a look:

$ git rev-list --objects --all |
  git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'
174 262 0000000000000000000000000000000000000000
171 260 0000000000000000000000000000000000000000
139 212 0000000000000000000000000000000000000000
47 36 0000000000000000000000000000000000000000
377503831 2310238304 0000000000000000000000000000000000000000 data.txt
47 36 0000000000000000000000000000000000000000
500182546 3740427683 0000000000000000000000000000000000000000 data.txt
47 36 0000000000000000000000000000000000000000
447340264 3357717475 0000000000000000000000000000000000000000 data.txt

Yep, all zlib.

What do you think is a reasonable config for storing text files this
large, to get good delta compression, or is it more a matter of trial
and error to find out what works best?

Regards,

Andrew Ardill


* Re: Should I store large text files on Git LFS?
  2017-07-25  8:06         ` Andrew Ardill
@ 2017-07-25 19:13           ` Jeff King
  2017-07-25 20:52             ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2017-07-25 19:13 UTC (permalink / raw)
  To: Andrew Ardill; +Cc: Farshid Zavareh, git

On Tue, Jul 25, 2017 at 06:06:49PM +1000, Andrew Ardill wrote:

> Let's have a look:
> 
> $ git rev-list --objects --all |
>   git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'
> 174 262 0000000000000000000000000000000000000000
> 171 260 0000000000000000000000000000000000000000
> 139 212 0000000000000000000000000000000000000000
> 47 36 0000000000000000000000000000000000000000
> 377503831 2310238304 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 500182546 3740427683 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 447340264 3357717475 0000000000000000000000000000000000000000 data.txt
> 
> Yep, all zlib.

OK, that makes sense.

> What do you think is a reasonable config for storing text files this
> large, to get good delta compression, or is it more a matter of trial
> and error to find out what works best?

I think it would really depend on what's in your repo. If you just have
gigantic text files and no big binaries, and you have enough RAM to do
diffs on the text files, it's not unreasonable to just set
core.bigfilethreshold to something really big and not worry about it.

In general, a diff is going to want memory at least 2x the size of the
file (for the old and new images). And we tend to keep in memory all of
the images for a single tree-diff at one time (so if you touched two
gigantic files in one commit, then "git log -p" is probably going to
peak at having all four before/after images in memory at once).
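For a commit touching two 4GB files, that peak works out to roughly
2 files x 2 images x 4GB = 16GB of file content in memory.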

If you just want deltas but not diffs, you can probably do:

  echo '*.gigantic -diff' >.gitattributes
  git config core.bigfilethreshold 10G

I think that will turn off streaming of the blobs in some code paths,
too. But hopefully a _single_ copy of each file would be OK to hold in
RAM. If it's not, you might also be able to get away with packing once
with:

  git -c core.bigfilethreshold=10G repack -adf

and then further repacks will carry those deltas forward. I think we
only apply the limit when actively searching for new deltas, not when
reusing existing ones.

As you can see, core.bigfilethreshold is a pretty blunt instrument. It
might be nice if .gitattributes understood other types of patterns
besides filenames, so you could do something like:

  echo '[size > 500MB] delta -diff' >.gitattributes

or something like that. I don't think it's come up enough for anybody to
care too much about it or work on it.

-Peff


* Re: Should I store large text files on Git LFS?
  2017-07-25 19:13           ` Jeff King
@ 2017-07-25 20:52             ` Junio C Hamano
  2017-07-25 21:13               ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2017-07-25 20:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Andrew Ardill, Farshid Zavareh, git

Jeff King <peff@peff.net> writes:

> As you can see, core.bigfilethreshold is a pretty blunt instrument. It
> might be nice if .gitattributes understood other types of patterns
> besides filenames, so you could do something like:
>
>   echo '[size > 500MB] delta -diff' >.gitattributes
>
> or something like that. I don't think it's come up enough for anybody to
> care too much about it or work on it.

But attributes are about paths, at which a blob may or may not exist,
so they are a bad fit for conditionals that are based on sizes and
types.


* Re: Should I store large text files on Git LFS?
  2017-07-25 20:52             ` Junio C Hamano
@ 2017-07-25 21:13               ` Jeff King
  2017-07-25 21:38                 ` Stefan Beller
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2017-07-25 21:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Andrew Ardill, Farshid Zavareh, git

On Tue, Jul 25, 2017 at 01:52:46PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > As you can see, core.bigfilethreshold is a pretty blunt instrument. It
> > might be nice if .gitattributes understood other types of patterns
> > besides filenames, so you could do something like:
> >
> >   echo '[size > 500MB] delta -diff' >.gitattributes
> >
> > or something like that. I don't think it's come up enough for anybody to
> > care too much about it or work on it.
> 
> But attributes are about paths, at which a blob may or may not exist,
> so they are a bad fit for conditionals that are based on sizes and
> types.

Do attributes _have_ to be about paths? In practice we often use them to
describe objects, and paths are just the only mechanism we give to refer
to objects.  But it is not actually a correct or rigorous mechanism in
some cases.  For example, imagine I have a .gitattributes with:

  foo -delta
  bar delta

and then imagine I have a tree with both "foo" and "bar" pointing to the
same blob. When I run pack-objects, it wants to know whether to delta
the object. What should it do?

The delta decision is really a property of the object. But the only
mechanism we give for selecting an object is by path, which we know is
not a one-to-one mapping with objects. So the results you get will
depend on which name we happened to see the object under first while
traversing.
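
To make that concrete (the contents here are arbitrary), two paths can
share a single blob:

  echo same-content >foo && cp foo bar
  git add foo bar
  git rev-parse :foo :bar   # prints the same blob id twice

Whichever path the traversal happens to visit first would then decide
which attributes apply.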

I think the case you are getting at is something like clean filters,
where we might not have an object at all. In that case I would argue
that a property of an object could never be satisfied (so neither
"size > 500" nor "size <= 500" could match). Whether object properties
are meaningful is in the eye of the code that is looking up the value.
Or more generally, the set of properties to be matched is in the eye of
the caller. So looking up a clean filter might want to define the size
property based on the working tree size.

-Peff


* Re: Should I store large text files on Git LFS?
  2017-07-25 21:13               ` Jeff King
@ 2017-07-25 21:38                 ` Stefan Beller
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Beller @ 2017-07-25 21:38 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Andrew Ardill, Farshid Zavareh, git

On Tue, Jul 25, 2017 at 2:13 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Jul 25, 2017 at 01:52:46PM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>>
>> > As you can see, core.bigfilethreshold is a pretty blunt instrument. It
>> > might be nice if .gitattributes understood other types of patterns
>> > besides filenames, so you could do something like:
>> >
>> >   echo '[size > 500MB] delta -diff' >.gitattributes
>> >
>> > or something like that. I don't think it's come up enough for anybody to
>> > care too much about it or work on it.
>>
>> But attributes are about paths, at which a blob may or may not exist,
>> so they are a bad fit for conditionals that are based on sizes and
>> types.
>
> Do attributes _have_ to be about paths? In practice we often use them to
> describe objects, and paths are just the only mechanism we give to refer
> to objects.  But it is not actually a correct or rigorous mechanism in
> some cases.  For example, imagine I have a .gitattributes with:
>
>   foo -delta
>   bar delta
>
> and then imagine I have a tree with both "foo" and "bar" pointing to the
> same blob. When I run pack-objects, it wants to know whether to delta
> the object. What should it do?
>
> The delta decision is really a property of the object. But the only
> mechanism we give for selecting an object is by path, which we know is
> not a one-to-one mapping with objects. So the results you get will
> depend on which name we happened to see the object under first while
> traversing.
>
> I think the case you are getting at is something like clean filters,
> where we might not have an object at all. In that case I would argue
> that a property of an object could never be satisfied (so neither
> "size > 500" nor "size <= 500" could match). Whether object properties
> are meaningful is in the eye of the code that is looking up the value.
> Or more generally, the set of properties to be matched is in the eye of
> the caller. So looking up a clean filter might want to define the size
> property based on the working tree size.
>
> -Peff

I recall a similar discussion on the different "big repo" approaches.
Looking at the interface of LFS, there are things such as:

  git lfs fetch --recent
  git lfs fetch --all
  git lfs fetch [--exclude] <pathspec>

so LFS provides ways to address objects both by time and by path,
maybe even combined: "I want everything from <pathspec 1> but only
'recent' things from <pathspec 2>".

Attributes can already be queried by pathspec, and I think if we were
designing from scratch we might put it the other way round:

    delta:
        bar
        everything <500m
    -delta
        foo
        binaries

So in the far future, attributes may learn about more than just the
pathspecs we currently use to assign labels; they could
* take size into account
* use properties derived from the 'file' utility
* be specific about certain objects (historic paths)

