git.vger.kernel.org archive mirror
* Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
  @ 2023-02-01 15:21 14%                 ` demerphq
  0 siblings, 0 replies; 12+ results
From: demerphq @ 2023-02-01 15:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Michal Suchánek, brian m. carlson, Konstantin Ryabitsev,
	Eli Schwartz, Git List

On Wed, 1 Feb 2023 at 14:49, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Wed, Feb 01 2023, demerphq wrote:
>
> > On Wed, 1 Feb 2023, 20:21 Michal Suchánek, <msuchanek@suse.de> wrote:
> >>
> >> On Wed, Feb 01, 2023 at 12:34:06PM +0100, demerphq wrote:
> >> > Why does it have to be gzip? It is not that hard to come up with a
> >
> >> historical reasons?
> >
> > Currently git doesn't advertise that archive creation is stable
> > right[1]? So I wrote that with the assumption that this new
> > compression would only be used when making a new archive with a
> > hypothetical new '--stable' option. So historical reasons don't come
> > up. Or was there some other form of history that you meant?
>
> We haven't advertised it, but people have come to rely on it, as the
> widespread breakages reported when upgrading to v2.38.0 at the start of
> this thread show.
>
> That's unfortunate, and those people probably shouldn't have done that,
> but that's water under the bridge. I think it would be irresponsible to
> change the output willy-nilly at this point, especially when it seems
> rather easy to find some compromise everyone will be happy with.
>
> > I'm just trying to point out here that stable compression is doable
> > and doesn't need to be as complex as specifying a stable gzip format.
> > I am not even saying git should just do this, just that it /could/ if
> > it decided that stability was important, and that doing so wouldn't
> > involve the complexity that Avar was implying would be needed.  Simple
> > compression schemes like LZ variants are pretty straightforward to
> > implement, achieve pretty good compression and can run pretty fast.
> >
> > Yves
> > [1] if it did the issue kicking off this thread would not have
> > happened as there would be a test that would have noticed the change.
>
> I have some patches I'm about to submit to address issues in this
> thread, and it does add *a* test for archive output stability.
>
> But I'm not at all confident that it's exhaustive. I just found it by
> experiment, by locating tests of ours where the "git archive" output at
> the end is different with gzip and "git archive gzip".
>
> But is it guaranteed to find all potential cases where repository
> content might trigger different output with different gzip
> implementations? I don't know, but probably not.
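One cheap sanity check in that direction (a sketch only, assuming a POSIX shell with git on PATH; it verifies self-consistency, not cross-implementation stability) is to archive the same commit twice and compare the bytes:

```shell
# Minimal self-consistency check for "git archive" gzip output: the same
# tree must compress to identical bytes on every run. This does NOT prove
# that different gzip implementations agree, only that one git is stable.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
echo content > file.txt
git add file.txt
GIT_AUTHOR_DATE='2005-04-07T22:13:13Z' GIT_COMMITTER_DATE='2005-04-07T22:13:13Z' \
	git -c user.name=test -c user.email=test@example.com commit -qm init
git archive --format=tgz HEAD > a.tgz
git archive --format=tgz HEAD > b.tgz
cmp a.tgz b.tgz && echo "archive output is self-consistent"
```

A cross-implementation test would additionally have to compare this against every gzip implementation users might decode or reproduce with, which is exactly the hard part.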

BTW, I just happened to be looking at the zstd docs (I am updating
code that uses it), and I saw this:

Zstandard's format is stable and documented in
[RFC8878](https://datatracker.ietf.org/doc/html/rfc8878). Multiple
independent implementations are already available.
This repository represents the reference implementation, provided as
an open-source dual [BSD](LICENSE) and [GPLv2](COPYING) licensed **C**
library,
and a command line utility producing and decoding `.zst`, `.gz`, `.xz`
and `.lz4` files.
Should your project require another programming language,
a list of known ports and bindings is provided on [Zstandard
homepage](http://www.zstd.net/#other-languages).

So it sounds like that is a spec you could use. I'm not sure exactly
what they mean by "stable", but given the .gz compatibility maybe it
would be worth considering. It's a lot faster than zlib. (The library I
support includes Snappy, Zlib, and Zstd, and the latter is faster and
better than the other two.)

Yves
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[relevance 14%]

* git archive -o something.tar.zst but file info just says "POSIX tar archive"
@ 2021-10-10 11:19 11% Bagas Sanjaya
  0 siblings, 0 replies; 12+ results
From: Bagas Sanjaya @ 2021-10-10 11:19 UTC (permalink / raw)
  To: Git Users

Hi,

I noticed the following (possible bug?) when I tried to create a zstd tar
archive (.tar.zst) with `git archive`.

First, I created a plain tar archive with `git archive`, then extracted
it and re-archived it as a zstd tar archive:

```
(on the repo)

$ git archive -o /tmp/something.tar --prefix=something/ HEAD

(outside the repo, on /tmp)

$ tar xvf something.tar
$ tar --zstd -c -v -f something.tar.zst something/
```

I checked that the archive was indeed a zstd tar archive:

```
$ file something.tar.zst
something.tar.zst: Zstandard compressed data (v0.8+), Dictionary ID: None
```

Now I created the same archive with `git archive` directly:

```
(on the repo)

$ git archive -o /tmp/something1.tar.zst --prefix=something/ HEAD
```

But `file` reported something different for that archive:

```
(outside the repo, on /tmp)
$ file something1.tar.zst
something1.tar.zst: POSIX tar archive
```

I expected `something1.tar.zst` to be a proper zstd tar archive, not a
plain tar archive like the one above.
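For what it's worth, the distinction is visible in the raw bytes without `file`. The snippet below is illustrative and builds its own plain tar; the same offset-257 check on `something1.tar.zst` would print `ustar`, while a real zstd file begins with the magic `28 b5 2f fd`:

```shell
# A plain tar has no leading magic number, but the first header carries
# the string "ustar" at byte offset 257; a zstd frame instead starts
# with the 4-byte magic 28 b5 2f fd.
tmp=$(mktemp -d)
echo hello > "$tmp/f"
tar -cf "$tmp/plain.tar" -C "$tmp" f
head -c 4 "$tmp/plain.tar" | od -An -tx1                   # not 28 b5 2f fd
dd if="$tmp/plain.tar" bs=1 skip=257 count=5 2>/dev/null   # prints: ustar
```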

-- 
An old man doll... just what I always wanted! - Clara

* Re: Pain points in Git's patch flow
  2021-04-15 15:45  7% ` Son Luong Ngoc
@ 2021-04-19  2:57  0%   ` Eric Wong
  0 siblings, 0 replies; 12+ results
From: Eric Wong @ 2021-04-19  2:57 UTC (permalink / raw)
  To: Son Luong Ngoc
  Cc: Jonathan Nieder, git, Raxel Gutierrez, mricon, patchwork,
	Junio C Hamano, Taylor Blau, Emily Shaffer

Son Luong Ngoc <sluongng@gmail.com> wrote:
> Hi there,
> 
> I'm not a regular contributor but I have started subscribing to the
> Git Mailing List recently.  So I thought it might be worth sharing my
> personal view on this.
> 
> After writing all of the below, I do realize that I have written quite a
> rant, some of which I think some might consider to be off topic.  For
> that, I do want to apologize beforehand.

Thanks for the feedback, some points below.

>  Tue, Apr 13, 2021 at 11:13:26PM -0700, Jonathan Nieder wrote:
> > Hi,
> > 
> ...
> > 
> > Those four are important in my everyday life.  Questions:
> > 
> >  1. What pain points in the patch flow for git.git are important to
> >     you?
> 
> There are several points I want to highlight:
> 
> 1. Issue about reading the Mailing List:
> 
> - Subscribing to Git's Mailing List is not trivial:
>   It takes a lot of time to set up the email subscription.  I remember
>   having to google through a few documents to get my subscription
>   working.
> 
> - And even after having subscribed, I was bombarded with a set
>   of spam emails that were sent to the mailing list address.  These spam
>   messages range anywhere from absurd to disguising themselves as
>   legitimate users trying to contact you about a shiny new tech product.

Note that subscription is totally optional.

Gmail's mail filters probably aren't very good, perhaps
SpamAssassin or similar filters can be added locally to improve
things for you.

Spam filtering is a complex topic and Google's monopolistic
power probably doesn't inspire them to do better.

> 2. Issue about joining the conversation in the Mailing List:
> 
> - Setting up an email client to reply to the Mailing List was definitely
>   not trivial.  It's not trivial to send a reply without subscribing to
>   the ML (i.e. using headers provided by one of the archives).
>   The list does not accept HTML emails, which many clients
>   use as the default format.  Getting the formatting to work for line
>   wrapping is also a challenge depending on the client that you use.

The spam (and phishing) problem would be worse if HTML mail were
accepted.  Obfuscation/misdirection techniques used by spammers
and phishers aren't available in plain-text.

It's also more expensive to filter + archive HTML mail due to
decoding and size overheads, which makes it more expensive for
others to mirror/fork things.

> - It's a bit intimidating to ask 'trivial questions' about the patch and
>   create 'noise' in the ML.

I'm sorry you feel that way.  I understand the Internet and its
persistence (especially with mail archives :x) can have a
chilling effect on people.  I think the way to balance things is
to allow/encourage anonymity or pseudonyms, but some folks here
might disagree with me for copyright reasons.  OTOH, don't ask,
don't tell :)

(I am not speaking as a representative of the git project)

> 3. Issue with the archive:
> 
> - I don't find the ML archive trivial for newcomers.  It took me a bit
>   of time to realize: 'Oh, if I scroll to the bottom and find the "Thread
>   overview" then I can navigate a mailing thread a lot more easily'.

(I'm the maintainer of public-inbox, the archival software you
seem to be referring to).

I'm not sure how to make "Thread overview" easier to find
without cluttering the display near the top.  Maybe I'll try
aria labels in the Subject: link...

> - The lack of labeling / categorization that I can filter on while browsing
>   through the archive makes the 'browse' experience quite
>   unpleasant.  Search is one way to do it, but a newcomer would not be
>   knowledgeable enough to craft a search query to get the archive view just
>   right.  Perhaps a way to provide a curated set of categories would be
>   nice.

Perhaps TODO files/comments in the source tree are acceptable;
or a regularly-posted mail similar to "What's cooking".

Having a centralized website/tracker would give too much power
and influence to the people/orgs who run the site.  It would likely
either require network access or require learning more software
to synchronize.

> - Lost track of issues / discussion:
>   A quick example would be me searching for Git's zstd support
>   recently with 
> 
>   > https://lore.kernel.org/git/?q=zstandard 
> 
>   and got next to no relevant result.  However if I were to query
> 
>   > 'https://lore.kernel.org/git/?q=zstd'
> 
>   then a very relevant thread from Peff appeared.  I think this could be
>   avoided if the search in the ML archive did more than just match exact
>   text.

I'm planning to support Xapian synonyms for that, but haven't
gotten around to making it configurable+reproducible by admins.
Everything in public-inbox is designed to be reproducible+forkable.

> 4. Lack of way to run test suite / CI:
> 
>   It would be nice if we could discuss patches with CI results as
>   part of the conversation.  Right now we mostly have to run
>   benchmarks/tests manually and paste the results.
> 
>   But for folks who don't have a dev environment ready at hand (newcomers,
>   or during travel with only phone access), it would be nice to
>   have a way to run tests without a dev environment.

Fwiw, the GCC Farm project gives ssh accounts for all free
software contributors, not just gcc hackers: https://cfarm.tetaneutral.net
Perhaps there's other similar services, too.

Slow down and enjoy travel :)  There's very little in free
software urgent enough to require constant attention.  Email is
well-suited for asynchronous work, and nobody should expect
instant replies.  The always-on nature of the modern Internet
and smartphones increases stress and dangerous situations; so I
hope free software hackers aren't contributing to that.

>   This was mostly solved in the context of the work spent on GitHub's
>   Actions workflow.  But if we are discussing the pure patch flow, this
>   is a gap.
> 
> >  2. What tricks do you use to get by with those existing pain points?
> 
> For (1):
> - I had to invest a lot of time into setting up a set of Gmail search
>   filters: move mails with topics that I'm interested in under a special
>   tag while archiving the rest, and regularly check whether anything
>   interesting went to the archive by accident.
> 
> For (2):
> - I had to set up Mutt + Tmux to have a compatible experience sending
>   replies like this one.

Fwiw, git-send-email works for non-patch mails, too.  I don't
want a monoculture around mutt or any particular clients, either.
(I've never used tmux and don't see why it's necessary, here).

Anyways, thanks again for the feedback.

* Re: Pain points in Git's patch flow
  @ 2021-04-15 15:45  7% ` Son Luong Ngoc
  2021-04-19  2:57  0%   ` Eric Wong
  0 siblings, 1 reply; 12+ results
From: Son Luong Ngoc @ 2021-04-15 15:45 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, Raxel Gutierrez, mricon, patchwork, Junio C Hamano,
	Taylor Blau, Emily Shaffer

Hi there,

I'm not a regular contributor but I have started subscribing to the
Git Mailing List recently.  So I thought it might be worth sharing my
personal view on this.

After writing all of the below, I do realize that I have written quite a
rant, some of which I think some might consider to be off topic.  For
that, I do want to apologize beforehand.

 Tue, Apr 13, 2021 at 11:13:26PM -0700, Jonathan Nieder wrote:
> Hi,
> 
...
> 
> Those four are important in my everyday life.  Questions:
> 
>  1. What pain points in the patch flow for git.git are important to
>     you?

There are several points I want to highlight:

1. Issue about reading the Mailing List:

- Subscribing to Git's Mailing List is not trivial:
  It takes a lot of time to set up the email subscription.  I remember
  having to google through a few documents to get my subscription
  working.

- And even after having subscribed, I was bombarded with a set
  of spam emails that were sent to the mailing list address.  These spam
  messages range anywhere from absurd to disguising themselves as
  legitimate users trying to contact you about a shiny new tech product.

2. Issue about joining the conversation in the Mailing List:

- Setting up an email client to reply to the Mailing List was definitely
  not trivial.  It's not trivial to send a reply without subscribing to
  the ML (i.e. using headers provided by one of the archives).
  The list does not accept HTML emails, which many clients
  use as the default format.  Getting the formatting to work for line
  wrapping is also a challenge depending on the client that you use.

- It's a bit intimidating to ask 'trivial questions' about the patch and
  create 'noise' in the ML.

3. Issue with the archive:

- I don't find the ML archive trivial for newcomers.  It took me a bit
  of time to realize: 'Oh, if I scroll to the bottom and find the "Thread
  overview" then I can navigate a mailing thread a lot more easily'.

- The lack of labeling / categorization that I can filter on while browsing
  through the archive makes the 'browse' experience quite
  unpleasant.  Search is one way to do it, but a newcomer would not be
  knowledgeable enough to craft a search query to get the archive view just
  right.  Perhaps a way to provide a curated set of categories would be
  nice.

- Lost track of issues / discussion:
  A quick example would be me searching for Git's zstd support
  recently with 

  > https://lore.kernel.org/git/?q=zstandard 

  and got next to no relevant result.  However if I were to query

  > 'https://lore.kernel.org/git/?q=zstd'

  then a very relevant thread from Peff appeared.  I think this could be
  avoided if the search in the ML archive did more than just match exact
  text.

4. Lack of way to run test suite / CI:

  It would be nice if we could discuss patches with CI results as
  part of the conversation.  Right now we mostly have to run
  benchmarks/tests manually and paste the results.

  But for folks who don't have a dev environment ready at hand (newcomers,
  or during travel with only phone access), it would be nice to
  have a way to run tests without a dev environment.

  This was mostly solved in the context of the work spent on GitHub's
  Actions workflow.  But if we are discussing the pure patch flow, this
  is a gap.

>  2. What tricks do you use to get by with those existing pain points?

For (1):
- I had to invest a lot of time into setting up a set of Gmail search
  filters: move mails with topics that I'm interested in under a special
  tag while archiving the rest, and regularly check whether anything
  interesting went to the archive by accident.

For (2):
- I had to set up Mutt + Tmux to have a compatible experience sending
  replies like this one.

- All the patches I have submitted were through
  > https://github.com/gitgitgadget/git/pulls
  and it was not entirely trivial to get permission to send email from a
  PR.

For (3):
- Spending time reading git blame / git log / commit messages helps
  identify the keywords I need to refine my search results in the ML
  archive.  This requires some commitment and is a barrier to entry for
  newcomers.

- Using services like GitHub Search or Sourcegraph helped a lot in terms
  of navigating through the commit messages / git blame.

For (4):
- I leverage both Github action and a patch that added Gitlab CI to run
  the test suite.

>  3. Do you think patchwork goes in a direction that is likely to help
>     with these?
>
>  4. What other tools would you like to see that could help?

With all that said, I don't know if patchwork will solve the problems
above.  I do understand that the current patch workflow comes with a
certain set of advantages, and adopting another tool will most likely be
a trade-off.

Personally I have been spending more and more time reading through
git.git via the Sourcegraph Web UI, and I would love for its search
feature to extend to searching the Mailing List from a relevant commit
if possible.  I have also tried both GitHub's Codespaces
and Microsoft's DevContainer to set up an opinionated IDE with predefined
tasks that help execute the test suite.  I think these tools (or
their competitors such as Gitpod) are quite ideal for quickly onboarding
new contributors onto a history-rich codebase such as git.git.

Perhaps someone could configure a set of sane defaults, including editor
extensions that would handle email config for first-time users.

As for code review and issue tracking tooling, I don't think there is
a perfect solution.  Any solution (GitHub PRs, GitLab MRs, Gerrit,
Phabricator) would come with its own set of tradeoffs.  I like the
prospect of Patchwork improving the patch workflow, though.  Perhaps I
will give it a try.

> 
> Thanks,
> Jonathan

Thanks,
Son Luong.

* Re: [PATCH] archive: support compression levels beyond 9
  2020-11-09 18:35  0% ` Junio C Hamano
@ 2020-11-09 23:48 14%   ` René Scharfe
  0 siblings, 0 replies; 12+ results
From: René Scharfe @ 2020-11-09 23:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

Am 09.11.20 um 19:35 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> Compression programs like zip, gzip, bzip2 and xz allow adjusting the
>> trade-off between CPU cost and size gain with numerical options from -1
>> for fast compression to -9 for a high compression ratio.  zip also
>> accepts -0 for storing files verbatim.  git archive directly supports
>> these single-digit compression levels for ZIP output and passes them to
>> filters like gzip.
>>
>> Zstandard additionally supports compression level options -10 to -19, or
>> up to -22 with --ultra.  This *seems* to work with git archive in most
>> cases, e.g. it will produce an archive with -19 without complaining, but
>> since it only supports single-digit compression level options, -19 is
>> parsed as -1 followed by -9 and is thus the same as -9.
>>
>> Allow git archive to accept multi-digit compression levels to support
>> the full range supported by zstd.  Explicitly reject them for the ZIP
>> format, as otherwise deflateInit2() would just fail with a somewhat
>> cryptic "stream consistency error".
>
> The implementation looks more like "not enable them for the ZIP
> format", but the symptom observable to end-users is exactly
> "explicitly reject", so that's OK ;-)
>
> As with the usual compression levels, this is only about how the
> deflator finds a better result, and the stream is understandable by
> any existing inflator, right?

Support for higher levels might have been added in later versions of
Zstandard -- https://github.com/facebook/zstd/blob/dev/CHANGELOG
mentions "Command line utility compatible with high compression levels"
for v0.4.0.  I'm not aware of implementations of the algorithm other
than the original one from Facebook, so I don't know how compatible
they are.  It's not a problem we can solve in Git, though.

Side note: Using Zstandard with git archive requires the config setting
tar.tar.zst.command=zstd.
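A sketch of that setup (assuming the zstd binary is installed; the output path and level are illustrative):

```shell
# Register zstd as the filter for *.tar.zst; git archive then pipes the
# tar stream through it, appending the compression level as an argument.
git config tar.tar.zst.command "zstd"
# With this patch, a multi-digit level such as -19 reaches the filter
# instead of being misparsed as the single-digit options -1 and -9:
git archive -o /tmp/repo.tar.zst -19 HEAD
```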

>> diff --git a/archive.c b/archive.c
>> index 3c1541af9e..7a888c5338 100644
>> --- a/archive.c
>> +++ b/archive.c
>> @@ -529,10 +529,12 @@ static int add_file_cb(const struct option *opt, const char *arg, int unset)
>>  	return 0;
>>  }
>>
>> -#define OPT__COMPR(s, v, h, p) \
>> -	OPT_SET_INT_F(s, NULL, v, h, p, PARSE_OPT_NONEG)
>> -#define OPT__COMPR_HIDDEN(s, v, p) \
>> -	OPT_SET_INT_F(s, NULL, v, "", p, PARSE_OPT_NONEG | PARSE_OPT_HIDDEN)
>> +static int number_callback(const struct option *opt, const char *arg, int unset)
>> +{
>> +	BUG_ON_OPT_NEG(unset);
>> +	*(int *)opt->value = strtol(arg, NULL, 10);
>> +	return 0;
>> +}
>>
>>  static int parse_archive_args(int argc, const char **argv,
>>  		const struct archiver **ar, struct archiver_args *args,
>> @@ -561,16 +563,8 @@ static int parse_archive_args(int argc, const char **argv,
>>  		OPT_BOOL(0, "worktree-attributes", &worktree_attributes,
>>  			N_("read .gitattributes in working directory")),
>>  		OPT__VERBOSE(&verbose, N_("report archived files on stderr")),
>> -		OPT__COMPR('0', &compression_level, N_("store only"), 0),
>> -		OPT__COMPR('1', &compression_level, N_("compress faster"), 1),
>> -		OPT__COMPR_HIDDEN('2', &compression_level, 2),
>> -		OPT__COMPR_HIDDEN('3', &compression_level, 3),
>> -		OPT__COMPR_HIDDEN('4', &compression_level, 4),
>> -		OPT__COMPR_HIDDEN('5', &compression_level, 5),
>> -		OPT__COMPR_HIDDEN('6', &compression_level, 6),
>> -		OPT__COMPR_HIDDEN('7', &compression_level, 7),
>> -		OPT__COMPR_HIDDEN('8', &compression_level, 8),
>> -		OPT__COMPR('9', &compression_level, N_("compress better"), 9),
>> +		OPT_NUMBER_CALLBACK(&compression_level,
>> +			N_("set compression level"), number_callback),
>
> Doubly nice.  Adds a feature while removing lines.
>
> Do we miss the description given in "git archive -h" though?
>
>     usage: git archive [<options>] <tree-ish> [<path>...]
>        or: git archive --list
>        ...
>         -v, --verbose         report archived files on stderr
>         -0                    store only
>         -1                    compress faster
>         -9                    compress better
>

Perhaps; I just couldn't cram it all into a single line.  Showing an
acceptable range would be nice and terse, but that depends on the
compressor.

Hmm, adding an option for passing arbitrary options to the filter and
removing the feature flag ARCHIVER_WANT_COMPRESSION_LEVELS from
archive-tar.c would be cleaner overall.  The latter would be a
regression, though.

René

* Re: [PATCH] archive: support compression levels beyond 9
  2020-11-09 16:05  9% [PATCH] archive: support compression levels beyond 9 René Scharfe
@ 2020-11-09 18:35  0% ` Junio C Hamano
  2020-11-09 23:48 14%   ` René Scharfe
  0 siblings, 1 reply; 12+ results
From: Junio C Hamano @ 2020-11-09 18:35 UTC (permalink / raw)
  To: René Scharfe; +Cc: Git Mailing List

René Scharfe <l.s.r@web.de> writes:

> Compression programs like zip, gzip, bzip2 and xz allow adjusting the
> trade-off between CPU cost and size gain with numerical options from -1
> for fast compression to -9 for a high compression ratio.  zip also
> accepts -0 for storing files verbatim.  git archive directly supports
> these single-digit compression levels for ZIP output and passes them to
> filters like gzip.
>
> Zstandard additionally supports compression level options -10 to -19, or
> up to -22 with --ultra.  This *seems* to work with git archive in most
> cases, e.g. it will produce an archive with -19 without complaining, but
> since it only supports single-digit compression level options, -19 is
> parsed as -1 followed by -9 and is thus the same as -9.
>
> Allow git archive to accept multi-digit compression levels to support
> the full range supported by zstd.  Explicitly reject them for the ZIP
> format, as otherwise deflateInit2() would just fail with a somewhat
> cryptic "stream consistency error".

The implementation looks more like "not enable them for the ZIP
format", but the symptom observable to end-users is exactly
"explicitly reject", so that's OK ;-)

As with the usual compression levels, this is only about how the
deflator finds a better result, and the stream is understandable by
any existing inflator, right?

> diff --git a/archive.h b/archive.h
> index 82b226011a..e3d04e8ab3 100644
> --- a/archive.h
> +++ b/archive.h
> @@ -36,6 +36,7 @@ const char *archive_format_from_filename(const char *filename);
>
>  #define ARCHIVER_WANT_COMPRESSION_LEVELS 1
>  #define ARCHIVER_REMOTE 2
> +#define ARCHIVER_HIGH_COMPRESSION_LEVELS 4
>  struct archiver {
>  	const char *name;
>  	int (*write_archive)(const struct archiver *, struct archiver_args *);
> diff --git a/archive-tar.c b/archive-tar.c
> index f1a1447ebd..a971fdc0f6 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -374,7 +374,8 @@ static int tar_filter_config(const char *var, const char *value, void *data)
>  		ar = xcalloc(1, sizeof(*ar));
>  		ar->name = xmemdupz(name, namelen);
>  		ar->write_archive = write_tar_filter_archive;
> -		ar->flags = ARCHIVER_WANT_COMPRESSION_LEVELS;
> +		ar->flags = ARCHIVER_WANT_COMPRESSION_LEVELS |
> +			    ARCHIVER_HIGH_COMPRESSION_LEVELS;

Nice.  

Hindsight tells me that WANT should have been ACCEPT, though---and
an addition of ARCHIVER_ACCEPT_HIGH_COMPRESSION_LEVELS would be in
line with that.  But that probably is too minor---it just stood out
a bit funny to me.

> diff --git a/archive.c b/archive.c
> index 3c1541af9e..7a888c5338 100644
> --- a/archive.c
> +++ b/archive.c
> @@ -529,10 +529,12 @@ static int add_file_cb(const struct option *opt, const char *arg, int unset)
>  	return 0;
>  }
>
> -#define OPT__COMPR(s, v, h, p) \
> -	OPT_SET_INT_F(s, NULL, v, h, p, PARSE_OPT_NONEG)
> -#define OPT__COMPR_HIDDEN(s, v, p) \
> -	OPT_SET_INT_F(s, NULL, v, "", p, PARSE_OPT_NONEG | PARSE_OPT_HIDDEN)
> +static int number_callback(const struct option *opt, const char *arg, int unset)
> +{
> +	BUG_ON_OPT_NEG(unset);
> +	*(int *)opt->value = strtol(arg, NULL, 10);
> +	return 0;
> +}
>
>  static int parse_archive_args(int argc, const char **argv,
>  		const struct archiver **ar, struct archiver_args *args,
> @@ -561,16 +563,8 @@ static int parse_archive_args(int argc, const char **argv,
>  		OPT_BOOL(0, "worktree-attributes", &worktree_attributes,
>  			N_("read .gitattributes in working directory")),
>  		OPT__VERBOSE(&verbose, N_("report archived files on stderr")),
> -		OPT__COMPR('0', &compression_level, N_("store only"), 0),
> -		OPT__COMPR('1', &compression_level, N_("compress faster"), 1),
> -		OPT__COMPR_HIDDEN('2', &compression_level, 2),
> -		OPT__COMPR_HIDDEN('3', &compression_level, 3),
> -		OPT__COMPR_HIDDEN('4', &compression_level, 4),
> -		OPT__COMPR_HIDDEN('5', &compression_level, 5),
> -		OPT__COMPR_HIDDEN('6', &compression_level, 6),
> -		OPT__COMPR_HIDDEN('7', &compression_level, 7),
> -		OPT__COMPR_HIDDEN('8', &compression_level, 8),
> -		OPT__COMPR('9', &compression_level, N_("compress better"), 9),
> +		OPT_NUMBER_CALLBACK(&compression_level,
> +			N_("set compression level"), number_callback),

Doubly nice.  Adds a feature while removing lines.  

Do we miss the description given in "git archive -h" though?

    usage: git archive [<options>] <tree-ish> [<path>...]
       or: git archive --list
       ...
        -v, --verbose         report archived files on stderr
        -0                    store only
        -1                    compress faster
        -9                    compress better


* [PATCH] archive: support compression levels beyond 9
@ 2020-11-09 16:05  9% René Scharfe
  2020-11-09 18:35  0% ` Junio C Hamano
  0 siblings, 1 reply; 12+ results
From: René Scharfe @ 2020-11-09 16:05 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano

Compression programs like zip, gzip, bzip2 and xz allow adjusting the
trade-off between CPU cost and size gain with numerical options from -1
for fast compression to -9 for a high compression ratio.  zip also
accepts -0 for storing files verbatim.  git archive directly supports
these single-digit compression levels for ZIP output and passes them to
filters like gzip.

Zstandard additionally supports compression level options -10 to -19, or
up to -22 with --ultra.  This *seems* to work with git archive in most
cases, e.g. it will produce an archive with -19 without complaining, but
since it only supports single-digit compression level options, -19 is
parsed as -1 followed by -9 and is thus the same as -9.

Allow git archive to accept multi-digit compression levels to support
the full range supported by zstd.  Explicitly reject them for the ZIP
format, as otherwise deflateInit2() would just fail with a somewhat
cryptic "stream consistency error".

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c |  3 ++-
 archive.c     | 26 +++++++++++---------------
 archive.h     |  1 +
 3 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index f1a1447ebd..a971fdc0f6 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -374,7 +374,8 @@ static int tar_filter_config(const char *var, const char *value, void *data)
 		ar = xcalloc(1, sizeof(*ar));
 		ar->name = xmemdupz(name, namelen);
 		ar->write_archive = write_tar_filter_archive;
-		ar->flags = ARCHIVER_WANT_COMPRESSION_LEVELS;
+		ar->flags = ARCHIVER_WANT_COMPRESSION_LEVELS |
+			    ARCHIVER_HIGH_COMPRESSION_LEVELS;
 		ALLOC_GROW(tar_filters, nr_tar_filters + 1, alloc_tar_filters);
 		tar_filters[nr_tar_filters++] = ar;
 	}
diff --git a/archive.c b/archive.c
index 3c1541af9e..7a888c5338 100644
--- a/archive.c
+++ b/archive.c
@@ -529,10 +529,12 @@ static int add_file_cb(const struct option *opt, const char *arg, int unset)
 	return 0;
 }

-#define OPT__COMPR(s, v, h, p) \
-	OPT_SET_INT_F(s, NULL, v, h, p, PARSE_OPT_NONEG)
-#define OPT__COMPR_HIDDEN(s, v, p) \
-	OPT_SET_INT_F(s, NULL, v, "", p, PARSE_OPT_NONEG | PARSE_OPT_HIDDEN)
+static int number_callback(const struct option *opt, const char *arg, int unset)
+{
+	BUG_ON_OPT_NEG(unset);
+	*(int *)opt->value = strtol(arg, NULL, 10);
+	return 0;
+}

 static int parse_archive_args(int argc, const char **argv,
 		const struct archiver **ar, struct archiver_args *args,
@@ -561,16 +563,8 @@ static int parse_archive_args(int argc, const char **argv,
 		OPT_BOOL(0, "worktree-attributes", &worktree_attributes,
 			N_("read .gitattributes in working directory")),
 		OPT__VERBOSE(&verbose, N_("report archived files on stderr")),
-		OPT__COMPR('0', &compression_level, N_("store only"), 0),
-		OPT__COMPR('1', &compression_level, N_("compress faster"), 1),
-		OPT__COMPR_HIDDEN('2', &compression_level, 2),
-		OPT__COMPR_HIDDEN('3', &compression_level, 3),
-		OPT__COMPR_HIDDEN('4', &compression_level, 4),
-		OPT__COMPR_HIDDEN('5', &compression_level, 5),
-		OPT__COMPR_HIDDEN('6', &compression_level, 6),
-		OPT__COMPR_HIDDEN('7', &compression_level, 7),
-		OPT__COMPR_HIDDEN('8', &compression_level, 8),
-		OPT__COMPR('9', &compression_level, N_("compress better"), 9),
+		OPT_NUMBER_CALLBACK(&compression_level,
+			N_("set compression level"), number_callback),
 		OPT_GROUP(""),
 		OPT_BOOL('l', "list", &list,
 			N_("list supported archive formats")),
@@ -617,7 +611,9 @@ static int parse_archive_args(int argc, const char **argv,

 	args->compression_level = Z_DEFAULT_COMPRESSION;
 	if (compression_level != -1) {
-		if ((*ar)->flags & ARCHIVER_WANT_COMPRESSION_LEVELS)
+		int levels_ok = (*ar)->flags & ARCHIVER_WANT_COMPRESSION_LEVELS;
+		int high_ok = (*ar)->flags & ARCHIVER_HIGH_COMPRESSION_LEVELS;
+		if (levels_ok && (compression_level <= 9 || high_ok))
 			args->compression_level = compression_level;
 		else {
 			die(_("Argument not supported for format '%s': -%d"),
diff --git a/archive.h b/archive.h
index 82b226011a..e3d04e8ab3 100644
--- a/archive.h
+++ b/archive.h
@@ -36,6 +36,7 @@ const char *archive_format_from_filename(const char *filename);

 #define ARCHIVER_WANT_COMPRESSION_LEVELS 1
 #define ARCHIVER_REMOTE 2
+#define ARCHIVER_HIGH_COMPRESSION_LEVELS 4
 struct archiver {
 	const char *name;
 	int (*write_archive)(const struct archiver *, struct archiver_args *);
--
2.29.2


* Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4
  2019-02-14 10:14  0%   ` Duy Nguyen
@ 2019-02-15 20:22  0%     ` Ben Peart
  0 siblings, 0 replies; 12+ results
From: Ben Peart @ 2019-02-15 20:22 UTC (permalink / raw)
  To: Duy Nguyen, Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano



On 2/14/2019 5:14 AM, Duy Nguyen wrote:
> On Thu, Feb 14, 2019 at 5:02 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>> Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the
>>> same most of the time. ctime should often be the same (or differs just
>>> slightly). And sometimes mtime is the same as well. st_ino is also
>>> always zero on Windows. We're storing a lot of duplicate values.
>>>
>>> Index v5 handles this
>>
>> This looks really promising.
> 
> I was going to reply to Junio. But it turns out I underestimated
> "varint" encoding overhead and it increases read time too much. I
> might get back and try some optimization when I'm bored, but until
> then this is yet another failed experiment.
> 
>>> As a result of this, v5 reduces file size from 30% (git.git) to
>>> 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file
>>> size is reduced by 63%! An 8.4MB index file is _almost_ acceptable.
>>>

Just for kicks, I tried this out on a couple of repos I have handy.

files  version  index size    %savings
200k   2         25,033,758    0.00%
       3         25,033,758    0.00%
       4         15,269,923   39.00%
       5          9,759,844   61.01%

3m     2        446,123,848    0.00%
       3        446,123,848    0.00%
       4        249,631,640   44.04%
       5         82,147,981   81.59%

The 81% savings is very impressive.  I didn't measure performance, but 
not writing out an extra 167MB to disk has to help.

I'm definitely also interested in your 'sparse index' format ideas, as 
in our 3M-file repos there are typically only a few thousand entries 
that don't have the skip-worktree bit set.  I'm not sure if that is the 
same 'sparse' you had in mind, but it would sure be nice!



I've also contemplated multi-threading the index write code path.  My 
thought was to have the primary thread allocate a buffer and, when it 
is full, have a background thread compute the SHA and write it to disk 
while the primary thread fills the next buffer.

I'm not sure how much it will buy us, as I don't know the relative cost 
of computing the SHA/writing to disk vs. filling the buffer.  I've 
suspected the buffer-filling thread would end up blocked on the 
background thread most of the time, which is why I haven't tried it yet.

>>> Of course we trade off storage with cpu. We now need to spend more
>>> cycles writing or even reading (but still plenty fast compared to
>>> zlib). For reading, I'm counting on multi thread to hide away all this
>>> even if it becomes significant.
>>
>> This would be a bigger change, but have we/you ever done a POC
>> experiment to see how much of this time is eaten up by zlib that
>> wouldn't be eaten up with some of the newer "fast but good enough"
>> compression algorithms, e.g. Snappy and Zstandard?
> 
> I'm quite sure I tried zlib at some point; the only lasting impression
> I have is "not good enough". Other algorithms might improve a bit,
> perhaps on the uncompress/read side, but I find it unlikely we could
> reasonably compress like a hundred megabytes in a few dozen
> milliseconds (a quick google says Snappy compresses 250MB/s, so about
> 400ms per 100MB, too long). Splitting the files and compressing in
> parallel might help. But I will probably focus on "sparse index"
> approach before going that direction.
> 


* Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4
  2019-02-14 10:02 10% ` Ævar Arnfjörð Bjarmason
@ 2019-02-14 10:14  0%   ` Duy Nguyen
  2019-02-15 20:22  0%     ` Ben Peart
  0 siblings, 1 reply; 12+ results
From: Duy Nguyen @ 2019-02-14 10:14 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List, Junio C Hamano

On Thu, Feb 14, 2019 at 5:02 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> > Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the
> > same most of the time. ctime should often be the same (or differs just
> > slightly). And sometimes mtime is the same as well. st_ino is also
> > always zero on Windows. We're storing a lot of duplicate values.
> >
> > Index v5 handles this
>
> This looks really promising.

I was going to reply to Junio. But it turns out I underestimated
"varint" encoding overhead and it increases read time too much. I
might get back and try some optimization when I'm bored, but until
then this is yet another failed experiment.

> > As a result of this, v5 reduces file size from 30% (git.git) to
> > 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file
> > size is reduced by 63%! An 8.4MB index file is _almost_ acceptable.
> >
> > Of course we trade off storage with cpu. We now need to spend more
> > cycles writing or even reading (but still plenty fast compared to
> > zlib). For reading, I'm counting on multi thread to hide away all this
> > even if it becomes significant.
>
> This would be a bigger change, but have we/you ever done a POC
> experiment to see how much of this time is eaten up by zlib that
> wouldn't be eaten up with some of the newer "fast but good enough"
> compression algorithms, e.g. Snappy and Zstandard?

I'm quite sure I tried zlib at some point; the only lasting impression
I have is "not good enough". Other algorithms might improve a bit,
perhaps on the uncompress/read side, but I find it unlikely we could
reasonably compress like a hundred megabytes in a few dozen
milliseconds (a quick google says Snappy compresses 250MB/s, so about
400ms per 100MB, too long). Splitting the files and compressing in
parallel might help. But I will probably focus on "sparse index"
approach before going that direction.
-- 
Duy


* Re: [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4
  @ 2019-02-14 10:02 10% ` Ævar Arnfjörð Bjarmason
  2019-02-14 10:14  0%   ` Duy Nguyen
  0 siblings, 1 reply; 12+ results
From: Ævar Arnfjörð Bjarmason @ 2019-02-14 10:02 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git


On Wed, Feb 13 2019, Nguyễn Thái Ngọc Duy wrote:

> Index file size more or less translates to write time because we hash
> the entire file every time we update the index. And we update the index
> quite often (automatic index refresh is done everywhere). This means
> smaller index files are faster, especially true for very large
> worktrees.
>
> Index v4 attempts to reduce file size by "prefix compressing"
> paths. This reduces file size from 17% (git.git) to 41% (webkit.git,
> deep hierarchy).
>
> Index v5 takes the same idea to the next level. Instead of compressing
> just paths, based on the previous entry, we "compress" a lot more
> fields.
>
> Take a look at stat data, st_dev, st_uid, st_gid and st_mode are the
> same most of the time. ctime should often be the same (or differs just
> slightly). And sometimes mtime is the same as well. st_ino is also
> always zero on Windows. We're storing a lot of duplicate values.
>
> Index v5 handles this

This looks really promising.

>  - by adding a "same mask" per entry. If st_dev is the same as previous
>    entry, for instance, we set "st_dev is the same" flag and will not
>    store it at all, saving 31 bits per entry.
>
>  - even when we store it, "varint" encoding is used. We should rarely
>    need to write out 4 bytes
>
>  - for ctime and mtime, even if we have to store it, we store the offset
>    instead of absolute numbers. This often leads to smaller numbers,
>    which also means fewer bytes to encode.

Sounds good. I wonder if you've thought about/considered a couple of
optimizations on top of this, or if they're possible. Both share the
same theme:

* Instead of adding a "same as last mask" adding "same as Nth
  mask". Something similar exists in the Sereal format (which also has
  other techniques you use, e.g. varint
  https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod#the-copy-tag)

  So instead of:

      <mask1><same><mask2><same><mask1><same> etc.

   You'd have:

      <mask1 (mark1)><same><mask2 (mark2)><same><insert: mark1><same> etc.

   I.e. when you have data that flip-flops a lot you can save space by
   saying "it's the same as existing earlier value at offset N". Maybe
   it doesn't make sense for this data, I don't know...

* For ctime/mtime presumably for dir paths, are these paths tolerant to
  or already out of glob() order? Then perhaps they can be pre-sorted so
  the compression or ctime/mtime offset compression is more effective.

> As a result of this, v5 reduces file size from 30% (git.git) to
> 36% (webkit.git) compared to v4. Comparing to v2, webkit.git index file
> size is reduced by 63%! An 8.4MB index file is _almost_ acceptable.
>
> Of course we trade off storage with cpu. We now need to spend more
> cycles writing or even reading (but still plenty fast compared to
> zlib). For reading, I'm counting on multi thread to hide away all this
> even if it becomes significant.

This would be a bigger change, but have we/you ever done a POC
experiment to see how much of this time is eaten up by zlib that
wouldn't be eaten up with some of the newer "fast but good enough"
compression algorithms, e.g. Snappy and Zstandard?


* Re: SHA1 collisions found
  2017-02-26 21:38 11%                       ` Ævar Arnfjörð Bjarmason
@ 2017-02-26 21:52  0%                         ` Jeff King
  0 siblings, 0 replies; 12+ results
From: Jeff King @ 2017-02-26 21:52 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Linus Torvalds, brian m. carlson, Jason Cooper, ankostis,
	Junio C Hamano, Git Mailing List, Stefan Beller, David Lang,
	Ian Jackson, Joey Hess

On Sun, Feb 26, 2017 at 10:38:35PM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Sun, Feb 26, 2017 at 8:11 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > But yes, SHA3-256 looks like the sane choice. Performance of hashing
> > is important in the sense that it shouldn't _suck_, but is largely
> > secondary. All my profiles on real loads (well, *my* real loads) have
> > shown that zlib performance is actually much more important than SHA1.
> 
> What's the zlib v.s. hash ratio on those profiles? If git is switching
> to another hashing function given the developments in faster
> compression algorithms (gzip v.s. snappy v.s. zstd v.s. lz4)[1] we'll
> probably switch to another compression algorithm sooner rather than later.
> 
> Would compression still be the bottleneck by far with zstd, how about with lz4?
> 
> 1. https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/

zstd does help in normal operations that access lots of blobs. Here are
some timings:

  http://public-inbox.org/git/20161023080552.lma2v6zxmyaiiqz5@sigill.intra.peff.net/

Compression is part of the on-the-wire packfile format, so it introduces
compatibility headaches. Unlike the hash, it _can_ be a local thing
negotiated between the two ends, and a server with zstd data could
convert on-the-fly to zlib. You just wouldn't want to do so on a server
because it's really expensive (or you double your cache footprint to
store both).

If there were a hash flag day, we _could_ make sure all post-flag-day
implementations have zstd, and just start using that (it transparently
handles old zlib data, too). I'm just hesitant to throw in the kitchen
sink and make the hash transition harder than it already is.

Hash performance doesn't matter much for normal read operations. If your
implementation is really _slow_ it does matter for a few operations
(notably index-pack receiving a large push or fetch). Some timings:

  http://public-inbox.org/git/20170223230621.43anex65ndoqbgnf@sigill.intra.peff.net/

If the new algorithm is faster than SHA-1, that might be measurable in
those operations, too, but obviously less dramatic, as hashing is just a
percentage of the total operation (so it can balloon the time if it's
slow, but optimizing it can only save so much).

I don't know if the per-hash setup cost of any of the new algorithms is
higher than SHA-1. We care as much about hashing lots of small content
as we do about sustained throughput of a single hash.

-Peff


* Re: SHA1 collisions found
  @ 2017-02-26 21:38 11%                       ` Ævar Arnfjörð Bjarmason
  2017-02-26 21:52  0%                         ` Jeff King
  0 siblings, 1 reply; 12+ results
From: Ævar Arnfjörð Bjarmason @ 2017-02-26 21:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: brian m. carlson, Jason Cooper, ankostis, Junio C Hamano,
	Git Mailing List, Stefan Beller, David Lang, Ian Jackson,
	Joey Hess

On Sun, Feb 26, 2017 at 8:11 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> But yes, SHA3-256 looks like the sane choice. Performance of hashing
> is important in the sense that it shouldn't _suck_, but is largely
> secondary. All my profiles on real loads (well, *my* real loads) have
> shown that zlib performance is actually much more important than SHA1.

What's the zlib v.s. hash ratio on those profiles? If git is switching
to another hashing function given the developments in faster
compression algorithms (gzip v.s. snappy v.s. zstd v.s. lz4)[1] we'll
probably switch to another compression algorithm sooner rather than later.

Would compression still be the bottleneck by far with zstd, how about with lz4?

1. https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/


2017-02-24 15:13     SHA1 collisions found Ian Jackson
2017-02-24 17:32     ` Junio C Hamano
2017-02-24 17:45       ` David Lang
2017-02-24 18:14         ` Junio C Hamano
2017-02-24 18:58           ` Stefan Beller
2017-02-24 19:20             ` Junio C Hamano
2017-02-24 20:05               ` ankostis
2017-02-24 20:32                 ` Junio C Hamano
2017-02-25  0:31                   ` ankostis
2017-02-26  0:16                     ` Jason Cooper
2017-02-26 17:38                       ` brian m. carlson
2017-02-26 19:11                         ` Linus Torvalds
2017-02-26 21:38 11%                       ` Ævar Arnfjörð Bjarmason
2017-02-26 21:52  0%                         ` Jeff King
2019-02-13 12:08     [PATCH] read-cache.c: index format v5 -- 30% smaller/faster than v4 Nguyễn Thái Ngọc Duy
2019-02-14 10:02 10% ` Ævar Arnfjörð Bjarmason
2019-02-14 10:14  0%   ` Duy Nguyen
2019-02-15 20:22  0%     ` Ben Peart
2020-11-09 16:05  9% [PATCH] archive: support compression levels beyond 9 René Scharfe
2020-11-09 18:35  0% ` Junio C Hamano
2020-11-09 23:48 14%   ` René Scharfe
2021-04-14  6:13     Pain points in Git's patch flow Jonathan Nieder
2021-04-15 15:45  7% ` Son Luong Ngoc
2021-04-19  2:57  0%   ` Eric Wong
2021-10-10 11:19 11% git archive -o something.tar.zst but file info just says "POSIX tar archive" Bagas Sanjaya
2023-01-31  0:06     Stability of git-archive, breaking (?) the Github universe, and a possible solution Eli Schwartz
2023-01-31  9:54     ` brian m. carlson
2023-01-31 15:05       ` Konstantin Ryabitsev
2023-01-31 22:32         ` brian m. carlson
2023-02-01  9:40           ` Ævar Arnfjörð Bjarmason
2023-02-01 11:34             ` demerphq
2023-02-01 12:21               ` Michal Suchánek
2023-02-01 12:48                 ` demerphq
2023-02-01 13:43                   ` Ævar Arnfjörð Bjarmason
2023-02-01 15:21 14%                 ` demerphq
