git.vger.kernel.org archive mirror
* [ANNOUNCE] more archives of this list
@ 2016-07-10  0:48 Eric Wong
  2016-07-10  3:47 ` Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Eric Wong @ 2016-07-10  0:48 UTC
  To: git

Very much a work-in-progress, but NNTP and HTTP/HTTPS sorta work
based on stuff that is on gmane and stuff I'm accumulating by
being a subscriber.

The first two Tor hidden service onions are actually on better
hardware than the non-hidden public-inbox.org one:

nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
nntp://news.public-inbox.org/inbox.comp.version-control.git

http://czquwvybam4bgbro.onion/git
http://hjrcffqmbrq6wope.onion/git
http://ou63pmih66umazou.onion/git
https://public-inbox.org/git/

HTTP URLs are clonable, but I've generated the following fast-export dump:

	https://public-inbox.org/.temp/git.vger.kernel.org-6c38c917e55c.gz
	(362M)

	git init --bare mirror.git
	curl $FAST_EXPORT_GZ_URL | git --git-dir=mirror.git fast-import
	git --git-dir=mirror.git remote add --mirror=fetch origin $URL

I recommend bare repos for importing, since the trees consist of
2/38 SHA-1 paths of Message-IDs and there's nearly 300K messages.
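
For illustration, a minimal sketch of how a 2/38 path could be derived from a Message-ID. It assumes the SHA-1 is taken over the Message-ID with its angle brackets stripped; `mid_to_path` and the sample Message-ID are hypothetical, not taken from the public-inbox code.

```python
import hashlib


def mid_to_path(message_id):
    """Map a Message-ID to a 2/38 path: 2 hex chars, '/', then 38 hex chars."""
    mid = message_id.strip()
    # Assumption: the hash covers the Message-ID without its angle brackets.
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()  # 40 hex chars
    return digest[:2] + "/" + digest[2:]


if __name__ == "__main__":
    # Hypothetical Message-ID, used only to show the shape of the result.
    print(mid_to_path("<example-message@example.invalid>"))
```

With 256 possible two-character prefixes and nearly 300K messages, each prefix directory ends up holding on the order of a thousand entries, which is why checking the tree out is unattractive.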

In contrast, bundles and packs delta poorly and only get down
around 750-800M with aggressive packing
(And I haven't done that in a while.)


Code is AGPL-3.0+: git clone https://public-inbox.org/


Additional mirrors or forks (perhaps different UIs) are very welcome,
as I expect none of my servers or network connections to be reliable.


I have the "public-inbox-watch" command running in screen
watching my Maildirs; it uses a config file which is parseable/writable
using git-config:

==> ~/.public-inbox/config <==
[publicinboxlearn]
	; spam gets moved here for auto-removal:
	watchspam = maildir:/path/to/maildirs/.INBOX.spam
[publicinboxwatch]
	; optional, adds some additional spam checking
	spamcheck = spamc
[publicinbox "git"]
	; git repo for this list
	mainrepo = /path/to/mirror.git

	; this removes the list footer signature:
	filter = PublicInbox::Filter::Vger

	; this is where my git-related mail goes (some of it is from Debian)
	watch = maildir:/path/to/maildirs/.INBOX.git

	; only match messages with the correct List-Id header:
	watchheader = List-Id:<git.vger.kernel.org>

	; next 4 lines are only necessary for HTTP and NNTP servers
	address = git@vger.kernel.org
	url = http://ou63pmih66umazou.onion/git
	newsgroup = inbox.comp.version-control.git
	infourl = http://vger.kernel.org/majordomo-info.html

* Re: [ANNOUNCE] more archives of this list
  2016-07-10  0:48 [ANNOUNCE] more archives of this list Eric Wong
@ 2016-07-10  3:47 ` Eric Wong
  2016-07-28 20:59 ` Eric Wong
  2016-08-05  9:28 ` Jeff King
  2 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-07-10  3:47 UTC
  To: git

Eric Wong <e@80x24.org> wrote:
> 	https://public-inbox.org/.temp/git.vger.kernel.org-6c38c917e55c.gz
> 	(362M)
> 
> 	git init --bare mirror.git
> 	curl $FAST_EXPORT_GZ_URL | git --git-dir=mirror.git fast-import

Oops, that is missing zcat:

	curl $FAST_EXPORT_GZ_URL | zcat | git --git-dir=mirror.git fast-import

> 	git --git-dir=mirror.git remote add --mirror=fetch origin $URL

And I forgot to set a branch for fast-export and just exported
a ref, so importers will need to create master explicitly:

	git update-ref refs/heads/master 6c38c917e55c

* Re: [ANNOUNCE] more archives of this list
  2016-07-10  0:48 [ANNOUNCE] more archives of this list Eric Wong
  2016-07-10  3:47 ` Eric Wong
@ 2016-07-28 20:59 ` Eric Wong
  2016-08-05  9:28 ` Jeff King
  2 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-07-28 20:59 UTC
  To: git

Eric Wong <e@80x24.org> wrote:
> Code is AGPL-3.0+: git clone https://public-inbox.org/
> 
> 
> Additional mirrors or forks (perhaps different UIs) are very welcome,

Btw, it's possible to do quote highlighting with user-side CSS:

https://public-inbox.org/meta/20160709-user-side-css-example@11/

Will probably add classes for diff colors, too, since a git
repository browser with mailing list integration will happen.

For the moment, cgit + examples/cgit-commit-filter.lua allows
searching subjects (at least non-merge ones).

So the commit subject below is a link:
https://bogomips.org/mirrors/git.git/commit/?id=7b35efd734e501f
to:
https://public-inbox.org/git/?x=t&q="fsck_walk():+optionally+name+objects+on+the+go"
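
As a rough sketch of the mapping from a commit subject to a search link (the real filter is the Lua script named above; this is only an illustration, and `subject_search_url` is a hypothetical name). It percent-encodes the quotes and punctuation rather than leaving them literal, which should be an equivalent URL.

```python
from urllib.parse import quote_plus, unquote_plus


def subject_search_url(subject, base="https://public-inbox.org/git/"):
    """Build a search URL that looks up a commit subject as a quoted phrase."""
    # Wrap the subject in double quotes so it is searched as a phrase,
    # then encode it for use in a query string (spaces become '+').
    return base + "?x=t&q=" + quote_plus('"' + subject + '"')


if __name__ == "__main__":
    url = subject_search_url("fsck_walk(): optionally name objects on the go")
    print(url)
    # Decoding the query value gives back the quoted subject.
    print(unquote_plus(url.split("q=", 1)[1]))
```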

* Re: [ANNOUNCE] more archives of this list
  2016-07-10  0:48 [ANNOUNCE] more archives of this list Eric Wong
  2016-07-10  3:47 ` Eric Wong
  2016-07-28 20:59 ` Eric Wong
@ 2016-08-05  9:28 ` Jeff King
  2016-08-05  9:35   ` Jeff King
  2016-08-05 15:04   ` Duy Nguyen
  2 siblings, 2 replies; 11+ messages in thread
From: Jeff King @ 2016-08-05  9:28 UTC
  To: Eric Wong; +Cc: git

On Sun, Jul 10, 2016 at 12:48:13AM +0000, Eric Wong wrote:

> Very much a work-in-progress, but NNTP and HTTP/HTTPS sorta work
> based on stuff that is on gmane and stuff I'm accumulating by
> being a subscriber.

I checked this out when you posted it, and have been using it the past
few weeks. I really like it. I find the URL structure much easier to
navigate than gmane.

I do find it visually a little harder to navigate through threads,
because there's not much styling there, and the messages seem to run
into one another. I don't know if a border around the divs or something
would help. I'm really terrible at that kind of visual design.

> HTTP URLs are clonable, but I've generated the following fast-export dump:
> 
> 	https://public-inbox.org/.temp/git.vger.kernel.org-6c38c917e55c.gz
> 	(362M)
> [...]
> In contrast, bundles and packs delta poorly and only get down
> around 750-800M with aggressive packing

I pulled this down. It is indeed rather huge, and git doesn't perform
all that well with it. All the usual "git is not a database" things
apply, I think.

I noticed in particular that traversing the object graph is _really_
slow. This is very sensitive to the "branchiness" of the tree. I notice
that you use a single level of hash (e.g., d4/9a37e4974...). Since there
are almost 300K messages, the average 2nd-level tree has over 1000 entries
in it, and each commit changes exactly one entry.

So what happens during a traversal is that we see some tree A, look at
all of its entries, and see each of its blobs. Then we see A', the same
tree with one entry different, and we still have to walk each of those
thousand entries, looking up each in a hash only to find "yep, we
already saw that blob".

Whereas if your tree is more tree-like (rather than list-like), you can
cull unchanged sub-trees more frequently. The tradeoff, though, is the
extra overhead in storing the sha1 for the extra level of tree
indirection.
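
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch. It only counts the tree entries that sit on the changed path when one message is added, assuming messages are spread evenly over 256-way directories and every directory is populated. It is not a timing model, and `entries_walked_per_commit` is a hypothetical helper.

```python
def entries_walked_per_commit(messages, levels):
    """Rough count of tree entries examined when one message is added.

    levels=1 models paths like "d4/9a37e4..." (one 256-way directory level);
    levels=2 models paths like "12/34/abcd..." (two 256-way directory levels).
    """
    fanout = 256
    # Every directory level on the changed path has up to 256 entries.
    interior = fanout * levels
    # The changed leaf tree holds this many blobs on average.
    leaf = messages / fanout ** levels
    return interior + leaf


if __name__ == "__main__":
    for levels in (1, 2):
        print(levels, entries_walked_per_commit(291113, levels))
```

For the roughly 291K messages here, that comes to about 1,390 entries per commit with one level and about 516 with two, at the price of one more tree object per commit.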

Here are some timing and size results for various incarnations of the
packfile. The sizes come from:

  # Sum on-disk size and object count per object type.
  git cat-file --batch-all-objects \
               --batch-check='%(objectsize:disk) %(objecttype)' |
  perl -lne '
    /(\d+) (.*)/; $count{$2}++; $size{$2} += $1;
    END { print "$size{$_} ($count{$_}) $_" for sort(keys(%count)) }
  '

And the timings are just "git rev-list --objects --all".

Here's the initial sizes after fast-import:

  536339725 (291113) blob
   63767736 (291154) commit
  929164567 (582290) tree

Yikes, fast-import does a really terrible job of tree deltas (actually,
I'm not even sure it finds tree deltas at all). Notice that blob
contents are bigger than the fast-import stream (which contains all of
those contents!). That's unfortunate, but comes from the fact that we
zlib deflate the objects individually. Whereas the fast-import stream
was compressed as a whole, so the common elements between the emails get
a really good compression ratio.
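
The effect is easy to reproduce outside of git. The sketch below uses made-up messages with shared headers (hypothetical stand-ins, not real list mail) and compares deflating each one separately against deflating them as a single stream.

```python
import zlib


def make_messages(count):
    """Build synthetic messages that share headers and differ only slightly."""
    template = (
        "From: someone@example.invalid\n"
        "To: git@vger.kernel.org\n"
        "List-Id: <git.vger.kernel.org>\n"
        "Subject: [PATCH] example change number {n}\n"
        "\n"
        "This is message {n} in a synthetic thread.\n"
    )
    return [template.format(n=i).encode("utf-8") for i in range(count)]


def compare(messages):
    """Return (total size deflated one by one, size deflated as one stream)."""
    individually = sum(len(zlib.compress(m)) for m in messages)
    as_one_stream = len(zlib.compress(b"".join(messages)))
    return individually, as_one_stream


if __name__ == "__main__":
    print(compare(make_messages(200)))
```

Each separately deflated message has to spell out the common headers again, while the single stream can refer back to the earlier copies.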

There was discussion a long time ago about storing a common zlib
dictionary in the packfile and using it for all of the objects. I don't
recall whether there were any patches, though. It does create some
complications with serving clones/fetches, as they may ask for a subset
of the objects (so you have to send them the whole dictionary, which may
be a lot of overhead if you're only fetching a few objects).
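
As a sketch of the idea only (this is plain zlib preset-dictionary usage, not anything git's pack format does today), Python's zlib bindings accept a `zdict` on both sides. The dictionary contents and helper names below are hypothetical.

```python
import zlib

# Hypothetical shared dictionary: text expected to recur in most objects.
SHARED_DICT = (
    b"From: someone@example.invalid\n"
    b"To: git@vger.kernel.org\n"
    b"List-Id: <git.vger.kernel.org>\n"
)


def compress_with_dict(data, zdict=SHARED_DICT):
    """Deflate data using a preset dictionary."""
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob, zdict=SHARED_DICT):
    """Inflate data; the same dictionary bytes must be supplied."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


if __name__ == "__main__":
    message = SHARED_DICT + b"Subject: hello\n\nA short body.\n"
    print(len(zlib.compress(message)), len(compress_with_dict(message)))
```

The receiver cannot inflate anything without the same dictionary bytes, which is the fetch overhead described above.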

Anyway, here are numbers after an aggressive repack:

  628307898 (291113) blob
   63209416 (291154) commit
   44342440 (582290) tree

Much better trees. Ironically the blobs got worse. I think there are
just too many with similar names and sizes for our heuristics to do a
good job of finding deltas.

Here's what running rev-list looks like:

  real    6m4.933s
  user    6m4.124s
  sys     0m0.616s

Yow, that's pretty painful. Without bitmaps, that's an operation that
every single clone would need to run.

Here's what it looks like with an extra level of hashing (so storing
"12/34/abcd..." instead of "12/34abcd..."):

  628308433 (291113) blob
   63207951 (291154) commit
   60654550 (873339) tree

We're storing a lot more trees, and spending 16MB extra on tree storage.
But here's the rev-list time:

  real    0m55.120s
  user    0m55.016s
  sys     0m0.096s

I didn't try doing an extra level of hashing on top of that (i.e.,
"12/34/ab/cd..."). It might help, but I suspect it's diminishing returns
versus the cost of accessing the extra trees.

The other thing that would probably make a big difference is avoiding
the one-commit-per-message pattern. The commit objects aren't that big,
but each one involves 2 new trees (one with ~1000 entries, and one with
256 entries). If you batched them into blocks of, say, 10 minutes, that
drops the number of commits by half.

Which I computed with:

  git log --reverse --format=%at |
  sort -n |
  perl -lne '
	if (!@block) {
		@block = ($_);
	} else {
		my $diff = $_ - $block[0];
		if ($diff >= 0 && $diff < 600) {
			push @block, $_;
		} else {
			print join(" ", @block);
			@block = ($_);
		}
	}
	END { print join(" ", @block) }
  '

Of course that means your mirror lags by 10 minutes. And you lose the
cool property of "git log --author=peff", though of course that
information is redundant with what is in the blobs. I haven't looked at
the public-inbox code but I would imagine it's mostly operating on the
tip tree.
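
For anyone who prefers it to the Perl above, the same grouping can be sketched in Python: a block starts at its first timestamp and takes every later timestamp less than `window` seconds after that start. `batch_blocks` is a hypothetical name, and the input is assumed to be the `%at` values from git log.

```python
def batch_blocks(timestamps, window=600):
    """Group timestamps into blocks anchored at each block's first entry."""
    blocks = []
    for ts in sorted(timestamps):
        # Join the current block if within `window` seconds of its start.
        if blocks and 0 <= ts - blocks[-1][0] < window:
            blocks[-1].append(ts)
        else:
            blocks.append([ts])
    return blocks


if __name__ == "__main__":
    sample = [0, 100, 599, 600, 1300]  # made-up author timestamps
    blocks = batch_blocks(sample)
    print(len(blocks), "blocks for", len(sample), "commits")
```

Comparing the number of blocks with the number of input timestamps gives the reduction in commit count for a given window.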

If you're willing to give up the cool commits, we could also just squash
the whole archive into a single base commit, and start building there.
We'd run into problems in another 10 years, I guess, but it would be
pretty efficient to start with, at least. :)

> Additional mirrors or forks (perhaps different UIs) are very welcome,
> as I expect none of my servers or network connections to be reliable.

I'm tempted to host a mirror at GitHub, but I'm wary of the Git storage.
I don't think it really scales all that well. Bitmaps help with the cost
of a clone, but they're not magic. We still have to do traversals for a
lot of operations (including repacks).

-Peff

* Re: [ANNOUNCE] more archives of this list
  2016-08-05  9:28 ` Jeff King
@ 2016-08-05  9:35   ` Jeff King
  2016-08-05  9:59     ` Eric Wong
  2016-08-05 18:19     ` Eric Wong
  2016-08-05 15:04   ` Duy Nguyen
  1 sibling, 2 replies; 11+ messages in thread
From: Jeff King @ 2016-08-05  9:35 UTC
  To: Eric Wong; +Cc: git

On Fri, Aug 05, 2016 at 05:28:05AM -0400, Jeff King wrote:

> On Sun, Jul 10, 2016 at 12:48:13AM +0000, Eric Wong wrote:
> 
> > Very much a work-in-progress, but NNTP and HTTP/HTTPS sorta work
> > based on stuff that is on gmane and stuff I'm accumulating by
> > being a subscriber.
> 
> I checked this out when you posted it, and have been using it the past
> few weeks. I really like it. I find the URL structure much easier to
> navigate than gmane.
> 
> I do find it visually a little harder to navigate through threads,
> because there's not much styling there, and the messages seem to run
> into one another. I don't know if a border around the divs or something
> would help. I'm really terrible at that kind of visual design.

I took a peek at doing something simple like:

  pre {
    border-style: solid;
    border-width: 1px;
    background: #dddddd;
  }

but it looks like there's no HTML structure at all to the current
output. It's just one big <pre> tag with various levels of indentation
to represent the messages.

So I guess a potential first step would be actually representing a
thread as:

  <div class="message">
    parent message...
    <div class="message">
      reply...
    </div>
  </div>

and so on.
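
A minimal sketch of generating that nesting, assuming a thread is available as nested dicts with "body" and "replies" keys (a hypothetical shape, not what public-inbox uses internally):

```python
from html import escape


def render_thread(message, indent=0):
    """Render a message and its replies as nested <div class="message">."""
    pad = "  " * indent
    lines = [
        pad + '<div class="message">',
        # Escape the body so message text cannot inject markup.
        pad + "  " + escape(message["body"]),
    ]
    for reply in message.get("replies", []):
        lines.extend(render_thread(reply, indent + 1))
    lines.append(pad + "</div>")
    return lines


if __name__ == "__main__":
    thread = {
        "body": "parent message...",
        "replies": [{"body": "reply...", "replies": []}],
    }
    print("\n".join(render_thread(thread)))
```

With each message in its own element, a rule like the border above could target `.message` instead of the single `<pre>`.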

-Peff

* Re: [ANNOUNCE] more archives of this list
  2016-08-05  9:35   ` Jeff King
@ 2016-08-05  9:59     ` Eric Wong
  2016-08-05 18:19     ` Eric Wong
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-08-05  9:59 UTC
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> wrote:
> On Fri, Aug 05, 2016 at 05:28:05AM -0400, Jeff King wrote:
> 
> > On Sun, Jul 10, 2016 at 12:48:13AM +0000, Eric Wong wrote:
> > 
> > > Very much a work-in-progress, but NNTP and HTTP/HTTPS sorta work
> > > based on stuff that is on gmane and stuff I'm accumulating by
> > > being a subscriber.
> > 
> > I checked this out when you posted it, and have been using it the past
> > few weeks. I really like it. I find the URL structure much easier to
> > navigate than gmane.

Thanks :>

> > I do find it visually a little harder to navigate through threads,
> > because there's not much styling there, and the messages seem to run
> > into one another. I don't know if a border around the divs or something
> > would help. I'm really terrible at that kind of visual design.
> 
> I took a peek at doing something simple like:

I'm trying to keep the visual design consistent across browsers
without CSS support (I mainly use w3m) and CSS scares me:

http://thejh.net/misc/website-terminal-copy-paste

(and JavaScript has me cowering in a corner behind a chair :x)

> but it looks like there's no HTML structure at all to the current
> output. It's just one big <pre> tag with various levels of indentation
> to represent the messages.
> 
> So I guess a potential first step would be actually representing a
> thread as:
> 
>   <div class="message">
>     parent message...
>     <div class="message">
>       reply...
>     </div>
>   </div>

The <ul><li>... in the /$MID/t/ (as opposed to /$MID/T/) endpoint
might be what you're looking for.  See the "[flat|threaded]" links.

I run out of horizontal space with the giant fonts I like to
use, but it might be preferable for some folks.



And about the 2/38 tree structure; I am starting to avoid it
(for Xapian users, at least) but haven't performed the reindex
on all my servers, yet:

https://public-inbox.org/meta/20160805010300.7053-1-e%4080x24.org/

Of course, Xapian is only an optional dependency and won't
affect git internals.

I might end up having to nuke history occasionally anyways,
(in case somebody invokes the DMCA, or there's enough spam
 to warrant permanent deletion)

(Will try to digest the rest of your message later).

* Re: [ANNOUNCE] more archives of this list
  2016-08-05  9:28 ` Jeff King
  2016-08-05  9:35   ` Jeff King
@ 2016-08-05 15:04   ` Duy Nguyen
  2016-08-05 20:20     ` Jeff King
  1 sibling, 1 reply; 11+ messages in thread
From: Duy Nguyen @ 2016-08-05 15:04 UTC
  To: Jeff King; +Cc: Eric Wong, Git Mailing List

On Fri, Aug 5, 2016 at 11:28 AM, Jeff King <peff@peff.net> wrote:
> There was discussion a long time ago about storing a common zlib
> dictionary in the packfile and using it for all of the objects. I don't
> recall whether there were any patches, though. It does create some
> complications with serving clones/fetches, as they may ask for a subset
> of the objects (so you have to send them the whole dictionary, which may
> be a lot of overhead if you're only fetching a few objects).

I'm nit picking since it's not actually "all objects". But pack v4
patches have two dictionaries (or more? i don't remember) for commits
and trees :)
-- 
Duy

* Re: [ANNOUNCE] more archives of this list
  2016-08-05  9:35   ` Jeff King
  2016-08-05  9:59     ` Eric Wong
@ 2016-08-05 18:19     ` Eric Wong
  2016-08-05 20:22       ` Jeff King
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Wong @ 2016-08-05 18:19 UTC
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> wrote:
> On Fri, Aug 05, 2016 at 05:28:05AM -0400, Jeff King wrote:
> > I do find it visually a little harder to navigate through threads,
> > because there's not much styling there, and the messages seem to run
> > into one another. I don't know if a border around the divs or something
> > would help. I'm really terrible at that kind of visual design.
> 
> I took a peek at doing something simple like:
> 
>   pre {
>     border-style: solid;
>     border-width: 1px;
>     background: #dddddd;
>   }
> 
> but it looks like there's no HTML structure at all to the current
> output. It's just one big <pre> tag with various levels of indentation
> to represent the messages.

I added an <hr> between each message so the /T/ view ought to be more
readable:

https://public-inbox.org/meta/20160805181459.24420-1-e@80x24.org/

* Re: [ANNOUNCE] more archives of this list
  2016-08-05 15:04   ` Duy Nguyen
@ 2016-08-05 20:20     ` Jeff King
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff King @ 2016-08-05 20:20 UTC
  To: Duy Nguyen; +Cc: Eric Wong, Git Mailing List

On Fri, Aug 05, 2016 at 05:04:27PM +0200, Duy Nguyen wrote:

> On Fri, Aug 5, 2016 at 11:28 AM, Jeff King <peff@peff.net> wrote:
> > There was discussion a long time ago about storing a common zlib
> > dictionary in the packfile and using it for all of the objects. I don't
> > recall whether there were any patches, though. It does create some
> > complications with serving clones/fetches, as they may ask for a subset
> > of the objects (so you have to send them the whole dictionary, which may
> > be a lot of overhead if you're only fetching a few objects).
> 
> I'm nit picking since it's not actually "all objects". But pack v4
> patches have two dictionaries (or more? i don't remember) for commits
> and trees :)

I couldn't remember if that zlib stuff was part of packv4 or not. I like
many of the ideas in pack v4, but I do worry a lot about the
compatibility issues, as packv2 is the on-the-wire format.

Being able to send bytes directly off disk with minimal processing is a
key thing that makes running a large git hosting site practical. One of
the things that makes me nervous is having to do on-the-fly conversion when
serving fetches and clones (but to be clear that is just gut
nervousness; I haven't done any actual testing with the proto-packv4
patches).

-Peff

* Re: [ANNOUNCE] more archives of this list
  2016-08-05 18:19     ` Eric Wong
@ 2016-08-05 20:22       ` Jeff King
  2016-08-06 23:41         ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2016-08-05 20:22 UTC
  To: Eric Wong; +Cc: git

On Fri, Aug 05, 2016 at 06:19:57PM +0000, Eric Wong wrote:

> Jeff King <peff@peff.net> wrote:
> > On Fri, Aug 05, 2016 at 05:28:05AM -0400, Jeff King wrote:
> > > I do find it visually a little harder to navigate through threads,
> > > because there's not much styling there, and the messages seem to run
> > > into one another. I don't know if a border around the divs or something
> > > would help. I'm really terrible at that kind of visual design.
> > 
> > I took a peek at doing something simple like:
> > 
> >   pre {
> >     border-style: solid;
> >     border-width: 1px;
> >     background: #dddddd;
> >   }
> > 
> > but it looks like there's no HTML structure at all to the current
> > output. It's just one big <pre> tag with various levels of indentation
> > to represent the messages.
> 
> I added an <hr> between each message so the /T/ view ought to be more
> readable:
> 
> https://public-inbox.org/meta/20160805181459.24420-1-e@80x24.org/

Thanks. That's definitely an improvement. I still think the styling
could go further, but I don't expect you to do it. It's something I may
look into, but I should probably try to clear out my backlog of
"to-review" patches before I go off spending time on it. :)

-Peff

* Re: [ANNOUNCE] more archives of this list
  2016-08-05 20:22       ` Jeff King
@ 2016-08-06 23:41         ` Eric Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2016-08-06 23:41 UTC
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> wrote:
> Thanks. That's definitely an improvement. I still think the styling
> could go further, but I don't expect you to do it. It's something I may
> look into, but I should probably try to clear out my backlog of
> "to-review" patches before I go off spending time on it. :)

Heh, and I've nearly been sidetracked into hacking w3m to support
iframes, persistent connections, color-mapping, etc, too :x
Right now, I'm just happy w3m supports piping buffers to
arbitrary programs (so one could run "git am" on the /raw
endpoint).

It'll be tough to beat a good local mail setup some of us have
with mutt/gnus.  Being able to mark messages as read/unread
would either be intrusive (cookies/tracking users) or
cause accessibility problems (frames), I think...

...which is why the mbox.gz endpoints exist in public-inbox :>

So I think I'll try to do POP3 support, first...
