linux-kernel.vger.kernel.org archive mirror
* [RFC] LKML Archive in Maildir Format
@ 2018-12-16 19:06 Joey Pabalinas
  2018-12-16 19:17 ` Joe Perches
  2018-12-16 19:46 ` Konstantin Ryabitsev
  0 siblings, 2 replies; 14+ messages in thread
From: Joey Pabalinas @ 2018-12-16 19:06 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: kernelnewbies, Linus Torvalds, Greg Kroah-Hartman, Joey Pabalinas

[-- Attachment #1: Type: text/plain, Size: 677 bytes --]

I spent a lot of time trying to find an LKML archive in Maildir format
that I could use for local searches with notmuch or something, but all
the links I was able to find were all dead.

I ended up just compiling one myself and I currently host it at:

https://alyptik.org/lkml.tar.xz

It's possible I'm the only weirdo who finds this kind of thing useful, but
I figured I should share it just in case I'm not.

It's about 1.1 million files. I was wondering if anyone had an idea of a
better way to host this? I've tried GitHub and GitLab, but they don't
appreciate repos with that many files, hah.

Open to suggestions, thanks!

-- 
Cheers,
Joey Pabalinas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:06 [RFC] LKML Archive in Maildir Format Joey Pabalinas
@ 2018-12-16 19:17 ` Joe Perches
  2018-12-16 19:21   ` Joey Pabalinas
  2018-12-16 19:46 ` Konstantin Ryabitsev
  1 sibling, 1 reply; 14+ messages in thread
From: Joe Perches @ 2018-12-16 19:17 UTC (permalink / raw)
  To: Joey Pabalinas, Linux Kernel Mailing List
  Cc: kernelnewbies, Linus Torvalds, Greg Kroah-Hartman

On Sun, 2018-12-16 at 09:06 -1000, Joey Pabalinas wrote:
> I spent a lot of time trying to find an LKML archive in Maildir format
> that I could use for local searches with notmuch or something, but all
> the links I was able to find were all dead.

You might instead use

https://www.kernel.org/lore.html
https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/




* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:17 ` Joe Perches
@ 2018-12-16 19:21   ` Joey Pabalinas
  2018-12-16 19:55     ` Konstantin Ryabitsev
  2018-12-18 20:26     ` Jasper Spaans
  0 siblings, 2 replies; 14+ messages in thread
From: Joey Pabalinas @ 2018-12-16 19:21 UTC (permalink / raw)
  To: Joe Perches
  Cc: Joey Pabalinas, Linux Kernel Mailing List, kernelnewbies,
	Linus Torvalds, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]

On Sun, Dec 16, 2018 at 11:17:34AM -0800, Joe Perches wrote:
> On Sun, 2018-12-16 at 09:06 -1000, Joey Pabalinas wrote:
> > I spent a lot of time trying to find an LKML archive in Maildir format
> > that I could use for local searches with notmuch or something, but all
> > the links I was able to find were all dead.
> 
> You might instead use
> 
> https://www.kernel.org/lore.html
> https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/

That was my first attempt, but the documentation for the public-inbox
format is sort of terrible, and after a few hours trying to convert it
to Maildir I just gave up.

I ended up just slowly scraping lkml.org for a couple weeks so I
wouldn't disrupt anything and it worked fairly well. Just looking for
advice on where to host this now so others might be able to use it.

-- 
Cheers,
Joey Pabalinas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:06 [RFC] LKML Archive in Maildir Format Joey Pabalinas
  2018-12-16 19:17 ` Joe Perches
@ 2018-12-16 19:46 ` Konstantin Ryabitsev
  2018-12-16 19:53   ` Joey Pabalinas
  1 sibling, 1 reply; 14+ messages in thread
From: Konstantin Ryabitsev @ 2018-12-16 19:46 UTC (permalink / raw)
  To: Joey Pabalinas, Linux Kernel Mailing List, kernelnewbies,
	Linus Torvalds, Greg Kroah-Hartman

On Sun, Dec 16, 2018 at 09:06:39AM -1000, Joey Pabalinas wrote:
> I spent a lot of time trying to find an LKML archive in Maildir format
> that I could use for local searches with notmuch or something, but all
> the links I was able to find were all dead.
> 
> I ended up just compiling one myself and I currently host it at:
> 
> https://alyptik.org/lkml.tar.xz

You seem to have duplicated a lot of effort that has already been done
to compile the archive on lore.kernel.org.

> It's possible I'm the only weirdo who finds this kind of thing useful, but
> I figured I should share it just in case I'm not.

The maildir format is kind of terrible for LKML, because having millions
of messages in a single directory is very hard on the underlying FS. If
you break it up into multiple folders, then it becomes difficult to
search. This is the main reason why we have chosen to go with the
public-inbox format, which solves both of these problems and allows for
a very efficient archive updating and replication using git.

> It's about 1.1 million files. I was wondering if anyone had an idea of a
> better way to host this? I've tried GitHub and GitLab, but they don't
> appreciate repos with that many files, hah.

Like I said, you seem to be going down the road we've already tried and
rejected. :)

-K


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:46 ` Konstantin Ryabitsev
@ 2018-12-16 19:53   ` Joey Pabalinas
  2019-01-04  1:35     ` Eric Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Joey Pabalinas @ 2018-12-16 19:53 UTC (permalink / raw)
  To: Joey Pabalinas, Linux Kernel Mailing List, kernelnewbies,
	Linus Torvalds, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 1967 bytes --]

On Sun, Dec 16, 2018 at 02:46:49PM -0500, Konstantin Ryabitsev wrote:
> On Sun, Dec 16, 2018 at 09:06:39AM -1000, Joey Pabalinas wrote:
> > I spent a lot of time trying to find an LKML archive in Maildir format
> > that I could use for local searches with notmuch or something, but all
> > the links I was able to find were all dead.
> > 
> > I ended up just compiling one myself and I currently host it at:
> > 
> > https://alyptik.org/lkml.tar.xz
> 
> You seem to have duplicated a lot of effort that has already been done
> to compile the archive on lore.kernel.org.

Absolutely correct, haha.

> 
> > It's possible I'm the only weirdo who finds this kind of thing useful, but
> > I figured I should share it just in case I'm not.
> 
> The maildir format is kind of terrible for LKML, because having millions
> of messages in a single directory is very hard on the underlying FS. If
> you break it up into multiple folders, then it becomes difficult to
> search. This is the main reason why we have chosen to go with the
> public-inbox format, which solves both of these problems and allows for
> a very efficient archive updating and replication using git.
> 
> > It's about 1.1 million files. I was wondering if anyone had an idea of a
> > better way to host this? I've tried GitHub and GitLab, but they don't
> > appreciate repos with that many files, hah.
> 
> Like I said, you seem to be going down the road we've already tried and
> rejected. :)

Yes, I had a strong suspicion I might be the only crazy person who prefers this
kind of format :)

My only comment on the public-inbox choice is that the documentation
is very sparse and erratic. Myself and a couple other people just
couldn't figure out how to convert that format to Maildir or some other
format you could feed into a reader like neomutt.

Do you have any advice on how to convert those public-inbox files
correctly?

-- 
Cheers,
Joey Pabalinas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:21   ` Joey Pabalinas
@ 2018-12-16 19:55     ` Konstantin Ryabitsev
  2018-12-16 21:55       ` Joey Pabalinas
  2018-12-18 20:26     ` Jasper Spaans
  1 sibling, 1 reply; 14+ messages in thread
From: Konstantin Ryabitsev @ 2018-12-16 19:55 UTC (permalink / raw)
  To: Joey Pabalinas, Joe Perches, Linux Kernel Mailing List,
	kernelnewbies, Linus Torvalds, Greg Kroah-Hartman

On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> That was my first attempt, but the documentation for the public-inbox
> format is sort of terrible, 

I'm surprised you think so, because it's basically a simple file called
"m" that is updated on each commit and contains the body of the
message.

> and after a few hours trying to convert it to Maildir I just gave up.

It's as easy as something like this:

for commit in $(git rev-list master); do
  git show "$commit:m" > "maildir/new/$commit"
done

You have to do it per each of the shards for the complete archive.
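
The same loop can be sketched in Python for anyone who would rather not
shell out per shard by hand. This is only a sketch under assumptions: the
function names are made up for illustration, the branch is taken to be
"master" as in the loop above, and a commit without an "m" file is assumed
to carry no message and is skipped.

```python
import mailbox
import subprocess


def iter_messages(git_dir, branch="master"):
    """Yield the raw bytes of the file "m" for each commit, oldest first."""
    revs = subprocess.run(
        ["git", "--git-dir", git_dir, "rev-list", "--reverse", branch],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    for rev in revs:
        shown = subprocess.run(
            ["git", "--git-dir", git_dir, "show", f"{rev}:m"],
            capture_output=True,
        )
        # Assumption: if "m" is missing in this commit, there is nothing to export.
        if shown.returncode == 0:
            yield shown.stdout


def export_to_maildir(messages, maildir_path):
    """Deliver each raw message into maildir_path/new and return the count."""
    # mailbox.Maildir creates cur/, new/ and tmp/; the parent directory must exist.
    box = mailbox.Maildir(maildir_path, create=True)
    count = 0
    for raw in messages:
        box.add(raw)
        count += 1
    return count
```

Running export_to_maildir(iter_messages(shard), "lkml-maildir") once per
shard repository (both paths are placeholders) would cover the complete
archive.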

-K


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:55     ` Konstantin Ryabitsev
@ 2018-12-16 21:55       ` Joey Pabalinas
  0 siblings, 0 replies; 14+ messages in thread
From: Joey Pabalinas @ 2018-12-16 21:55 UTC (permalink / raw)
  To: Joey Pabalinas, Joe Perches, Linux Kernel Mailing List,
	kernelnewbies, Linus Torvalds, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 924 bytes --]

On Sun, Dec 16, 2018 at 02:55:05PM -0500, Konstantin Ryabitsev wrote:
> On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> > That was my first attempt, but the documentation for the public-inbox
> > format is sort of terrible, 
> 
> I'm surprised you think so, because it's basically a simple file called
> "m" that is updated on each commit and contains the body of the
> message.
> 
> > and after a few hours trying to convert it to Maildir I just gave up.
> 
> It's as easy as something like this:
> 
> for commit in $(git rev-list master); do
>   git show "$commit:m" > "maildir/new/$commit"
> done
> 
> You have to do it per each of the shards for the complete archive.

Ah dang, I was trying to use stuff like ssoma to split it, no wonder it
didn't work.  Not sure why I didn't think to try any git commands...

Well, at least now I know, ha. Thanks!

-- 
Cheers,
Joey Pabalinas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:21   ` Joey Pabalinas
  2018-12-16 19:55     ` Konstantin Ryabitsev
@ 2018-12-18 20:26     ` Jasper Spaans
  2018-12-18 22:53       ` Joey Pabalinas
  1 sibling, 1 reply; 14+ messages in thread
From: Jasper Spaans @ 2018-12-18 20:26 UTC (permalink / raw)
  To: Joey Pabalinas, Joe Perches, Linux Kernel Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 1510 bytes --]

Hi Joey,

On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> > > I spent a lot of time trying to find an LKML archive in Maildir format
> > > that I could use for local searches with notmuch or something, but all
> > > the links I was able to find were all dead.
> > 
> > You might instead use
> > 
> > https://www.kernel.org/lore.html
> > https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/
> 
> That was my first attempt, but the documentation for the public-inbox
> format is sort of terrible, and after a few hours trying to convert it
> to Maildir I just gave up.
> 
> I ended up just slowly scraping lkml.org for a couple weeks so I
> wouldn't disrupt anything and it worked fairly well. Just looking for
> advice on where to host this now so others might be able to use it.

Now you've caught my attention; first of all, there are more than 3M
messages stored in the lkml.org database, so I guess you've missed some
messages or something is really broken.

Besides, unless you figured out how to get to the raw data, you've just
scraped a rendering which discards stuff like pgp signatures etc and has
very incomplete headers. Unless you don't care for those of course :)

Note that I've also been toying with the lore dataset, and wrote a tiny tool
to get Maildir-like data out of it; this code is a bit of a single-use-jig
so you'll need to do some coding if you really want to use it.  Attached
anyway.

All the best and enjoy,
Jasper

[-- Attachment #1.2: Pipfile --]
[-- Type: text/plain, Size: 168 bytes --]

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
gitpython = "*"
ipython = "*"

[dev-packages]

[requires]
python_version = "3.7"

[-- Attachment #1.3: test.py --]
[-- Type: text/x-python, Size: 1130 bytes --]

from email.parser import BytesParser
from email.message import EmailMessage
from email.policy import default

from git import Repo

our_last_id = '<dc4d502c-bc3c-46e3-a984-41271951a5f7@mellanox.com>'
#'<20180711142744.GN3593@linux.vnet.ibm.com>'


repo = Repo('/Users/spaans/xsrc/lkml/lkml/git/6.git')


commit = repo.commit("master")
counter = 5000
froms = set()
while True:
    tree = commit.tree
    blob = tree['m']
    data = blob.data_stream.read()

    msg = BytesParser(policy=default).parsebytes(data)

    msgid = msg['Message-ID']
    from_ = msg['From']
    froms.add(from_)
    print(msgid)

    #import pdb; pdb.set_trace()
    if len(froms) > 1000:
        print("HAVE LOTS OF FRIENDS NOW")
        break
    if msgid == our_last_id:
        print("LADIES & GENTLEMEN, WE'VE GOT HIM")
        break
    parents = commit.parents
    if len(parents) != 1:
        print("WUH")
        break
    else:
        commit = commit.parents[0]

    #with open("output/%04d.eml" % counter, "bw") as f:
    #    f.write(data)
    counter -= 1

import pprint
pprint.pprint(froms)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 1528 bytes --]


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-18 20:26     ` Jasper Spaans
@ 2018-12-18 22:53       ` Joey Pabalinas
  0 siblings, 0 replies; 14+ messages in thread
From: Joey Pabalinas @ 2018-12-18 22:53 UTC (permalink / raw)
  To: Jasper Spaans; +Cc: Joey Pabalinas, Joe Perches, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

On Tue, Dec 18, 2018 at 09:26:27PM +0100, Jasper Spaans wrote:
> Now you've caught my attention; first of all, there are more than 3M
> messages stored in the lkml.org database, so I guess you've missed some
> messages or something is really broken.
> 
> Besides, unless you figured out how to get to the raw data, you've just
> scraped a rendering which discards stuff like pgp signatures etc and has
> very incomplete headers. Unless you don't care for those of course :)
> 
> Note that I've also been toying with the lore dataset, and wrote a tiny tool
> to get Maildir-like data out of it; this code is a bit of a single-use-jig
> so you'll need to do some coding if you really want to use it.  Attached
> anyway.

Yeah, after looking closer at it last week, something here is very
weird. This is definitely far from complete.

When I have some free time I'm just going to give it another go with
the public-inbox conversion.

-- 
Cheers,
Joey Pabalinas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [RFC] LKML Archive in Maildir Format
  2018-12-16 19:53   ` Joey Pabalinas
@ 2019-01-04  1:35     ` Eric Wong
  2019-03-05 20:48       ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Wong @ 2019-01-04  1:35 UTC (permalink / raw)
  To: Joey Pabalinas
  Cc: linux-kernel, kernelnewbies, Linus Torvalds, Greg Kroah-Hartman

Joey Pabalinas <joeypabalinas@gmail.com> wrote:
> My only comment on the public-inbox choice is that the documentation
> is very sparse and erratic. Myself and a couple other people just
> couldn't figure out how to convert that format to Maildir or some other
> format you could feed into a reader like neomutt.

Sorry, I didn't notice this before.  I started making some attempts
at improving documentation (among other things, when time permits)
to public-inbox:

  https://public-inbox.org/meta/20190102083305.30473-1-e@80x24.org/

And without knowing anything about git or public-inbox, you can
get NNTP messages into Maildir or mboxrd pretty easily.  Nothing
new to learn :)

I wrote a one-off Ruby script years ago (before public-inbox) for
converting slrnspools to Maildir (sample slrnpull.conf below).
But yeah, I wouldn't recommend 3M+ messages in a Maildir...

==> slrnspool2maildir <==
#!/usr/bin/ruby
require 'socket'
require 'fileutils'
HOSTNAME = Socket.gethostname

usage = "Usage #$0 <spooldir> <maildir>"
spooldir = ARGV[0] or abort usage
maildir = ARGV[1] or abort usage

f = base = nil
nr = 0
%w(cur new tmp).each { |x| FileUtils.mkpath("#{maildir}/#{x}") }
Dir.glob("#{spooldir}/*").each do |src|
  File.file?(src) or next
  base = File.basename(src)
  dest = "#{maildir}/new/#{Time.now.to_i}_#{base}_0.#{HOSTNAME}:2,"
  begin
    File.link(src, dest)
  rescue Errno::EEXIST
    warn "#{dest} already exists"
    next
  end
  File.unlink(src)
end
__END__
==> slrnpull.conf <==
# group_name                         max        expire     headers_only
inbox.com.example.news.group.name    1000000000 1000000000 0
# usage: slrnpull -d $PWD -h news.example.com --no-post

# Wouldn't be hard to script something using Net::NNTP in Perl
# to write directly to Maildirs, either.


* Re: [RFC] LKML Archive in Maildir Format
  2019-01-04  1:35     ` Eric Wong
@ 2019-03-05 20:48       ` Bjorn Helgaas
  2019-03-05 23:26         ` Eric Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2019-03-05 20:48 UTC (permalink / raw)
  To: Eric Wong
  Cc: Joey Pabalinas, Linux Kernel Mailing List, kernelnewbies,
	Linus Torvalds, Greg Kroah-Hartman, Konstantin Ryabitsev,
	Eric Biederman, Jasper Spaans

OK, so I understand how to clone archives from lore.kernel.org and how
to convert a git archive to a maildir (thanks, Konstantin!)

What I *don't* understand is how to effectively read this locally.
Ideally I'd like to run mutt, possibly with notmuch for indexing.  But
a maildir with 3M files seems impractical.  I did actually try it
(without notmuch), but it takes mutt about 5 minutes to start up.  And
the maildir is about 23G, compared with 7.5G for the git archive.

Any pointers?  I guess there's no mutt backend that can read a
public-inbox archive directly?


* Re: [RFC] LKML Archive in Maildir Format
  2019-03-05 20:48       ` Bjorn Helgaas
@ 2019-03-05 23:26         ` Eric Wong
  2019-03-06 20:50           ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Wong @ 2019-03-05 23:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Joey Pabalinas, linux-kernel, kernelnewbies, Linus Torvalds,
	Greg Kroah-Hartman, Konstantin Ryabitsev, Eric Biederman,
	Jasper Spaans

Bjorn Helgaas <bhelgaas@google.com> wrote:
> OK, so I understand how to clone archives from lore.kernel.org and how
> to convert a git archive to a maildir (thanks, Konstantin!)
> 
> What I *don't* understand is how to effectively read this locally.
> Ideally I'd like to run mutt, possibly with notmuch for indexing.  But
> a maildir with 3M files seems impractical.  I did actually try it
> (without notmuch), but it takes mutt about 5 minutes to start up.  And
> the maildir is about 23G, compared with 7.5G for the git archive.

Right, relying on Maildir for long-term storage of giant archives
is not a usable solution with any general purpose FSes I know about.
git itself had the same problem with loose object scalability in
the old days and packs were invented as a result.

> Any pointers?  I guess there's no mutt backend that can read a
> public-inbox archive directly?

There are mutt patches to support reading over NNTP, so that
works:

	mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP

I don't think mutt handles mboxrd 100% correctly, but it's close
enough that you can download the gzipped mboxrd of a search
query and open it via "mutt -f /path/to/downloaded/mbox.gz"

  curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY&x=m"

POST is required(*), and -OJ lets it use the
Content-Disposition: header for a meaningful server-generated
name, but you can also redirect the result to whatever you want.
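
For readers that do not cope with mboxrd, the downloaded file can also be
split by hand. A rough Python sketch, assuming the usual mboxrd convention
that every "From " line at the start of a line is a message separator and
that body lines matching ">From ", ">>From " and so on carry one extra ">"
of quoting; the function names are hypothetical:

```python
import gzip
import io
import re

# One leading ">" is removed from lines like ">From " or ">>From ".
_QUOTED_FROM = re.compile(rb"^>(>*From )")


def split_mboxrd(raw):
    """Split mboxrd bytes into a list of raw messages, undoing From quoting."""
    messages = []
    current = None
    for line in io.BytesIO(raw):          # iterates on "\n" boundaries only
        if line.startswith(b"From "):
            # Separator line: close the previous message, start a new one.
            if current is not None:
                messages.append(b"".join(current))
            current = []
        elif current is not None:
            current.append(_QUOTED_FROM.sub(rb"\1", line, count=1))
    if current is not None:
        messages.append(b"".join(current))
    return messages


def read_mboxrd_gz(path):
    """Read a gzipped mboxrd file and return its messages as bytes."""
    with gzip.open(path, "rb") as f:
        return split_mboxrd(f.read())
```

The blank line that mboxrd places after each message is kept as part of
the message here; strip it if the consumer cares.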

For all messages since March 1, you could use:

	SEARCH_QUERY=d:20190301..

All the supported search queries are documented in
$INBOX_URL/_/text/help/ and the search prefixes (e.g. "d:",
"s:", "b:") are modeled after what's in mairix.  You'll need to
escape the queries for URIs (e.g. " " => "+", and so on).
Xapian requires date ranges to be denoted with ".." whereas
mairix uses "-" for ranges.
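
The URI escaping can be left to a library. A small Python sketch that
builds the same "?q=...&x=m" URL as the curl example; the function name is
made up, and urlencode escapes more than strictly necessary (":" becomes
"%3A"), which is assumed to be harmless since the server decodes it again:

```python
from urllib.parse import urlencode


def search_mbox_url(inbox_url, query):
    """Build the search URL for an inbox, requesting mbox output (x=m)."""
    # urlencode turns " " into "+" and percent-escapes reserved characters.
    params = urlencode({"q": query, "x": "m"})
    return f"{inbox_url.rstrip('/')}/?{params}"


print(search_mbox_url("https://lore.kernel.org/linux-pci/", "d:20190301.."))
# https://lore.kernel.org/linux-pci/?q=d%3A20190301..&x=m
```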


The main thing public-inbox search misses from mairix is support
for "-t" which grabs non-matching messages from the same thread.
I would like to support that someday, but don't have enough time
(or funding) to make it happen at the moment.


(*) to reliably avoid wasting resources from spiders/prefetchers


* Re: [RFC] LKML Archive in Maildir Format
  2019-03-05 23:26         ` Eric Wong
@ 2019-03-06 20:50           ` Bjorn Helgaas
  2019-03-07  3:44             ` Eric Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2019-03-06 20:50 UTC (permalink / raw)
  To: Eric Wong
  Cc: Joey Pabalinas, Linux Kernel Mailing List, kernelnewbies,
	Linus Torvalds, Greg Kroah-Hartman, Konstantin Ryabitsev,
	Eric Biederman, Jasper Spaans

On Tue, Mar 5, 2019 at 5:26 PM Eric Wong <e@80x24.org> wrote:
> Bjorn Helgaas <bhelgaas@google.com> wrote:

> > Any pointers?  I guess there's no mutt backend that can read a
> > public-inbox archive directly?
>
> There are mutt patches to support reading over NNTP, so that
> works:
>
>         mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP

Neomutt includes NNTP support, so I tried this:

  neomutt -f news://nntp.lore.kernel.org/org.kernel.vger.linux-kernel

which worked OK but (1) I only see the most recent 1000 messages and
(2) obviously isn't reading a *local* archive.  Neomutt took about 45
seconds to start up over my wimpy ISP.

I assume I could probably have a local archive and run a local NNTP
server and point neomutt at that local server.  But I don't know how
full-archive searching would work there.

> I don't think mutt handles mboxrd 100% correctly, but it's close
> enough that you can download the gzipped mboxrd of a search
> query and open it via "mutt -f /path/to/downloaded/mbox.gz"
>
>   curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY&x=m"

I got nothing at all with -XPOST, but this:

  curl -OJ "https://lore.kernel.org/linux-pci/?q=d:20190301..&x=m"

got me the HTML source.  Nothing that looks like mboxrd.  I assume
this is stupid user error on my part, but even with that resolved, it
wouldn't have the nice git fetch properties of the git archive, i.e.,
incremental updates of only new stuff, would it?

I think my ideal solution would be a mutt that could read the git
archive directly, plus a notmuch index.  But AFAIK, mutt can't do
that, and notmuch only works with one message per file, not with the
git archive.

Something that might work would be to use Konstantin's "git archive to
maildir" hint but shard into a bunch of smaller maildirs instead of
one big one, then have notmuch index those, and use mutt or vim with
notmuch queries instead of having it read in a maildir.
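
A sketch of what such sharding could look like in Python, splitting by the
month in each message's Date header. The shard naming ("YYYY-MM", plus
"undated" for missing or unparsable dates) and the function names are
assumptions for illustration, not something the archive tooling provides:

```python
import email
import mailbox
import os
from email.utils import parsedate_to_datetime


def shard_name(raw):
    """Return "YYYY-MM" from the Date header, or "undated" if unusable."""
    date_header = email.message_from_bytes(raw)["Date"]
    try:
        return parsedate_to_datetime(str(date_header or "")).strftime("%Y-%m")
    except (TypeError, ValueError):
        # Older and newer Python versions signal a bad date differently.
        return "undated"


def deliver_sharded(messages, root):
    """Deliver raw messages into root/<shard>/new; return counts per shard."""
    os.makedirs(root, exist_ok=True)   # Maildir only creates its own directory
    boxes = {}
    counts = {}
    for raw in messages:
        name = shard_name(raw)
        if name not in boxes:
            boxes[name] = mailbox.Maildir(os.path.join(root, name), create=True)
        boxes[name].add(raw)
        counts[name] = counts.get(name, 0) + 1
    return counts
```

This keeps each directory small, but the total number of files an indexer
has to track stays the same.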

But I feel like I must be missing the solution that's obvious to
everybody but me.

Bjorn


* Re: [RFC] LKML Archive in Maildir Format
  2019-03-06 20:50           ` Bjorn Helgaas
@ 2019-03-07  3:44             ` Eric Wong
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Wong @ 2019-03-07  3:44 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Joey Pabalinas, linux-kernel, kernelnewbies, Linus Torvalds,
	Greg Kroah-Hartman, Konstantin Ryabitsev, Eric Biederman,
	Jasper Spaans

Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Mar 5, 2019 at 5:26 PM Eric Wong <e@80x24.org> wrote:
> > Bjorn Helgaas <bhelgaas@google.com> wrote:
> 
> > > Any pointers?  I guess there's no mutt backend that can read a
> > > public-inbox archive directly?
> >
> > There are mutt patches to support reading over NNTP, so that
> > works:
> >
> >         mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP
> 
> Neomutt includes NNTP support, so I tried this:
> 
>   neomutt -f news://nntp.lore.kernel.org/org.kernel.vger.linux-kernel
> 
> which worked OK but (1) I only see the most recent 1000 messages and
> (2) obviously isn't reading a *local* archive.  Neomutt took about 45
> seconds to start up over my wimpy ISP.
> 
> I assume I could probably have a local archive and run a local NNTP
> server and point neomutt at that local server.  But I don't know how
> full-archive searching would work there.

Right.  AFAIK there isn't a good solution for search via NNTP.

> > I don't think mutt handles mboxrd 100% correctly, but it's close
> > enough that you can download the gzipped mboxrd of a search
> > query and open it via "mutt -f /path/to/downloaded/mbox.gz"
> >
> >   curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY&x=m"
> 
> I got nothing at all with -XPOST, but this:

Ah, I guess nginx (or something in AWS) rejects POST without
Content-Length headers.  Adding "-HContent-Length:0"
to the command-line with -XPOST works for lore.

>   curl -OJ "https://lore.kernel.org/linux-pci/?q=d:20190301..&x=m"
> 
> got me the HTML source.  Nothing that looks like mboxrd.  I assume

Right.  The "x=m" requests an mbox; but it's only available via
POST requests (to prevent search engine spiders from wasting
time on non-HTML content).  With the HTML output in a browser,
the "mbox.gz" button makes the POST request and allows you to
download the mbox.
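
The same POST can be made from Python's standard library. A minimal
sketch, assuming an empty request body with an explicit Content-Length of
0 is what the server wants (as with the curl flag above); the function
names and the destination path are hypothetical:

```python
import shutil
import urllib.request


def mbox_request(search_url):
    """Build a POST request with an empty body and Content-Length: 0."""
    return urllib.request.Request(
        search_url,
        data=b"",
        headers={"Content-Length": "0"},
        method="POST",
    )


def download_mbox(search_url, dest_path):
    """Send the POST and save the (gzipped) mbox response to dest_path."""
    with urllib.request.urlopen(mbox_request(search_url)) as resp, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
```

Unlike curl -OJ, this ignores the Content-Disposition name and writes to
whatever path the caller picks.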

> this is stupid user error on my part, but even with that resolved, it
> wouldn't have the nice git fetch properties of the git archive, i.e.,
> incremental updates of only new stuff, would it?

You could bump d:YYYYMMDD (there's also "dt:" for date-time if
you need more precision).
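
One way to bump the date automatically is to remember when the last fetch
happened. A Python sketch of that bookkeeping, using only the "d:" prefix;
the state file and function names are assumptions for illustration:

```python
import datetime
from pathlib import Path


def since_query(day):
    """Format an open-ended date range query, e.g. d:20190301.."""
    return f"d:{day:%Y%m%d}.."


def next_query(state_file, today):
    """Return a query covering everything since the last recorded run.

    The previous run's date is read from state_file (ISO format) and then
    replaced with today's date. On the first run the range starts today.
    """
    state = Path(state_file)
    if state.exists():
        last = datetime.date.fromisoformat(state.read_text().strip())
    else:
        last = today
    state.write_text(today.isoformat() + "\n")
    return since_query(last)
```

Because the range starts on the day of the previous run, messages from
that day come back twice, so the consumer would need to drop duplicates,
for example by Message-ID.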

> I think my ideal solution would be a mutt that could read the git
> archive directly, plus a notmuch index.  But AFAIK, mutt can't do
> that, and notmuch only works with one message per file, not with the
> git archive.
> 
> Something that might work would be to use Konstantin's "git archive to
> maildir" hint but shard into a bunch of smaller maildirs instead of
> one big one, then have notmuch index those, and use mutt or vim with
> notmuch queries instead of having it read in a maildir.

Small Maildirs work great, but large ones fall over.  I don't
think having a bunch of smaller Maildirs would help notmuch
since notmuch still needs to know each file path.

The only way I could see notmuch/Maildir working well is to keep
the overall number of messages relatively small.

One of my longer-term goals is to write a mairix-like tool in
Perl which works with public-inbox archives; but I barely have
enough time for public-inbox these days :<

mairix works with gzipped mboxes, which is great for large
archives; but the indexing falls over since it rewrites the
entire search index every time. SSDs have died as a result :<

> But I feel like I must be missing the solution that's obvious to
> everybody but me.

Nope, you're not alone :)  There's not a lot of mail software
which can handle LKML-sized histories efficiently.



Thread overview: 14+ messages
2018-12-16 19:06 [RFC] LKML Archive in Maildir Format Joey Pabalinas
2018-12-16 19:17 ` Joe Perches
2018-12-16 19:21   ` Joey Pabalinas
2018-12-16 19:55     ` Konstantin Ryabitsev
2018-12-16 21:55       ` Joey Pabalinas
2018-12-18 20:26     ` Jasper Spaans
2018-12-18 22:53       ` Joey Pabalinas
2018-12-16 19:46 ` Konstantin Ryabitsev
2018-12-16 19:53   ` Joey Pabalinas
2019-01-04  1:35     ` Eric Wong
2019-03-05 20:48       ` Bjorn Helgaas
2019-03-05 23:26         ` Eric Wong
2019-03-06 20:50           ` Bjorn Helgaas
2019-03-07  3:44             ` Eric Wong
