From: Jasper Spaans <j@jasper.es>
To: Joey Pabalinas <joeypabalinas@gmail.com>,
	Joe Perches <joe@perches.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] LKML Archive in Maildir Format
Date: Tue, 18 Dec 2018 21:26:27 +0100
Message-ID: <20181218202627.j6d2jgxercylclpc@jasper.es>
In-Reply-To: <20181216192135.hc7gykmwkfgil2j5@gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 1510 bytes --]

Hi Joey,

On Sun, Dec 16, 2018 at 09:21:35AM -1000, Joey Pabalinas wrote:
> > > I spent a lot of time trying to find an LKML archive in Maildir format
> > > that I could use for local searches with notmuch or something, but all
> > > the links I was able to find were dead.
> > 
> > You might instead use
> > 
> > https://www.kernel.org/lore.html
> > https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/git.git/
> 
> That was my first attempt, but the documentation for the public-inbox
> format is sort of terrible, and after a few hours trying to convert it
> to Maildir I just gave up.
> 
> I ended up just slowly scraping lkml.org for a couple weeks so I
> wouldn't disrupt anything and it worked fairly well. Just looking for
> advice on where to host this now so others might be able to use it.

Now you've caught my attention; first of all, there are more than 3M
messages stored in the lkml.org database, so I guess you've missed some
messages or something is really broken.

Besides, unless you figured out how to get to the raw data, you've just
scraped a rendering which discards stuff like pgp signatures etc and has
very incomplete headers. Unless you don't care for those of course :)

Note that I've also been toying with the lore dataset, and wrote a tiny tool
to get Maildir-like data out of it; this code is a bit of a single-use-jig
so you'll need to do some coding if you really want to use it.  Attached
anyway.

All the best and enjoy,
Jasper

[-- Attachment #1.2: Pipfile --]
[-- Type: text/plain, Size: 168 bytes --]

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
gitpython = "*"
ipython = "*"

[dev-packages]

[requires]
python_version = "3.7"

[-- Attachment #1.3: test.py --]
[-- Type: text/x-python, Size: 1130 bytes --]

import pprint
from email.parser import BytesParser
from email.policy import default

from git import Repo

# Stop once this Message-ID is reached (the last message we already have).
our_last_id = '<dc4d502c-bc3c-46e3-a984-41271951a5f7@mellanox.com>'
#'<20180711142744.GN3593@linux.vnet.ibm.com>'

# One epoch of the public-inbox git archive; adjust the path to your clone.
repo = Repo('/Users/spaans/xsrc/lkml/lkml/git/6.git')

# Walk the history from the newest commit backwards; each commit is
# expected to hold one message, stored as the blob named 'm'.
commit = repo.commit("master")
counter = 5000  # sequence number for the (commented-out) output files
froms = set()
while True:
    tree = commit.tree
    blob = tree['m']
    data = blob.data_stream.read()

    msg = BytesParser(policy=default).parsebytes(data)

    msgid = msg['Message-ID']
    from_ = msg['From']
    froms.add(from_)
    print(msgid)

    #import pdb; pdb.set_trace()
    if len(froms) > 1000:
        print("HAVE LOTS OF FRIENDS NOW")
        break
    if msgid == our_last_id:
        print("LADIES & GENTLEMEN, WE'VE GOT HIM")
        break
    # Linear history is expected: zero parents means the root commit,
    # more than one means a merge; stop in either case.
    parents = commit.parents
    if len(parents) != 1:
        print("WUH")
        break
    commit = parents[0]

    #with open("output/%04d.eml" % counter, "bw") as f:
    #    f.write(data)
    counter -= 1

pprint.pprint(froms)
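
For anyone who wants an actual Maildir rather than the commented-out
.eml files, here is a rough sketch of how the same walk could feed
Python's standard mailbox module. This is not part of the attached
jig: the function names iter_public_inbox and write_maildir are made
up for illustration, and it assumes (as test.py does) that each commit
in the archive stores one message as a blob named 'm'. The raw bytes
are stored as-is, so headers and PGP signatures should survive.

```python
import mailbox
from email import policy
from email.parser import BytesParser


def iter_public_inbox(repo_path, rev="master"):
    """Yield raw message bytes, newest first, from one archive repository.

    Assumption: every message lives in a blob named 'm' in its commit's
    tree; commits without such a blob are skipped.
    """
    from git import Repo  # GitPython, imported lazily; see the Pipfile

    repo = Repo(repo_path)
    for commit in repo.iter_commits(rev):
        try:
            blob = commit.tree["m"]
        except KeyError:
            continue  # no message blob in this commit
        yield blob.data_stream.read()


def write_maildir(raw_messages, maildir_path):
    """Add raw messages (bytes) to a Maildir, skipping repeated Message-IDs.

    Returns the number of messages actually written.
    """
    box = mailbox.Maildir(str(maildir_path), create=True)
    parser = BytesParser(policy=policy.default)
    seen = set()
    added = 0
    for raw in raw_messages:
        # Only the headers are parsed; the bytes written are the original.
        msgid = parser.parsebytes(raw, headersonly=True)["Message-ID"]
        if msgid is not None:
            key = str(msgid).strip()
            if key in seen:
                continue
            seen.add(key)
        box.add(raw)
        added += 1
    return added
```

Usage would be something like
write_maildir(iter_public_inbox('/path/to/6.git'), 'lkml-maildir'),
repeated for each repository in the archive. The duplicate check only
covers a single run; a real tool would also need to remember what is
already in the Maildir.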

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 1528 bytes --]
