RFC: individual public-inbox/git activity feeds

From: Konstantin Ryabitsev @ 2019-10-10 19:28 UTC
  To: workflows

Hi, all:

The idea of using public-inbox repositories as individual feeds has been 
mentioned here a couple of times already, and I would like to propose a 
tentative approach that could work without needing to involve SSB or 
other protocols.

# What are public-inbox repos?

Public-inbox (v2) uses git to archive mail messages, with the following 
general structure:

topdir/
  0.git/
  1.git/
  ...

Each of these git repositories has a single ref, master, with a single 
file "m" containing the entire raw message (headers and body), e.g.:
  - https://erol.kernel.org/workflows/git/0/tree/m

Each incoming message overwrites this file and creates a new commit, 
e.g.:
  - https://erol.kernel.org/workflows/git/0/log/m

This has the following upsides:

  - with a single file, git commit operations are very fast
  - git performance remains pretty much unaffected as the repository 
    grows, since there aren't more and more objects to hash (the main 
    downside of public-inbox v1)
  - it is easy to get the contents of any message by simply performing 
    `git show <commit hash>:m`, which is a very fast operation even for 
    very old messages in the archive
  - most language environments have decent git libraries, so writing 
    tooling around git repositories is easy
  - git is really good at replicating itself, especially with a single 
    ref
  - git supports commit signing, so all commits can have cryptographic 
    attestation if the tools are configured to do that
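
As a minimal sketch of the `git show <commit hash>:m` lookup mentioned 
above, assuming Python with only the standard library and a git binary 
in PATH. The helper names and the repository path are made up for 
illustration and are not part of public-inbox:

```python
import subprocess
from email import message_from_bytes
from email.message import Message
from typing import List


def show_args(repo: str, commit: str) -> List[str]:
    """Build the `git show <commit>:m` command for one epoch repository."""
    return ["git", "--git-dir", repo, "show", f"{commit}:m"]


def read_message(repo: str, commit: str) -> Message:
    """Return the parsed message stored in file "m" at the given commit."""
    raw = subprocess.run(show_args(repo, commit), check=True,
                         capture_output=True).stdout
    return message_from_bytes(raw)
```

With a real epoch checkout, read_message("topdir/0.git", commit)["Subject"] 
would give the subject of the message archived in that commit.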

There are a few downsides to this, too:

  - git maintenance tools like git-repack don't expect that repository 
    contents are going to be 90%-100% rewritten with every new commit, 
    so by default they will try to perform many rather useless 
    optimizations, looking for non-existent deltas (but this can be 
    tweaked in config files)
  - most useful operations require maintaining auxiliary databases, e.g.  
    for message-id to commit-id mapping -- so repositories need to be 
    indexed using public-inbox-index in order to be useful for more than 
    just archival and replication. For huge repositories like LKML, the 
    initial indexing takes a long time, though subsequent 
    public-inbox-index calls after each `git remote update` are pretty 
    quick.
  - there is only rudimentary sharding into epochs, which makes partial 
    replication tricky (e.g. "replicate just the archives from last 
    October")
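
To illustrate why an auxiliary index is needed, here is a rough sketch 
of a message-id to commit-id mapping built by walking one epoch. This 
is NOT how public-inbox-index works internally; it is a toy in-memory 
dict, assuming Python, a git binary in PATH, and hypothetical helper 
names:

```python
import subprocess
from email import message_from_bytes
from typing import Dict, Iterable, Iterator, Tuple


def iter_commits(repo: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (commit id, raw message) pairs from master, oldest first."""
    revs = subprocess.run(
        ["git", "--git-dir", repo, "rev-list", "--reverse", "master"],
        check=True, capture_output=True, text=True).stdout.split()
    for rev in revs:
        blob = subprocess.run(
            ["git", "--git-dir", repo, "show", f"{rev}:m"],
            capture_output=True)
        if blob.returncode == 0:  # skip commits that have no "m" file
            yield rev, blob.stdout


def build_index(commits: Iterable[Tuple[str, bytes]]) -> Dict[str, str]:
    """Map each Message-ID (without angle brackets) to its commit id."""
    index: Dict[str, str] = {}
    for commit, raw in commits:
        msgid = message_from_bytes(raw).get("Message-ID")
        if msgid:
            index[msgid.strip().strip("<>")] = commit
    return index
```

Calling build_index(iter_commits("topdir/0.git")) spawns one git 
process per commit, which hints at why the initial indexing of a huge 
archive is slow and why a persistent index is worth maintaining.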

# Public-inbox repositories are feeds

Each public-inbox repository is therefore a consecutive feed of messages 
in the same sense that an SSB or NNTP feed is (for this reason, there's 
robust NNTP support in public-inbox). Public-inbox feeds are:

  - distributed
  - immutable (or tamper-evident once replicated, which is effectively 
    the same as immutable if git is configured to reject non-ff updates)
  - cryptographically attestable, if commit signing is used
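
One way to get the "reject non-ff updates" behaviour on a receiving 
repository is via git's receive.denyNonFastForwards and 
receive.denyDeletes settings. A small sketch, assuming Python and a 
hypothetical helper name; the function only builds the commands so 
they can be inspected before being run:

```python
from typing import List


def lock_down_args(repo: str) -> List[List[str]]:
    """Build git config commands that refuse history rewrites on push."""
    base = ["git", "--git-dir", repo, "config"]
    return [
        base + ["receive.denyNonFastForwards", "true"],  # no forced pushes
        base + ["receive.denyDeletes", "true"],          # no ref deletion
    ]
```

Each returned list can be passed to subprocess.run(args, check=True).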

# Individual developer feeds

Individual developers can begin providing their own public-inbox feeds.
At the start, they can act as a sort of "public sent-mail folder" -- a 
simple tool would monitor the local/IMAP "sent" folder and add any new 
mail it finds (sent to specific mailing lists) to the developer's local 
public-inbox instance. Every commit will be automatically signed and 
pushed out to a public remote.
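
The filtering half of such a tool could look roughly like the sketch 
below. It assumes Python's standard email package; the function names 
and the list address used in the usage note are illustrative only, and 
the actual import into public-inbox, signing and pushing are left out:

```python
from email.message import Message
from email.utils import getaddresses
from typing import Iterable, Iterator, Set


def sent_to_lists(msg: Message, lists: Set[str]) -> bool:
    """True if any To/Cc address is one of the watched mailing lists."""
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    addrs = {addr.lower() for _name, addr in getaddresses(headers)}
    return bool(addrs & lists)


def new_list_mail(folder: Iterable[Message], lists: Set[str],
                  seen: Set[str]) -> Iterator[Message]:
    """Yield not-yet-seen messages from the folder that went to a list.

    `seen` holds Message-IDs that were already added to the feed and is
    updated in place, so repeated runs do not add duplicates.
    """
    for msg in folder:
        msgid = msg.get("Message-ID", "")
        if msgid and msgid not in seen and sent_to_lists(msg, lists):
            seen.add(msgid)
            yield msg
```

Any iterable of messages works as `folder`, for example a 
mailbox.Maildir opened on the local "sent" folder, with `lists` set to 
something like {"workflows@vger.kernel.org"}.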

On the kernel.org side, we can collate these individual feeds and mirror 
them into an aggregated feeds repository, with a ref per individual 
developer, like so:

refs/feeds/gregkh/0/master
refs/feeds/davem/0/master
refs/feeds/davem/1/master
...
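
A sketch of how the aggregator could map each developer's epoch onto 
that ref layout, assuming Python and a git binary in PATH. The 
function names, the aggregate repository path and the feed URL are 
placeholders, not an existing kernel.org tool:

```python
from typing import List


def feed_refspec(developer: str, epoch: int) -> str:
    """Map an epoch's master onto refs/feeds/<developer>/<epoch>/master.

    There is deliberately no leading "+", so git fetch refuses
    non-fast-forward updates and a rewritten feed is noticed.
    """
    return f"refs/heads/master:refs/feeds/{developer}/{epoch}/master"


def fetch_args(aggregate_repo: str, url: str, developer: str,
               epoch: int) -> List[str]:
    """Build the git fetch command that mirrors one epoch of one feed."""
    return ["git", "--git-dir", aggregate_repo, "fetch", url,
            feed_refspec(developer, epoch)]
```

For instance, feed_refspec("davem", 1) gives 
"refs/heads/master:refs/feeds/davem/1/master".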

Already, this gives us the following perks:

  - cryptographic attestation
  - patches that are guaranteed against mangling by MTA software
  - guaranteed spam-free message delivery from all the important people
  - permanent, attestable and distributable archive

(With time, we can teach kernel.org to act as an MTA bridge that sends 
actual mail to the mailing lists after we receive individual feed 
updates.)

# Using public-inbox with structured data

One of the problems we are trying to solve is how to deliver structured 
data like CI reports, bugs, issues, etc. in a decentralized fashion.  
Instead of (or in addition to) sending mail to mailing lists and 
individual developers, bots and bug-tracking tools can provide their own 
feeds with structured data aimed at consumption by client-side and 
server-side tools.

I suggest we use public-inbox feeds with structured data in addition to 
human-readable data, using some universally adopted machine-parseable
format like JSON. In my mind, I see this working as a separate ref in 
each individual feed, e.g.:

refs/heads/master -- RFC-2822 (email) feed for human consumption
refs/heads/json -- JSON feed for machine-readable structured data

E.g. syzbot could publish a human-readable message in master:

----
From: syzbot
To: [list of addressees here]
Subject: BUG: bad usercopy in read_rio
Date: Wed, 09 Oct 2019 09:09:06 -0700

Hello,

syzbot found the following crash on:

HEAD commit:    58d5f26a usb-fuzzer: main usb gadget fuzzer driver
git tree:       https://github.com/google/kasan.git usb-fuzzer
console output: https://syzkaller.appspot.com/x/log.txt?x=149329b3600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58
dashboard link: https://syzkaller.appspot.com/bug?extid=43e923a8937c203e9954
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

...
----

The same data, including all the relevant info provided via the 
syzkaller.appspot.com links, would be included in the corresponding 
commit in refs/heads/json, allowing client-side tools to present it to 
the developer without requiring that they view it on the internet (or 
it could simply be included for archival purposes).
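
To make this concrete, here is a rough sketch of deriving a JSON 
record from a report like the one above, assuming Python's standard 
library. The record layout ("from", "subject", "date", "fields") and 
the function name are invented for illustration; no schema has been 
agreed on, and a real bot would export its data directly rather than 
scrape its own email:

```python
import json
import re
from email import message_from_string

# Matches body lines such as "HEAD commit:    58d5f26a ..." and skips
# lines that merely end in a colon.
FIELD = re.compile(r"^([A-Za-z][A-Za-z ]*):\s+(\S.*)$")


def structured_record(raw: str) -> str:
    """Turn a plain-text report into a JSON document (hypothetical layout)."""
    msg = message_from_string(raw)
    fields = {}
    for line in msg.get_payload().splitlines():
        match = FIELD.match(line)
        if match:
            key = match.group(1).lower().replace(" ", "_")
            fields[key] = match.group(2).strip()
    record = {
        "from": msg["From"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "fields": fields,
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

The returned string is what would be committed to refs/heads/json 
alongside the human-readable message in master.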

The same approach can be used by bugzilla and any other bug-tracking 
software -- a human-readable commit in master, plus a corresponding 
machine-formatted commit in refs/heads/json. Minor record changes that 
aren't intended for humans can omit the commit in master (to avoid
the usual noise of "so-and-so started following this bug" messages). All 
commits would be cryptographically signed and fully attestable.

All these feeds can be aggregated centrally by entities like kernel.org 
for ease of discovery and replication, though this process would be 
human-administered and not automatic.

# Where this falls short

This is an archival solution first and foremost and not a true 
distributed, decentralized communication fabric. It solves the following 
problems:

  - it gets us cryptographically attestable feeds from important people 
    with little effort on their part (after initial setup)
  - it allows centralized tools (bots, forges, bug trackers, CI) to 
    export internal data so it can be preserved for future reference or 
    consumed directly by client-side tools -- though it obviously 
    requires that vendors jump on this bandwagon and don't simply ignore 
    it
  - it uses existing technologies that are known to work well together
    (public-inbox, git) and doesn't require that we adopt any nascent 
    technologies like SSB that are still in early stages of development 
    and haven't yet had time to mature

What this doesn't fix:

  - we still continue to largely rely on email and mailing lists, though 
    theoretically their use would become less important as more 
    developer feeds are aggregated and maintainer tools start to rely on 
    those as their primary source of truth. We can easily see a future 
    where vger.kernel.org just writes to public-inbox archives and 
    leaves mail delivery and subscription management up to someone else.
  - we still need aggregation authorities like kernel.org -- though we 
    can hedge this by having multiple mirrors and publishing a manifest 
    of feeds that can be pulled individually if needed
  - this doesn't really get us built-in encrypted communication between 
    developers, though we can think of some clever solutions, such as
    keypairs per incident that are initially only distributed to members 
    of security@kernel.org and then disclosed publicly after embargo is 
    lifted, allowing anyone interested to go back and read the encrypted 
    discussion for the purpose of full transparency.

The main upside of this approach is that it's evolutionary and not 
revolutionary and we can start implementing it right away, using it to 
augment and improve mailing lists instead of replacing them outright.

-K
