From: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
To: workflows@vger.kernel.org
Subject: RFC: individual public-inbox/git activity feeds
Date: Thu, 10 Oct 2019 15:28:52 -0400
Message-ID: <20191010192852.wl622ijvyy6i6tiu@chatter.i7.local>
Hi, all:
The idea of using public-inbox repositories as individual feeds has been
mentioned here a couple of times already, and I would like to propose a
tentative approach that could work without needing to involve SSB or
other protocols.
# What are public-inbox repos?
Public-inbox (v2) uses git to archive mail messages, with the following
general structure:
  topdir/
    0.git/
    1.git/
    ...
Each of these git repositories has a single ref, master, with a single
file "m" containing the entire body of the message, e.g.:
- https://lore.kernel.org/workflows/git/0/tree/m
Each incoming message overwrites this file and creates a new commit,
e.g.:
- https://lore.kernel.org/workflows/git/0/log/m
This has the following upsides:
- with a single file, git commit operations are very fast
- git performance remains pretty much unaffected as the repository
grows, since the number of objects to hash does not keep growing (the
main downside of public-inbox v1).
- it is easy to get the contents of any message by simply performing
`git show <commit hash>:m`, which is a very fast operation even for
very old messages in the archive
- most language environments have decent git libraries, so writing
tooling around git repositories is easy
- git is really good at replicating itself, especially with a single
ref
- git supports commit signing, so all commits can have cryptographic
attestation if the tools are configured to do that
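
To make the `git show <commit hash>:m` point concrete, here is a minimal
Python sketch of pulling one message out of an epoch repository and
parsing it. The function names and the repository path are made up for
illustration; the only git invocation used is the `git show` form
mentioned above.

```python
import email
import subprocess


def show_message_argv(repo, commit):
    """Build the argv for `git show <commit>:m` against one epoch repo."""
    return ["git", f"--git-dir={repo}", "show", f"{commit}:m"]


def read_message(repo, commit):
    """Run git and parse the "m" blob at that commit as an email message."""
    result = subprocess.run(
        show_message_argv(repo, commit), check=True, capture_output=True
    )
    return email.message_from_bytes(result.stdout)
```

Something like `read_message("topdir/0.git", commit)["Subject"]` would
then return the subject of the message stored at that commit, assuming
the repository exists locally.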
There are a few downsides to this, too:
- git maintenance tools like git-repack don't expect that repository
contents are going to be 90%-100% rewritten with every new commit,
so by default they will try to perform many rather useless
optimizations looking for non-existent deltas (but this can be
tweaked in config files)
- most useful operations require maintaining auxiliary databases, e.g.
for message-id to commit-id mapping -- so repositories need to be
indexed using public-inbox-index in order to be useful for more than
just archival and replication. For huge repositories like LKML, the
initial indexing takes a long time, though subsequent
public-inbox-index calls after each `git remote update` are pretty
quick.
- there is only rudimentary sharding into epochs, which makes partial
replication tricky (e.g. "replicate just the archives from last
October")
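
As a rough illustration of the message-id to commit-id mapping mentioned
above, the following Python sketch builds such a map in memory. This is
not how public-inbox-index works internally; it is a toy stand-in, and
it assumes the caller has already walked the history and collected
(commit hash, raw message) pairs.

```python
import email


def build_msgid_index(commits):
    """Map Message-ID -> commit hash from (commit_hash, raw_bytes) pairs."""
    index = {}
    for commit, raw in commits:
        msg = email.message_from_bytes(raw)
        msgid = msg["Message-ID"]
        if msgid is None:
            # Skip anything without a Message-ID header.
            continue
        # Store the id without surrounding whitespace and angle brackets;
        # the first commit seen for a given id wins.
        index.setdefault(msgid.strip().strip("<>"), commit)
    return index
```

A real index would need to be persistent and incremental, which is why
the initial pass over a huge archive is the expensive part.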
# Public-inbox repositories are feeds
Each public-inbox repository is therefore a consecutive feed of messages
in the same sense that something like SSB or NNTP is (for this reason,
there's robust NNTP support in public-inbox). Public-inbox feeds are:
- distributed
- immutable (or tamper-evident once replicated, which is effectively
the same as immutable if git is configured to reject non-ff updates)
- cryptographically attestable, if commit signing is used
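
The "reject non-ff updates" condition can be sketched in a few lines of
Python. This is a toy model, assuming a strictly linear history
represented as a dict from commit id to parent id; on a real server the
equivalent check is done by git itself (e.g. with the
receive.denyNonFastForwards setting).

```python
def is_fast_forward(parent_of, old_tip, new_tip):
    """Return True if old_tip is reachable from new_tip via parent links.

    parent_of maps commit id -> parent id (None for the root commit).
    Assumes a linear, single-parent history, as in a feed.
    """
    commit = new_tip
    while commit is not None:
        if commit == old_tip:
            return True
        commit = parent_of.get(commit)
    return False
```

An update that only appends messages keeps the old tip in the ancestry
of the new one; a rewritten feed does not, so a replica can detect and
refuse it.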
# Individual developer feeds
Individual developers can begin providing their own public-inbox feeds.
At the start, they can act as a sort of "public sent-mail folder" -- a
simple tool would monitor the local/IMAP "sent" folder and add any new
mail it finds (sent to specific mailing lists) to the developer's local
public-inbox instance. Every commit will be automatically signed and
pushed out to a public remote.
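
The filtering step of such a tool might look like the Python sketch
below. It only decides whether a sent message was addressed to one of
the lists of interest; monitoring the folder, committing, signing and
pushing are left out. The function name and the idea of matching on
To/Cc only are assumptions made for this example.

```python
import email
from email.utils import getaddresses


def sent_to_lists(raw, lists):
    """Return True if the message has any of `lists` in its To or Cc."""
    msg = email.message_from_bytes(raw)
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _name, addr in getaddresses(fields)}
    wanted = {addr.lower() for addr in lists}
    return bool(recipients & wanted)
```

Only messages for which this returns True would be added to the
developer's local public-inbox instance.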
On the kernel.org side, we can collate these individual feeds and mirror
them into an aggregated feeds repository, with a ref per individual
developer, like so:
refs/feeds/gregkh/0/master
refs/feeds/davem/0/master
refs/feeds/davem/1/master
...
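
For illustration, the mapping from an individual feed to the aggregated
layout above can be written as a git refspec. The Python helper below is
a sketch only; it assumes each developer epoch repository exposes its
feed as refs/heads/master, and the helper names are invented here.

```python
def feed_refspec(developer, epoch):
    """Refspec mapping one epoch of a developer feed into refs/feeds/."""
    if "/" in developer or not developer:
        raise ValueError("developer name must be a single path component")
    return f"refs/heads/master:refs/feeds/{developer}/{epoch}/master"


def feed_refspecs(developer, epochs):
    """Refspecs for every epoch (0, 1, ...) of one developer feed."""
    return [feed_refspec(developer, epoch) for epoch in range(epochs)]
```

Each refspec would be passed to a `git fetch` from the corresponding
developer remote, one fetch per epoch repository.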
Already, this gives us the following perks:
- cryptographic attestation
- patches that are guaranteed against mangling by MTA software
- guaranteed spam-free message delivery from all the important people
- permanent, attestable and distributable archive
(With time, we can teach kernel.org to act as an MTA bridge that sends
actual mail to the mailing lists after we receive individual feed
updates.)
# Using public-inbox with structured data
One of the problems we are trying to solve is how to deliver structured
data like CI reports, bugs, issues, etc. in a decentralized fashion.
Instead of (or in addition to) sending mail to mailing lists and
individual developers, bots and bug-tracking tools can provide their own
feeds with structured data aimed at consumption by client-side and
server-side tools.
I suggest we use public-inbox feeds with structured data in addition to
human-readable data, using some universally adopted machine-parseable
format like JSON. In my mind, I see this working as a separate ref in
each individual feed, e.g.:
refs/heads/master -- RFC-2822 (email) feed for human consumption
refs/heads/json -- json feed for machine-readable structured data
E.g. syzbot could publish a human-readable message in master:
----
From: syzbot
To: [list of addressees here]
Subject: BUG: bad usercopy in read_rio
Date: Wed, 09 Oct 2019 09:09:06 -0700
Hello,
syzbot found the following crash on:
HEAD commit: 58d5f26a usb-fuzzer: main usb gadget fuzzer driver
git tree: https://github.com/google/kasan.git usb-fuzzer
console output: https://syzkaller.appspot.com/x/log.txt?x=149329b3600000
kernel config: https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58
dashboard link: https://syzkaller.appspot.com/bug?extid=43e923a8937c203e9954
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
...
----
The same data, including all the relevant info currently provided via
the syzkaller.appspot.com links, would be included in the corresponding
commit in refs/heads/json. This would allow client-side tools to present
it to the developer without requiring that they view it on the internet
(or it could simply be kept for archival purposes).
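
As a rough idea of what the machine-readable counterpart could contain,
this Python sketch turns the "key: value" lines of a report body like
the one above into a JSON object. No schema has been proposed yet, so
the field naming (lowercased keys with underscores) is purely an
assumption for this example.

```python
import json


def report_to_json(body):
    """Collect 'key: value' lines from a report body into a JSON object."""
    record = {}
    for line in body.splitlines():
        # Split on the first colon only, so URLs in the value survive.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            record[key.strip().lower().replace(" ", "_")] = value.strip()
    return json.dumps(record, sort_keys=True)
```

Lines without a value after the colon (such as "syzbot found the
following crash on:") are skipped. A real feed would more likely be
generated from the bot's internal data than parsed back out of the
human-readable text.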
The same approach can be used by bugzilla and any other bug-tracking
software -- a human-readable commit in master, plus a corresponding
machine-formatted commit in refs/heads/json. Minor record changes that
aren't intended for humans can omit the commit in master (to avoid
the usual noise of "so-and-so started following this bug" messages). All
commits would be cryptographically signed and fully attestable.
All these feeds can be aggregated centrally by entities like kernel.org
for ease of discovery and replication, though this process would be
human-administered and not automatic.
# Where this falls short
This is an archival solution first and foremost and not a true
distributed, decentralized communication fabric. It solves the following
problems:
- it gets us cryptographically attestable feeds from important people
with little effort on their part (after initial setup)
- it allows centralized tools (bots, forges, bug trackers, CI) to
export internal data so it can be preserved for future reference or
consumed directly by client-side tools -- though it obviously
requires that vendors jump on this bandwagon and don't simply ignore
it
- it uses existing technologies that are known to work well together
(public-inbox, git) and doesn't require that we adopt any nascent
technologies like SSB that are still in early stages of development
and haven't yet had time to mature
What this doesn't fix:
- we still continue to largely rely on email and mailing lists, though
theoretically their use would become less important as more
developer feeds are aggregated and maintainer tools start to rely on
those as their primary source of truth. We can easily see a future
where vger.kernel.org just writes to public-inbox archives and
leaves mail delivery and subscription management up to someone else.
- we still need aggregation authorities like kernel.org -- though we
can hedge this by having multiple mirrors and publishing a manifest
of feeds that can be pulled individually if needed
- this doesn't really get us built-in encrypted communication between
developers, though we can think of some clever solutions, such as
keypairs per incident that are initially only distributed to members
of security@kernel.org and then disclosed publicly after the embargo
is lifted, allowing anyone interested to go back and read the
encrypted discussion for the purpose of full transparency.
The main upside of this approach is that it's evolutionary and not
revolutionary and we can start implementing it right away, using it to
augment and improve mailing lists instead of replacing them outright.
-K