Hi, all:

The idea of using public-inbox repositories as individual feeds has been
mentioned here a couple of times already, and I would like to propose a
tentative approach that could work without needing to involve SSB or
other protocols.

# What are public-inbox repos?

Public-inbox (v2) uses git to archive mail messages, with the following
general structure:

topdir/
  0.git/
  1.git/
  ...

Each of these git repositories has a single ref, master, with a single
file "m" containing the entire body of the message, e.g.:
- https://erol.kernel.org/workflows/git/0/tree/m

Each incoming message overwrites this file and creates a new commit,
e.g.:
- https://erol.kernel.org/workflows/git/0/log/m

This has the following upsides:

- with a single file, git commit operations are very fast
- git performance remains pretty much unaffected as the repository
  grows, since there aren't more and more objects to hash (the main
  downside of public-inbox v1)
- it is easy to get the contents of any message by simply performing
  `git show <commit hash>:m`, which is a very fast operation even for
  very old messages in the archive
- most language environments have decent git libraries, so writing
  tooling around git repositories is easy
- git is really good at replicating itself, especially with a single
  ref
- git supports commit signing, so all commits can have cryptographic
  attestation if the tools are configured to do that

There are a few downsides to this, too:

- git maintenance tools like git-repack don't expect repository
  contents to be 90-100% rewritten with every new commit, so by default
  they will attempt many rather useless optimizations looking for
  non-existent deltas (but this can be tweaked in config files)
- most useful operations require maintaining auxiliary databases, e.g.
  for message-id to commit-id mapping -- so repositories need to be
  indexed using public-inbox-index in order to be useful for more than
  just archival and replication.
  For huge repositories like LKML, the initial indexing takes a long
  time, though subsequent public-inbox-index calls after each
  `git remote update` are pretty quick.
- there is only rudimentary sharding into epochs, which makes partial
  replication tricky (e.g. "replicate just the archives from last
  October")

# Public-inbox repositories are feeds

Each public-inbox repository is therefore a consecutive feed of
messages in the same sense something like SSB or NNTP is (for this
reason, there's robust NNTP support in public-inbox). Public-inbox
feeds are:

- distributed
- immutable (or tamper-evident once replicated, which is effectively
  the same as immutable if git is configured to reject non-ff updates)
- cryptographically attestable, if commit signing is used

# Individual developer feeds

Individual developers can begin providing their own public-inbox
feeds. At the start, they can act as a sort of a "public sent-mail
folder" -- a simple tool would monitor the local/IMAP "sent" folder
and add any new mail it finds (sent to specific mailing lists) to the
developer's local public-inbox instance. Every commit will be
automatically signed and pushed out to a public remote.

On the kernel.org side, we can collate these individual feeds and
mirror them into an aggregated feeds repository, with a ref per
individual developer, like so:

refs/feeds/gregkh/0/master
refs/feeds/davem/0/master
refs/feeds/davem/1/master
...

Already, this gives us the following perks:

- cryptographic attestation
- patches that are guaranteed against mangling by MTA software
- guaranteed spam-free message delivery from all the important people
- permanent, attestable and distributable archive

(With time, we can teach kernel.org to act as an MTA bridge that sends
actual mail to the mailing lists after we receive individual feed
updates.)
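The "tamper-evident once replicated" property is worth a quick illustration. Below is a toy hash chain in Python -- this is not public-inbox code (real feeds are ordinary git commits, optionally GPG-signed), just a sketch of why rewriting any archived message invalidates every later commit id once a mirror refuses non-fast-forward updates:

```python
import hashlib

def commit_id(parent: str, message: bytes) -> str:
    # A git commit hash covers the content (here: the message body) and
    # the parent commit id, so every entry pins all history before it.
    return hashlib.sha1(parent.encode() + b"\0" + message).hexdigest()

def append(feed: list, message: bytes) -> None:
    # Append a message to the feed, chaining it to the previous commit.
    parent = feed[-1][0] if feed else ""
    feed.append((commit_id(parent, message), message))

def verify(feed: list) -> bool:
    # A mirror can recompute the chain; any rewritten message no longer
    # matches its recorded commit id (or changes every id after it).
    parent = ""
    for cid, message in feed:
        if cid != commit_id(parent, message):
            return False
        parent = cid
    return True

feed = []
append(feed, b"From: dev@example.org\n\nFirst patch...")
append(feed, b"From: dev@example.org\n\nSecond patch...")
assert verify(feed)

# Tampering: replace the first message body but keep its commit id.
tampered = [(feed[0][0], b"From: dev@example.org\n\nForged body")] + feed[1:]
assert not verify(tampered)
```

The addresses and messages above are made up; the point is only that a replicated single-ref history behaves like an append-only log.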
# Using public-inbox with structured data

One of the problems we are trying to solve is how to deliver
structured data like CI reports, bugs, issues, etc in a decentralized
fashion. Instead of (or in addition to) sending mail to mailing lists
and individual developers, bots and bug-tracking tools can provide
their own feeds with structured data aimed at consumption by
client-side and server-side tools.

I suggest we use public-inbox feeds with structured data in addition
to human-readable data, using some universally adopted
machine-parseable format like JSON. In my mind, I see this working as
a separate ref in each individual feed, e.g.:

refs/heads/master -- RFC-2822 (email) feed for human consumption
refs/heads/json   -- json feed for machine-readable structured data

E.g. syzbot could publish a human-readable message in master:

----
From: syzbot
To: [list of addressees here]
Subject: BUG: bad usercopy in read_rio
Date: Wed, 09 Oct 2019 09:09:06 -0700

Hello,

syzbot found the following crash on:

HEAD commit:    58d5f26a usb-fuzzer: main usb gadget fuzzer driver
git tree:       https://github.com/google/kasan.git usb-fuzzer
console output: https://syzkaller.appspot.com/x/log.txt?x=149329b3600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58
dashboard link: https://syzkaller.appspot.com/bug?extid=43e923a8937c203e9954
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

...
----

The same data, including all the relevant info provided via the
syzkaller.appspot.com links, would be included in the
structured-section commit, allowing client-side tools to present it to
the developer without requiring that they view it on the internet (or
simply included for archival purposes).

The same approach can be used by bugzilla and any other bug-tracking
software -- a human-readable commit in master, plus a corresponding
machine-formatted commit in refs/heads/json.
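For illustration, one entry in the proposed json ref might look like the sketch below. The schema is entirely hypothetical -- neither syzbot nor public-inbox defines one -- with field names simply mirroring the human-readable syzbot report above:

```python
import json

# Hypothetical schema for a structured-feed entry; the keys just
# mirror the fields of the human-readable crash report.
entry = {
    "type": "crash-report",
    "reporter": "syzbot",
    "subject": "BUG: bad usercopy in read_rio",
    "date": "2019-10-09T09:09:06-07:00",
    "head_commit": "58d5f26a",
    "git_tree": "https://github.com/google/kasan.git usb-fuzzer",
    "console_output": "https://syzkaller.appspot.com/x/log.txt?x=149329b3600000",
    "kernel_config": "https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58",
    "dashboard": "https://syzkaller.appspot.com/bug?extid=43e923a8937c203e9954",
    "compiler": "gcc (GCC) 9.0.0 20181231 (experimental)",
}

# The json ref would carry one such document per commit, so a bot can
# retrieve `git show <commit>:m` and feed it straight to a JSON parser.
blob = json.dumps(entry, indent=2, sort_keys=True)
assert json.loads(blob)["head_commit"] == "58d5f26a"
```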
Minor record changes that aren't intended for humans can omit the
commit in master (to avoid the usual noise of "so-and-so started
following this bug" messages). All commits would be cryptographically
signed and fully attestable.

All these feeds can be aggregated centrally by entities like
kernel.org for ease of discovery and replication, though this process
would be human-administered and not automatic.

# Where this falls short

This is an archival solution first and foremost and not a true
distributed, decentralized communication fabric. It solves the
following problems:

- it gets us cryptographically attestable feeds from important people
  with little effort on their part (after initial setup)
- it allows centralized tools (bots, forges, bug trackers, CI) to
  export internal data so it can be preserved for future reference or
  consumed directly by client-side tools -- though it obviously
  requires that vendors jump on this bandwagon and don't simply
  ignore it
- it uses existing technologies that are known to work well together
  (public-inbox, git) and doesn't require that we adopt any nascent
  technologies like SSB that are still in early stages of development
  and haven't yet had time to mature

What this doesn't fix:

- we still continue to largely rely on email and mailing lists, though
  theoretically their use would become less important as more
  developer feeds are aggregated and maintainer tools start to rely on
  those as their primary source of truth. We can easily see a future
  where vger.kernel.org just writes to public-inbox archives and
  leaves mail delivery and subscription management up to someone else.
- we still need aggregation authorities like kernel.org -- though we
  can hedge this by having multiple mirrors and publishing a manifest
  of feeds that can be pulled individually if needed
- this doesn't really get us built-in encrypted communication between
  developers, though we can think of some clever solutions, such as
  per-incident keypairs that are initially only distributed to members
  of security@kernel.org and then disclosed publicly after the embargo
  is lifted, allowing anyone interested to go back and read the
  encrypted discussion for the sake of full transparency.

The main upside of this approach is that it's evolutionary and not
revolutionary and we can start implementing it right away, using it to
augment and improve mailing lists instead of replacing them outright.

-K
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:

<snip a bunch of stuff I agree with>

> # Individual developer feeds

<snip>

> (With time, we can teach kernel.org to act as an MTA bridge that sends
> actual mail to the mailing lists after we receive individual feed
> updates.)

I'm skeptical and pessimistic about that bit happening (as I usually
am :>). But the great thing is that all that stuff can happen without
disrupting/changing existing workflows, and it is totally optional.

> # Using public-inbox with structured data
>
> One of the problems we are trying to solve is how to deliver structured
> data like CI reports, bugs, issues, etc in a decentralized fashion.
> Instead of (or in addition to) sending mail to mailing lists and
> individual developers, bots and bug-tracking tools can provide their
> own feeds with structured data aimed at consumption by client-side and
> server-side tools.
>
> I suggest we use public-inbox feeds with structured data in addition to
> human-readable data, using some universally adopted machine-parseable
> format like JSON. In my mind, I see this working as a separate ref in
> each individual feed, e.g.:
>
> refs/heads/master -- RFC-2822 (email) feed for human consumption
> refs/heads/json -- json feed for machine-readable structured data

Having a side-channel in addition to email makes people learn and use
new tools (not good). Furthermore, that data would likely end up in
commit messages and have to be translated from JSON... Instead, the
structured data should be RFC822-like, so "git interpret-trailers" can
write it. It'd probably be similar to Debbugs:
https://lore.kernel.org/workflows/20191008213626.GB8130@dcvr/

> E.g.
> syzbot could publish a human-readable message in master:
>
> ----
> From: syzbot
> To: [list of addressees here]
> Subject: BUG: bad usercopy in read_rio
> Date: Wed, 09 Oct 2019 09:09:06 -0700
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    58d5f26a usb-fuzzer: main usb gadget fuzzer driver
> git tree:       https://github.com/google/kasan.git usb-fuzzer
> console output: https://syzkaller.appspot.com/x/log.txt?x=149329b3600000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58
> dashboard link: https://syzkaller.appspot.com/bug?extid=43e923a8937c203e9954
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

That's already close enough to git trailers (s/ /-/).

> ...
> ----
>
> The same data, including all the relevant info provided via
> syzkaller.appspot.com links would be included in the structured-section
> commit, allowing client-side tools to present it to the developer
> without requiring that they view it on the internet (or simply included
> for archival purposes).

That seems redundant, given the above.

> The same approach can be used by bugzilla and any other bug-tracking
> software -- a human-readable commit in master, plus a corresponding
> machine-formatted commit in refs/heads/json. Minor record changes that
> aren't intended for humans can omit the commit in master (to avoid
> the usual noise of "so-and-so started following this bug" messages).
> All commits would be cryptographically signed and fully attestable.

If those bug trackers can already interpret things like "Fixes:" in
kernel commit messages, making them deal with JSON or another channel
is too much. If they can't deal with "Fixes:", then there's no
expectation they'd deal with a new JSON thing, either. "So-and-so
following" messages don't need to be public info.

> All these feeds can be aggregated centrally by entities like kernel.org
> for ease of discovery and replication, though this process would be
> human-administered and not automatic.
> # Where this falls short
>
> This is an archival solution first and foremost and not a true
> distributed, decentralized communication fabric. It solves the
> following problems:
>
> - it gets us cryptographically attestable feeds from important people
>   with little effort on their part (after initial setup)
> - it allows centralized tools (bots, forges, bug trackers, CI) to
>   export internal data so it can be preserved for future reference or
>   consumed directly by client-side tools -- though it obviously
>   requires that vendors jump on this bandwagon and don't simply
>   ignore it
> - it uses existing technologies that are known to work well together
>   (public-inbox, git) and doesn't require that we adopt any nascent
>   technologies like SSB that are still in early stages of development
>   and haven't yet had time to mature

Even the JSON feed is too much to ask people to adopt.

<snip>

> The main upside of this approach is that it's evolutionary and not
> revolutionary and we can start implementing it right away, using it to
> augment and improve mailing lists instead of replacing them outright.

That. We should take these one small step at a time and see where
things take us. The key is to remain harmonious with existing
workflows and be transparent to people who won't change. The same
thing worked for git-svn obsoleting Subversion. I just don't want to
end up with a proprietary/centralized InboxHub this time around :P
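For comparison, the trailer-based alternative suggested above could be consumed with nothing more than RFC822-style parsing. A minimal sketch -- the s/ /-/ rewriting and this toy parser are illustrative only; the real consumer would be `git interpret-trailers --parse`:

```python
import re

# The syzbot header block with spaces in keys replaced per s/ /-/,
# so each line is a valid git trailer (this rewriting is hypothetical).
report = """\
HEAD-commit: 58d5f26a usb-fuzzer: main usb gadget fuzzer driver
git-tree: https://github.com/google/kasan.git usb-fuzzer
console-output: https://syzkaller.appspot.com/x/log.txt?x=149329b3600000
"""

def parse_trailers(text: str) -> dict:
    # Minimal "Key: value" parsing of trailer-shaped lines.
    trailers = {}
    for line in text.splitlines():
        m = re.match(r"^([A-Za-z0-9-]+):\s*(.+)$", line)
        if m:
            trailers[m.group(1)] = m.group(2)
    return trailers

t = parse_trailers(report)
assert t["HEAD-commit"].startswith("58d5f26a")
assert "git-tree" in t
```

The appeal of this shape is that the same data stays readable to humans and to existing tools that already handle "Fixes:"-style trailers.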
On Thu, Oct 10, 2019 at 9:29 PM Konstantin Ryabitsev
<konstantin@linuxfoundation.org> wrote:
>
> Hi, all:
>
> The idea of using public-inbox repositories as individual feeds has been
> mentioned here a couple of times already, and I would like to propose a
> tentative approach that could work without needing to involve SSB or
> other protocols.

<snip the full proposal, quoted in its entirety above>

> The main upside of this approach is that it's evolutionary and not
> revolutionary and we can start implementing it right away, using it to
> augment and improve mailing lists instead of replacing them outright.

Interesting.
This is similar to SSB on _some_ level, right? Because it's just a
different type of transport. I personally don't have any horses in the
transport race (as long as it is easy to set up and provides a good
foundation for transferring structured data).

What attracted my attention is this part:

refs/feeds/gregkh/0/master
refs/feeds/davem/0/master
refs/feeds/davem/1/master

Will this provide a total ordering over all messages by all
participants? That may be a significant advantage over SSB then (see
point 14 in [1]). But the "that can be pulled individually" part
breaks this (complete read-only mirrors for fault-tolerance are fine,
though).

This may also need some form of DoS protection (especially as we move
further away from email).

I also tend to conclude that some actions should not be done offline
and then "synced" a week later. Ted provided an example of starting
tests in another thread. Or, say, if you close a bug and then push
that update a month later without any regard to the current bug state,
that may not be the right thing. Working with read-only data offline
is perfectly fine. Doing _some_ updates locally and then pushing a
week later is fine (e.g. queueing a new patch for review). But not all
updates should necessarily be doable in offline mode. And this seems
to be in inherent conflict with any scheme where one can "queue"
arbitrary updates locally, and then "sync" them anytime later without
any regard to the current state of things, just telling the system and
all other participants "deal with it". Also, if we have any kind of
permissions/quotas, when are these checks done: when one creates an
update or when it's synced?

This is interesting too:

refs/heads/master -- RFC-2822 (email) feed for human consumption
refs/heads/json -- json feed for machine-readable structured data

Playing devil's advocate, what about MIME? :)

It does not need to be completely arbitrary MIME; say, only two
alternative sections, where the first has to be text/plain and the
second (optional) has to be kthul/json.
Say, "kthul mail" creates that properly formed email with plain text
and all structured data. Or a CI system creates both the
human-readable and the machine-readable form. It seems reasonable to
keep both versions together. Though it's not that I have thought it
all out and am strongly advocating this -- just a potentially
interesting option.

[1] https://lore.kernel.org/workflows/CACT4Y+YU78dQUeFob7NXaOU-gjnKHtxpceQj2c4=2aBV0_PSxg@mail.gmail.com/T/#t
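The two-alternative MIME idea can be sketched with Python's stdlib email package. Since kthul/json is not a registered MIME type, application/json stands in for it here, and the message contents are made up:

```python
from email.message import EmailMessage

# One message, two alternative renderings: text/plain for humans,
# a JSON part for bots walking the MIME tree.
msg = EmailMessage()
msg["From"] = "syzbot"
msg["Subject"] = "BUG: bad usercopy in read_rio"
msg.set_content("syzbot found the following crash on HEAD commit 58d5f26a ...")
msg.add_alternative(
    b'{"type": "crash-report", "head_commit": "58d5f26a"}',
    maintype="application",
    subtype="json",
)

# A human MUA renders the text/plain alternative; a bot picks the
# structured one.
types = [p.get_content_type() for p in msg.walk()]
assert types == ["multipart/alternative", "text/plain", "application/json"]
```

One trade-off versus separate refs: a bot must download and parse the whole message to reach the structured part, rather than pulling a machine-only ref.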
Hi Dmitry,
On Fri, Oct 11, 2019 at 7:15 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> I also tend to conclude that some actions should not be done offline
> and then "synced" a week later. Ted provided an example of starting
> tests in another thread. Or, say if you close a bug and then push than
> update a month later without any regard to the current bug state, that
> may not be the right thing. Working with read-only data offline is
> perfectly fine. Doing _some_ updates locally and then pushing a week
> later is fine (e.g. queue a new patch for review). But not necessary
> all updates should be doable in offline mode. And this seems to be
> inherent conflict with any scheme where one can "queue" any updates
> locally, and then "sync" them anytime later without any regard to the
> current state of things and just tell the system and all other
> participants "deal with it". Also, if we have any kind of
> permissions/quotas, when are these checks done: when one creates an
> update or when it's synced?
Not unlike "git push" accepting fast-forwards only, and rejecting
forced updates.
Hence you cannot push the close of a bug (each bug has its own
branch?) without merging the updated remote state first.
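The fast-forward rule can be modeled in a few lines. This is a toy ancestry check, not git code; the commit ids and parent links are illustrative:

```python
# A push is accepted only if the ref currently on the server is an
# ancestor of the pushed commit (i.e. the push is a fast-forward).
parents = {
    "a1": None,   # initial bug report
    "b2": "a1",   # comment added on the server while we were offline
    "c3": "a1",   # our local "close the bug" commit, still based on a1
    "d4": "b2",   # close commit recreated on top of the server state
}

def is_fast_forward(old: str, new: str) -> bool:
    # Walk new's ancestry; old must appear in it.
    cur = new
    while cur is not None:
        if cur == old:
            return True
        cur = parents[cur]
    return False

# Server is at b2: pushing c3 (which doesn't contain b2) is rejected;
# pushing d4 (merged/rebased on top of b2) is accepted.
assert not is_fast_forward("b2", "c3")
assert is_fast_forward("b2", "d4")
```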
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Fri, Oct 11, 2019 at 09:07:20PM +0200, Geert Uytterhoeven wrote:
> Hi Dmitry,
>
> On Fri, Oct 11, 2019 at 7:15 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> > I also tend to conclude that some actions should not be done offline
> > and then "synced" a week later. Ted provided an example of starting
> > tests in another thread. Or, say if you close a bug and then push than
> > update a month later without any regard to the current bug state, that
> > may not be the right thing. Working with read-only data offline is
> > perfectly fine. Doing _some_ updates locally and then pushing a week
> > later is fine (e.g. queue a new patch for review). But not necessary
> > all updates should be doable in offline mode. And this seems to be
> > inherent conflict with any scheme where one can "queue" any updates
> > locally, and then "sync" them anytime later without any regard to the
> > current state of things and just tell the system and all other
> > participants "deal with it". Also, if we have any kind of
> > permissions/quotas, when are these checks done: when one creates an
> > update or when it's synced?
>
> Not unlike "git push" accepting fast-forwards only, and rejecting
> forced updates.
> Hence you cannot push the close of a bug (each bug has its own
> branch?) before merging the updated remote state first.
That might work in small projects, but at a bigger scale you soon start
hitting races to get to the build bot before everybody else, and the CI
system gets trashed with cycles of lost races, rebase and retry. It's
not something we could enforce globally.
--
Regards,
Laurent Pinchart
On Fri, Oct 11, 2019 at 07:15:12PM +0200, Dmitry Vyukov wrote:
> > The main upside of this approach is that it's evolutionary and not
> > revolutionary and we can start implementing it right away, using it
> > to augment and improve mailing lists instead of replacing them
> > outright.
>
> Interesting. This is similar to SSB on _some_ level, right? Because
> it's just a different type of transport. I personally don't have any
> horses in the transport race (as long as it is easy to setup and
> provides a good foundation for transferring structured data).

It's similar only in the sense that it's a chain of records that can
optionally be cryptographically signed. Some of the problems that SSB
(and especially v2) tries to solve are not anything git concerns
itself with, such as discovery, feed cross-referencing, verifiable
partial clones, etc.

> What attracted my attention is this part:
>
> refs/feeds/gregkh/0/master
> refs/feeds/davem/0/master
> refs/feeds/davem/1/master
>
> Will this provide a total ordering over all messages by all
> participants? That may be a significant advantage over SSB then (see
> point 14 in [1]). But the "that can be pulled individually" part
> breaks this (complete read-only mirrors for fault-tolerance are fine,
> though).

No, these refs are entirely independent of each other. In a sense,
it's the equivalent of cloning individual public-inbox repos together
and then tar'ing them up. For ordering, we still have to go with
commit timestamps, and we'll still have conflicting resolutions, just
like you mention (though this isn't any different than with email).

> This may also need some form of DoS protection (esp as we move further
> from email).

Well, amusingly, there are ways of distributing git via decentralized
protocols (SSB, DAT, IPFS). They are all fairly immature, though, and
some of them are truly terrible ideas. For the moment, our best
protection against DoS attacks on git repos is having many frontends,
some powerful allies (e.g.
see kernel.googlesource.com), and DoS-avoidance by obscurity ("I can't
push to kernel.org right now, but you can pull my repo from my
personal server over here").

> I also tend to conclude that some actions should not be done offline
> and then "synced" a week later. Ted provided an example of starting
> tests in another thread. Or, say if you close a bug and then push than
> update a month later without any regard to the current bug state, that
> may not be the right thing.

The same is true with email, though -- people queueing up mail in
their outbox and losing connectivity before they can send it out is
something that happens often. True, we aren't solving this, but it's
not a net-new problem, and it will always be a hard problem to solve
for laggy decentralized environments.

> Working with read-only data offline is
> perfectly fine. Doing _some_ updates locally and then pushing a week
> later is fine (e.g. queue a new patch for review). But not necessary
> all updates should be doable in offline mode. And this seems to be
> inherent conflict with any scheme where one can "queue" any updates
> locally, and then "sync" them anytime later without any regard to the
> current state of things and just tell the system and all other
> participants "deal with it".

Well, in all honesty, "queueing things up for a week" is going to be
an increasingly rare problem for anyone who works on the Linux kernel.
I don't know about others, but I can recall every time I've actually
been offline in the past year, and each case involved a cross-Atlantic
flight with totally broken wi-fi or a trip to a rare spot on the map
without cell towers. Even long power outages simply mean I have to
tether my laptop via my phone. Thanks to wireguard, I don't even lose
ssh sessions when that happens. :) Replicating a feed out is a very
quick task that can be made quicker with tricks like ssh ControlMaster
connections that keep sessions going.
> This is interesting too:
>
> refs/heads/master -- RFC-2822 (email) feed for human consumption
> refs/heads/json -- json feed for machine-readable structured data
>
> Playing devil's advocate, what about MIME? :)
> It does not need to be completely arbitrary MIME, but say only 2
> alternative section, first has to be plain/text, second (optional) has
> to be kthul/json.

The main reason I wanted two different refs is so that entities like
bots could pull only the json ref and ignore the one aimed at humans.
So, while this makes the repository larger through some data
duplication, it should make pulling and parsing less problematic for
bots, and I expect bots to be the ones generating the most frequent
hits and traffic.

-K
On 10/10/19 9:28 PM, Konstantin Ryabitsev wrote:
[...]
> # Individual developer feeds
>
> Individual developers can begin providing their own public-inbox feeds.
> At the start, they can act as a sort of a "public sent-mail folder" --
> a simple tool would monitor the local/IMAP "sent" folder and add any
> new mail it finds (sent to specific mailing lists) to the developer's
> local public-inbox instance. Every commit will be automatically signed
> and pushed out to a public remote.
>
> On the kernel.org side, we can collate these individual feeds and
> mirror them into an aggregated feeds repository, with a ref per
> individual developer, like so:
>
> refs/feeds/gregkh/0/master
> refs/feeds/davem/0/master
> refs/feeds/davem/1/master
> ...
>
> Already, this gives us the following perks:
>
> - cryptographic attestation
> - patches that are guaranteed against mangling by MTA software
> - guaranteed spam-free message delivery from all the important people
> - permanent, attestable and distributable archive
>
> (With time, we can teach kernel.org to act as an MTA bridge that sends
> actual mail to the mailing lists after we receive individual feed
> updates.)
[...]
> - we still continue to largely rely on email and mailing lists, though
>   theoretically their use would become less important as more developer
>   feeds are aggregated and maintainer tools start to rely on those as
>   their primary source of truth. We can easily see a future where
>   vger.kernel.org just writes to public-inbox archives and leaves mail
>   delivery and subscription management up to someone else.
[...]
> The main upside of this approach is that it's evolutionary and not
> revolutionary and we can start implementing it right away, using it to
> augment and improve mailing lists instead of replacing them outright.

I do like these aspects, and the receive side (aka git to mail client
integration) is already done, so the one missing piece is a sendmail
drop-in replacement acting as a public git sent-mail folder.
I think it doesn't have to be on kernel.org, but could live anywhere -- developers could also push to github or elsewhere with such a tool. "Subscribing" to a mailing list for sending would then need kernel.org infra that adds the repo to a list of repos to pull from, extracts <commit hash>:m from that developer's repo from the point where it was last read up to the git HEAD (rejecting any forced pushes, and doing sanity checks on m). Each m would then be committed conflict-free to the official public-inbox repositories of the lists in Cc in m, and potentially sent from kernel.org via the MTA bridge to old-style mail receivers.

The nice thing is that this would allow for transparent testing/roll-out alongside today's development workflow. It might be one component/(sub-)tool of the bigger picture to have email slowly fade out (and new/non-mail based tools could be built around it, too).

Thanks,
Daniel
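The "read from the last-seen point up to HEAD, rejecting forced pushes" step above reduces to a fast-forward check on the feed's commit history. A sketch of that logic over a plain list of commit ids, oldest first (a stand-in for what the real tool would get from `git rev-list --reverse`):

```python
def new_messages(history, last_seen=None):
    """Return commit ids added since last_seen, refusing non-fast-forward
    updates: if last_seen is no longer in the history, the feed was
    rewritten (forced push) and must not be ingested."""
    if last_seen is None:          # first pull: everything is new
        return list(history)
    if last_seen not in history:   # forced push / rewritten feed
        raise ValueError("non-fast-forward update, refusing to ingest")
    return history[history.index(last_seen) + 1:]

feed = ["c1", "c2", "c3", "c4"]
print(new_messages(feed, "c2"))   # ['c3', 'c4'] still to be ingested
```

For each returned commit the ingester would run the sanity checks on `<commit>:m` and then append it to the official list archive.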
On Thu, Oct 10, 2019 at 03:28:52PM -0400, Konstantin Ryabitsev wrote:
> # Individual developer feeds
>
> Individual developers can begin providing their own public-inbox feeds.
> At the start, they can act as a sort of a "public sent-mail folder" -- a
> simple tool would monitor the local/IMAP "sent" folder and add any new mail
> it finds (sent to specific mailing lists) to the developer's local
> public-inbox instance. Every commit will be automatically signed and pushed
> out to a public remote.
>
> On the kernel.org side, we can collate these individual feeds and mirror
> them into an aggregated feeds repository, with a ref per individual
> developer, like so:
>
> refs/feeds/gregkh/0/master

The stuff I send out is probably not all that interesting compared to what is sent to me, given that I receive way more than I send.

> refs/feeds/davem/0/master
> refs/feeds/davem/1/master
> ...
>
> Already, this gives us the following perks:
>
> - cryptographic attestation
> - patches that are guaranteed against mangling by MTA software
> - guaranteed spam-free message delivery from all the important people
> - permanent, attestable and distributable archive
>
> (With time, we can teach kernel.org to act as an MTA bridge that sends
> actual mail to the mailing lists after we receive individual feed updates.)

This would work well for developers that are "large producers", but that doesn't help maintainers much, right? I think I'm missing something, but what would a "feed that only comes from gregkh" help out with? Who wants to consume that?

> # Using public-inbox with structured data
>
> One of the problems we are trying to solve is how to deliver structured data
> like CI reports, bugs, issues, etc in a decentralized fashion. Instead of
> (or in addition to) sending mail to mailing lists and individual developers,
> bots and bug-tracking tools can provide their own feeds with structured data
> aimed at consumption by client-side and server-side tools.
>
> I suggest we use public-inbox feeds with structured data in addition to
> human-readable data, using some universally adopted machine-parseable
> format like JSON. In my mind, I see this working as a separate ref in each
> individual feed, e.g.:
>
> refs/heads/master -- RFC-2822 (email) feed for human consumption
> refs/heads/json -- json feed for machine-readable structured data
>
> E.g. syzbot could publish a human-readable message in master:
>
> ----
> From: syzbot
> To: [list of addressees here]
> Subject: BUG: bad usercopy in read_rio
> Date: Wed, 09 Oct 2019 09:09:06 -0700
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit: 58d5f26a usb-fuzzer: main usb gadget fuzzer driver
> git tree: https://github.com/google/kasan.git usb-fuzzer
> console output: https://syzkaller.appspot.com/x/log.txt?x=149329b3600000
> kernel config: https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58
> dashboard link: https://syzkaller.appspot.com/bug?extid=43e923a8937c203e9954
> compiler: gcc (GCC) 9.0.0 20181231 (experimental)
>
> ...
> ----
>
> The same data, including all the relevant info provided via
> syzkaller.appspot.com links, would be included in the structured-section
> commit, allowing client-side tools to present it to the developer without
> requiring that they view it on the internet (or simply included for
> archival purposes).
>
> The same approach can be used by bugzilla and any other bug-tracking
> software -- a human-readable commit in master, plus a corresponding
> machine-formatted commit in refs/heads/json. Minor record changes that
> aren't intended for humans can omit the commit in master (to avoid
> the usual noise of "so-and-so started following this bug" messages). All
> commits would be cryptographically signed and fully attestable.
>
> All these feeds can be aggregated centrally by entities like kernel.org for
> ease of discovery and replication, though this process would be
> human-administered and not automatic.
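The structured counterpart of the quoted syzbot report could carry the same fields as a JSON record in the json ref's "m" file. A sketch of one such record (all field names are invented for illustration; no schema exists yet, and only the values come from the quoted message):

```python
import json

# Hypothetical structured record mirroring the human-readable syzbot
# message; every value below is taken from the quoted example.
record = {
    "type": "crash-report",
    "subject": "BUG: bad usercopy in read_rio",
    "head_commit": "58d5f26a",
    "git_tree": "https://github.com/google/kasan.git usb-fuzzer",
    "dashboard": "https://syzkaller.appspot.com/bug?extid=43e923a8937c203e9954",
}

# Each feed commit would overwrite "m" with the serialized form:
m = json.dumps(record, indent=2, sort_keys=True)
print(json.loads(m)["subject"])  # BUG: bad usercopy in read_rio
```

A client-side tool would then `git show <commit>:m` on the json ref and render the record locally, with no trip to syzkaller.appspot.com needed.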
>
> # Where this falls short
>
> This is an archival solution first and foremost and not a true distributed,
> decentralized communication fabric. It solves the following problems:
>
> - it gets us cryptographically attestable feeds from important people
> with little effort on their part (after initial setup)
> - it allows centralized tools (bots, forges, bug trackers, CI) to export
> internal data so it can be preserved for future reference or consumed
> directly by client-side tools -- though it obviously requires that
> vendors jump on this bandwagon and don't simply ignore it
> - it uses existing technologies that are known to work well together
> (public-inbox, git) and doesn't require that we adopt any nascent
> technologies like SSB that are still in early stages of development and
> haven't yet had time to mature
>
> What this doesn't fix:
>
> - we still continue to largely rely on email and mailing lists, though
> theoretically their use would become less important as more developer
> feeds are aggregated and maintainer tools start to rely on those as their
> primary source of truth. We can easily see a future where vger.kernel.org
> just writes to public-inbox archives and leaves mail delivery and
> subscription management up to someone else.

That last one would make the vger.kernel.org admins happy :)

> - we still need aggregation authorities like kernel.org -- though we can
> hedge this by having multiple mirrors and publishing a manifest of feeds
> that can be pulled individually if needed
> - this doesn't really get us builtin encrypted communication between
> developers, though we can think of some clever solutions, such as
> keypairs per incident that are initially only distributed to members
> of security@kernel.org and then disclosed publicly after embargo is
> lifted, allowing anyone interested to go back and read the encrypted
> discussion for the purpose of full transparency.
We have tools for that with Thomas's encrypted email server; I don't know if you want to roll that into this type of system or not.

> The main upside of this approach is that it's evolutionary and not
> revolutionary and we can start implementing it right away, using it to
> augment and improve mailing lists instead of replacing them outright.

Evolution is good. I think the slow migration of more people to using public-inbox archives instead of directly subscribing to mailing lists might help out a lot. Already it seems that lore.kernel.org is updated faster than my email server sees new messages :)

thanks,

greg k-h
Em Thu, 10 Oct 2019 15:28:52 -0400
Konstantin Ryabitsev <konstantin@linuxfoundation.org> escreveu:
> # Using public-inbox with structured data
>
> One of the problems we are trying to solve is how to deliver structured
> data like CI reports, bugs, issues, etc in a decentralized fashion.
> Instead of (or in addition to) sending mail to mailing lists and
> individual developers, bots and bug-tracking tools can provide their own
> feeds with structured data aimed at consumption by client-side and
> server-side tools.
>
> I suggest we use public-inbox feeds with structured data in addition to
> human-readable data, using some universally adopted machine-parseable
> format like JSON. In my mind, I see this working as a separate ref in
> each individual feed, e.g.:
>
> refs/heads/master -- RFC-2822 (email) feed for human consumption
> refs/heads/json -- json feed for machine-readable structured data
That sounds scary. I mean, now, instead of looking at one inbox,
we'll need to look at two that may contain the same message
(one in RFC-2822 and the other in JSON). Worse than that,
the contents of the human-readable one could differ from the
contents of the JSON one.
IMO, the best approach is to have just one format (whatever it is) and
some tool that would convert from it into JSON and/or RFC-2822.
Thanks,
Mauro
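Mauro's single-format-plus-converter idea could be a small tool over the existing RFC-2822 feed: parse "m" once and emit JSON on demand, so only one ref ever needs to exist and the two views cannot diverge. A sketch using Python's stdlib email parser (the choice of exported fields, and the sample message itself, are assumptions for illustration):

```python
import json
from email import message_from_string

# A stand-in for the contents of an archived "m" file.
RAW = """\
From: syzbot <syzbot@example.org>
To: workflows@vger.kernel.org
Subject: BUG: bad usercopy in read_rio
Date: Wed, 09 Oct 2019 09:09:06 -0700

syzbot found the following crash on:
HEAD commit: 58d5f26a
"""

def rfc2822_to_json(raw):
    """Convert one archived message into a JSON record on the fly,
    so the RFC-2822 feed stays the single source of truth."""
    msg = message_from_string(raw)
    return json.dumps({
        "from": msg["From"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "body": msg.get_payload(),
    })

print(json.loads(rfc2822_to_json(RAW))["subject"])
```

Structured data the bot wants to expose beyond the headers would still need a convention for embedding it in the message body, which is the part this sketch leaves open.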
Em Fri, 11 Oct 2019 15:39:00 -0400
Konstantin Ryabitsev <konstantin@linuxfoundation.org> escreveu:

> On Fri, Oct 11, 2019 at 07:15:12PM +0200, Dmitry Vyukov wrote:
> >This may also need some form of DoS protection (esp as we move further
> >from email).
>
> Well, amusingly, there are ways of distributing git via decentralized
> protocols (SSB, DAT, IPFS). They are all fairly immature, though, and
> some of them are truly terrible ideas.
>
> For the moment, our best protection against DoS attacks on git repos is
> having many frontends, some powerful allies (e.g. see
> kernel.googlesource.com), and DoS-avoidance by obscurity ("I can't push
> to kernel.org right now, but you can pull my repo from my personal
> server over here").

The way I see it, some spammer could send git pushes to the public-inbox with thousands of spam messages, which would be stored forever in the repository. So, if we're willing to implement this, we should have a solution for it from the beginning.

The only solution that sounds viable to me is a pre-receive hook at the git server that receives the commits. Such a hook would be customizable via .git/config, enabling or disabling the functionalities and setting the thresholds:

1) Prevent pushes with too many patches.

If the push has more than, let's say, 20 patches, it would reject the PR. Doing that should be easy.

2) Prevent commits from the same person once a certain threshold of patches per period of time is exceeded.

For example, no single developer (except maybe the inbox owner) should be allowed to send more than, let's say, 1000 messages per day.

3) Implement gray lists.

That would be more complex, but I guess it would be possible to implement a hook that, for example, checks whether the push comes from a known person (with patches signed with known keys) and/or a known IP address.
If not, it would push the contents to a separate gray-list repository, rejecting the change at the main one, and adding a notice for the maintainer when a new person is added to the gray list. If the owner of the public-inbox decides to accept the patch, they would simply merge the gray list for that committer into the main inbox, and the developer would be accepted as someone to trust.

4) Implement black lists.

If a previously trusted developer starts spamming or misbehaving, they would be added to a black-list file. Anyone there would have their pushes silently discarded.

Thanks,
Mauro
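Thresholds 1 and 2 above are simple enough to sketch as the decision function such a pre-receive hook would call (the limits, the owner bypass, and the function shape are all placeholders; nothing like this exists in public-inbox today):

```python
def check_push(pusher, patch_count, sent_today, *, is_owner=False,
               max_patches=20, max_per_day=1000):
    """Decide whether a pre-receive hook should accept a push.

    patch_count is the number of new commits in this push; sent_today
    is how many messages this pusher has already landed today.
    Returns (accepted, reason). The inbox owner bypasses the daily cap."""
    if patch_count > max_patches:
        return False, "too many patches in one push"
    if not is_owner and sent_today + patch_count > max_per_day:
        return False, "daily message quota exceeded"
    return True, "ok"

print(check_push("dev", 5, sent_today=10))   # (True, 'ok')
print(check_push("dev", 25, sent_today=0))   # (False, 'too many patches in one push')
```

The gray-list and black-list checks (3 and 4) would slot in as additional clauses keyed on known signing keys and a persisted block file, before the thresholds are consulted.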
On Fri, Oct 11, 2019 at 9:07 PM Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>
> Hi Dmitry,
>
> On Fri, Oct 11, 2019 at 7:15 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> > I also tend to conclude that some actions should not be done offline
> > and then "synced" a week later. Ted provided an example of starting
> > tests in another thread. Or, say if you close a bug and then push that
> > update a month later without any regard to the current bug state, that
> > may not be the right thing. Working with read-only data offline is
> > perfectly fine. Doing _some_ updates locally and then pushing a week
> > later is fine (e.g. queue a new patch for review). But not necessarily
> > all updates should be doable in offline mode. And this seems to be an
> > inherent conflict with any scheme where one can "queue" any updates
> > locally, and then "sync" them anytime later without any regard to the
> > current state of things and just tell the system and all other
> > participants "deal with it". Also, if we have any kind of
> > permissions/quotas, when are these checks done: when one creates an
> > update or when it's synced?
>
> Not unlike "git push" accepting fast-forwards only, and rejecting
> forced updates.
> Hence you cannot push the close of a bug (each bug has its own
> branch?) before merging the updated remote state first.
The update is in my private git. Nobody touched it and there are no
conflicts. The logical conflicts are in other people's gits.
Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
>
> <snip a bunch of stuff I agree with>
>
> > # Individual developer feeds
>
> <snip>
>
> > (With time, we can teach kernel.org to act as an MTA bridge that sends
> > actual mail to the mailing lists after we receive individual feed updates.)
>
> I'm skeptical and pessimistic about that bit happening (as I
> usually am :>). But the great thing is all that stuff can
> happen without disrupting/changing existing workflows and is
> totally optional.
Well, maybe less skeptical and pessimistic today...
Readers can look for messages intended for them on a DHT or some
other peer-to-peer system. Or maybe various search engines can
spring into existence or existing ones can be optimized for this.
Readers can opt into this by using invalid/mangled addresses
(e.g. "user@i-pull-my-email.invalid") and rely on that to find
messages intended for them.
Senders sending to them will get a bounce, see the address, and
hopefully assume the reader will see it eventually if any
publicly-archived address is also in the recipients list.
Or an alternate header (e.g. "Intended-To", "Intended-Cc")
can be used to avoid bounces (but MUAs would lose those on
"Reply-All"), so maybe putting those pseudo-headers in the
message body could work.
This will NOT solve the spam/flooding/malicious content problem.
However, the receiving end can still use SpamAssassin, rspamd,
or whatever pipe-friendly mail filters they want because it
still looks like mail.