Re: Structured feeds

From: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
To: Daniel Axtens <dja@axtens.net>
Cc: Dmitry Vyukov <dvyukov@google.com>,
	workflows@vger.kernel.org, automated-testing@yoctoproject.org,
	Brendan Higgins <brendanhiggins@google.com>,
	Han-Wen Nienhuys <hanwen@google.com>,
	Kevin Hilman <khilman@baylibre.com>,
	Veronika Kabatova <vkabatov@redhat.com>
Subject: Re: Structured feeds
Date: Wed, 6 Nov 2019 15:50:51 -0500
Message-ID: <20191106205051.56v25onrxkymrfjz@chatter.i7.local>
In-Reply-To: <8736f1hvbn.fsf@dja-thinkpad.axtens.net>

On Thu, Nov 07, 2019 at 02:35:08AM +1100, Daniel Axtens wrote:
>This is a non-trivial problem, fwiw. Patchwork's email parser clocks in
>at almost thirteen hundred lines, and that's with the benefit of the
>Python standard library. It also regularly gets patched to handle
>changes to email systems (e.g. DMARC), changes to git (git request-pull
>format changed subtly in 2.14.3), the bizarre ways people send email,
>and so on.

I'm actually very interested in seeing patchwork switch from being fed 
mail directly from postfix to using public-inbox repositories as its 
source of patches. I know it's easy enough to accomplish as-is, by 
piping things from public-inbox to parsemail.sh, but it would be even 
more awesome if patchwork learned to work with these repos natively.

The way I see it:

- site administrator configures upstream public-inbox feeds
- a backend process clones these repositories
   - if it doesn't find a refs/heads/json, then it does its own parsing 
     to generate a structured feed with patches/series/trailers/pull 
     requests, cross-referencing them by series as necessary. Something 
     like a subset of this, excluding patchwork-specific data:
     https://patchwork.kernel.org/api/1.1/patches/11177661/
   - if it does find an existing structured feed, it simply uses it (e.g.  
     it was made available by another patchwork instance)
- the same backend process updates the repositories from upstream using 
   proper manifest files (e.g. see 
   https://lore.kernel.org/workflows/manifest.js.gz)

- patchwork projects then consume one (or more) of these structured 
   feeds to generate the actionable list of patches that maintainers can 
   use, perhaps with optional filtering by specific headers (list-id, 
   from, cc), patch paths, keywords, etc.
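
A minimal sketch of that last filtering step, assuming each structured
feed has already been loaded into a list of records. The FeedEntry and
ProjectFilter names and their fields are hypothetical, chosen only for
illustration; they are not part of patchwork or public-inbox. Filters on
"from" and "cc" would follow the same pattern as the list-id check.

```python
import re
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class FeedEntry:
    """One parsed patch from a structured feed (hypothetical schema)."""
    subject: str
    headers: dict                               # lower-cased header name -> value
    paths: list = field(default_factory=list)   # files touched by the diff


def _list_id(value: str) -> str:
    """Return the bare list identifier from a List-Id header value."""
    match = re.search(r"<([^>]+)>", value)
    return (match.group(1) if match else value).strip().lower()


@dataclass
class ProjectFilter:
    """Per-project selection rules; an empty rule matches everything."""
    list_ids: set = field(default_factory=set)
    path_globs: list = field(default_factory=list)
    keywords: list = field(default_factory=list)

    def matches(self, entry: FeedEntry) -> bool:
        if self.list_ids:
            if _list_id(entry.headers.get("list-id", "")) not in self.list_ids:
                return False
        if self.path_globs and not any(
            fnmatchcase(path, glob)
            for path in entry.paths
            for glob in self.path_globs
        ):
            return False
        if self.keywords:
            subject = entry.subject.lower()
            if not any(word.lower() in subject for word in self.keywords):
                return False
        return True


def select_patches(feeds, project_filter):
    """Merge one or more feeds and keep only the entries a project wants."""
    return [e for feed in feeds for e in feed if project_filter.matches(e)]
```

Because select_patches() takes a list of feeds, a project is not tied to
a single mailing list: any number of feeds can be passed in and the same
rules are applied to all of them.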

Basically, parsemail.sh is split into two, where one part does feed
cloning, pulling, and parsing into structured data (if not already
done), and another populates the actual patchwork project with patches
matching the requested parameters.
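
The first half of that split could look roughly like the sketch below.
It checks whether a repository already carries a refs/heads/json branch
and, if not, turns a raw message into a small structured record using
the Python standard library. The record layout, the function names, and
the regular expressions are assumptions made for illustration; this is
far simpler than patchwork's real parser and only handles the plain
"[PATCH vN M/T]" subject prefix.

```python
import re
from email import message_from_string, policy

FEED_REF = "refs/heads/json"   # branch name taken from the proposal above

# Matches e.g. "[PATCH v2 3/5] title"; other prefixes (RFC, RESEND) are ignored.
SUBJECT_RE = re.compile(
    r"^\[PATCH(?:\s+v(?P<version>\d+))?"
    r"(?:\s+(?P<index>\d+)/(?P<total>\d+))?\]\s*(?P<title>.*)$"
)
# Matches trailers such as Signed-off-by, Reviewed-by, Acked-by, Tested-by.
TRAILER_RE = re.compile(
    r"^(?P<key>[A-Za-z][A-Za-z-]*-by):\s*(?P<value>.+)$", re.MULTILINE
)


def has_structured_feed(ls_remote_output: str) -> bool:
    """True if `git ls-remote --heads <url>` output lists the feed branch."""
    for line in ls_remote_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == FEED_REF:
            return True
    return False


def parse_patch(raw: str) -> dict:
    """Parse one raw message into a structured record (hypothetical layout)."""
    msg = message_from_string(raw, policy=policy.default)
    subject = str(msg["subject"] or "")
    match = SUBJECT_RE.match(subject)

    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    # The commit message ends at the first "---" separator line.
    commit_message = text.split("\n---\n", 1)[0]

    return {
        "msgid": str(msg["message-id"] or "").strip("<> "),
        "is_patch": match is not None,
        "title": match.group("title") if match else subject,
        "version": int(match.group("version") or 1) if match else None,
        "index": int(match.group("index") or 1) if match else None,
        "total": int(match.group("total") or 1) if match else None,
        "trailers": TRAILER_RE.findall(commit_message),
    }
```

The string passed to has_structured_feed() would come from running
`git ls-remote --heads` against the upstream repository; keeping the
check as a pure function over that output makes it easy to test without
network access.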

I see the following upsides to this:

- we consume public-inbox feeds directly, no longer losing patches due
   to MTA problems, postfix burps, parse failures, etc.
- a project can have multiple sources for patches instead of being tied 
   to a single mailing list
- downstream patchwork instances (the "local patchwork" tool I mentioned 
   earlier) can benefit from structured feeds provided by 
   patchwork.kernel.org

>Patchwork does expose much of this as an API, for example for patches:
>https://patchwork.ozlabs.org/api/patches/?order=-id so if you want to
>build on that feel free. We can possibly add data to the API if that
>would be helpful. (Patches are always welcome too, if you don't want to
>wait an indeterminate amount of time.)

As I said previously, I may be able to fund development of various 
features, but I want to make sure that I properly work with upstream.  
That requires getting consensus on features to make sure that we don't 
spend funds and efforts on a feature that gets rejected. :)

Would the above feature (using one or more public-inbox repositories as 
sources for a patchwork project) be a welcome addition to upstream?

-K