Workflows Archive on lore.kernel.org
From: Eric Wong <e@80x24.org>
To: workflows@vger.kernel.org, meta@public-inbox.org
Subject: Re: WIP: searching all of lore
Date: Tue, 1 Dec 2020 18:48:14 +0000
Message-ID: <20201201184814.GA32272@dcvr>
In-Reply-To: <20201201140033.gyxmaejay2ddpiz3@nitro.local>

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Nov 26, 2020 at 07:45:43PM +0000, Eric Wong wrote:
> > Requires Tor, for now:
> > 
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> > http://lore.czquwvybam4bgbro.onion/all/
> 
> Thanks for this work, Eric, things are looking good in my tests, though
> I uncovered a bunch of problems with b4 when used with torsocks. :)
> 
> When grabbing t.mbox.gz threads from /all, it appears to properly
> reconstitute follow-ups from multiple mailing lists, correct?

Yup, though some duplicates appear due to different mailing list-added
trailers.  Maybe some of the PublicInbox::Filter::* stuff (currently
only for -mda + -watch) can be applied to the indexing phase to better
dedupe and drop trailers.

> Is there a
> way to "weight" different sources, so that when the same message-id
> exists in multiple places, we can prefer one source over another?

It indexes based on the order it iterates through the inboxes
and messages.  That usually follows the order in the config file,
especially if indexing is delayed.  Of course it's possible a
message can show up in a low-priority source first due to
network latency or outages (something I'm too familiar with :<).
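
A rough sketch of that first-seen-wins behaviour, in Python rather than
the actual Perl.  The inbox names, Message-IDs and the index_all() helper
are all made up for illustration; none of this is a public-inbox API:

```python
def index_all(inboxes):
    """Map each Message-ID to the first inbox that delivered it.

    `inboxes` is a list of (name, [message_id, ...]) pairs in the
    order they are iterated (assumed here to be config-file order).
    """
    seen = {}
    for name, message_ids in inboxes:
        for mid in message_ids:
            # The first source to present a Message-ID wins;
            # later copies of the same Message-ID are ignored.
            seen.setdefault(mid, name)
    return seen


# Hypothetical inboxes, listed in the preferred order.
config_order = [
    ("high-priority", ["a@example", "b@example"]),
    ("low-priority", ["b@example", "c@example"]),
]
result = index_all(config_order)

# If the low-priority source happens to be seen first (latency, outage),
# it claims the shared Message-ID instead.
delayed = index_all(list(reversed(config_order)))
```

Here result maps b@example to "high-priority" while delayed maps it to
"low-priority", which is the ordering problem described above.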

I have an idea to fix that via --reindex which *might*
allow performance improvements on the Xapian side, too.

--reindex is another mind twister when dealing with multiple
histories compared to normal inboxes and will need a new
approach.  Been working on that and my head hurts :x

> For
> example, this is useful when we're trying to do DKIM validation and some
> lists are known to mess that up, while others do the right thing.

Right, though I think it's somewhat less necessary given how sensitive
PublicInbox::ContentHash is compared to just using the Message-ID to
dedupe...

One bad thing about it being too sensitive: yesterday's NNTP speedups
couldn't rely solely on content hashing because of mailing list trailers:

https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
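
To illustrate why a list-added trailer defeats content-based dedupe while
Message-ID dedupe still matches, here is a small Python sketch.  The
content_key() function is a hypothetical stand-in, not the real
PublicInbox::ContentHash (which is Perl and hashes a different set of
fields); the messages and trailer text are invented:

```python
import hashlib


def content_key(subject, body):
    """Hypothetical content hash over subject and body only."""
    h = hashlib.sha256()
    h.update(subject.encode("utf-8"))
    h.update(b"\0")  # separator so field boundaries stay unambiguous
    h.update(body.encode("utf-8"))
    return h.hexdigest()


message_id = "<20201201184814.GA32272@dcvr>"
subject = "Re: WIP: searching all of lore"
body = "Yup, though some duplicates appear.\n"
# Invented footer of the kind some list software appends.
trailer = "--\nTo unsubscribe, send mail to the list server\n"

# Same message as delivered via two lists: (Message-ID, subject, body).
copy_a = (message_id, subject, body)
copy_b = (message_id, subject, body + trailer)

same_by_message_id = copy_a[0] == copy_b[0]
same_by_content = content_key(*copy_a[1:]) == content_key(*copy_b[1:])
```

same_by_message_id is true but same_by_content is false, so the two
copies look like distinct messages to anything keyed on content alone.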

> Thanks again,

You're welcome :>

Thread overview: 12+ messages
2020-11-26 19:45 Eric Wong
2020-11-28 22:34 ` Eric Wong
2020-12-05 20:07   ` Eric Wong
2020-12-08 14:01     ` Konstantin Ryabitsev
2020-12-08 18:02       ` Eric Wong
2020-12-08 18:11         ` Konstantin Ryabitsev
2020-12-01 14:00 ` Konstantin Ryabitsev
2020-12-01 18:48   ` Eric Wong [this message]
2021-03-17  7:11 ` Eric Wong
2021-03-17 13:27   ` Konstantin Ryabitsev
2021-03-17 18:18     ` Eric Wong
2021-03-17 18:37       ` Konstantin Ryabitsev

Archives are clonable:
	git clone --mirror https://lore.kernel.org/workflows/0 workflows/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 workflows workflows/ https://lore.kernel.org/workflows \
		workflows@vger.kernel.org
	public-inbox-index workflows

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.workflows


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git