All of lore.kernel.org
 help / color / mirror / Atom feed
* Generalised bisection
@ 2009-03-09  1:40 Ealdwulf Wuffinga
  2009-03-10  7:08 ` Christian Couder
  0 siblings, 1 reply; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-09  1:40 UTC (permalink / raw)
  To: Git List; +Cc: John Tapsell, Ingo Molnar, Christian Couder

[whoops, mail server does not like html. trying again...]

Hi,

I have developed a generalised bisection algorithm, for the case where
a bug is intermittent. The code may be found at
git://github.com/Ealdwulf/bbchop.git. It should be considered
experimental.

This should cover the use cases requested by Ingo
(http://article.gmane.org/gmane.comp.version-control.git/108565) and
John  (http://article.gmane.org/gmane.comp.version-control.git/112280),
although it does not work in the same way as your proposed solutions -
it is intended to be more general, working when the bug is not
almost-deterministic. It is based on Bayesian Search Theory
(http://en.wikipedia.org/wiki/Bayesian_search_theory, although that
description is a bit simplistic) which is usually used to find
submarines, or people lost on mountains.

To try it out, you need python and mpmath (from
http://code.google.com/p/mpmath/ or your distribution).
Once your have obtained the source, you can immediately run BBChop/source/bbchop
which is the main driver program.

It is not currently integrated into git, although doing so should only
involve minor scriptery.

The simplest way to try it out to is run it in manual mode:


>   bbchop -l 10 -c 0.9

This means, search in a linear history of 10 revisions, numbered 0 to
9, until bbchop thinks it has found
the faulty location with probability at least 0.9. It will start
asking questions:

] Most likely location is 0 (probability 0.100000).
] Please test at location 3.
] Target detected at 3? Y/N/S(kip)

Eventually the search will terminate and BBChop will print out its
conclusion, with evidence, eg:

] Search complete.  Most likely location is 3 (probability 0.919614).
] Number of tests at 3: 3 of which 1 detected
] Number of tests at parents of 3:
]         At 2, 5 of which 0 detected

More information canbe found in the readme file.

Feel free to reply with any comments, questions, etc. If anyone tries
it out on a real bug, please let me know how it goes.

regards,

Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-09  1:40 Generalised bisection Ealdwulf Wuffinga
@ 2009-03-10  7:08 ` Christian Couder
  2009-03-11  8:59   ` Ealdwulf Wuffinga
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Couder @ 2009-03-10  7:08 UTC (permalink / raw)
  To: Ealdwulf Wuffinga; +Cc: Git List, John Tapsell, Ingo Molnar

Le lundi 9 mars 2009, Ealdwulf Wuffinga a écrit :
> [whoops, mail server does not like html. trying again...]
>
> Hi,
>
> I have developed a generalised bisection algorithm, for the case where
> a bug is intermittent. The code may be found at
> git://github.com/Ealdwulf/bbchop.git. It should be considered
> experimental.

I will try to have a look at the end of this week.
But do you want it to be integrated with Git or do you want it to be an 
independant project that works with many different version control system?

Best regards,
Christian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-10  7:08 ` Christian Couder
@ 2009-03-11  8:59   ` Ealdwulf Wuffinga
  2009-03-11  9:35     ` John Tapsell
  0 siblings, 1 reply; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-11  8:59 UTC (permalink / raw)
  To: Christian Couder; +Cc: Git List, John Tapsell, Ingo Molnar

On Tue, Mar 10, 2009 at 7:08 AM, Christian Couder
<chriscool@tuxfamily.org> wrote:

> I will try to have a look at the end of this week.
> But do you want it to be integrated with Git or do you want it to be an
> independant project that works with many different version control system?

Hmm. Whatever works, I guess. On the one hand the code does seem
naturally generic. On the other hand, it's good if users don't
have to separately obtain an extra package to use it. Supposing that
the algorithm proves useful, would the git project  be okay with an
extra dependency, or would you want to integrate it? Right now it's in
python, which I understand is an obstacle to integration.

In the short term, I assume the algorithm needs to prove its
usefulness before either being integrated or added as a dependency.
It seems the majority of potential users - developers of code of the
sort likely to have intermittent bugs, such as the kernel or xorg -
use git,
so I would like it to be as easy as possible for git users to try it
out. Maybe it could live in the contrib directory for a while?

Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11  8:59   ` Ealdwulf Wuffinga
@ 2009-03-11  9:35     ` John Tapsell
  2009-03-11 12:05       ` Johannes Schindelin
  2009-03-11 22:15       ` Ealdwulf Wuffinga
  0 siblings, 2 replies; 25+ messages in thread
From: John Tapsell @ 2009-03-11  9:35 UTC (permalink / raw)
  To: Ealdwulf Wuffinga; +Cc: Christian Couder, Git List, Ingo Molnar

2009/3/11 Ealdwulf Wuffinga <ealdwulf@googlemail.com>:
> On Tue, Mar 10, 2009 at 7:08 AM, Christian Couder
> <chriscool@tuxfamily.org> wrote:
>
>> I will try to have a look at the end of this week.
>> But do you want it to be integrated with Git or do you want it to be an
>> independant project that works with many different version control system?
>
> Hmm. Whatever works, I guess. On the one hand the code does seem
> naturally generic. On the other hand, it's good if users don't
> have to separately obtain an extra package to use it. Supposing that
> the algorithm proves useful, would the git project  be okay with an
> extra dependency, or would you want to integrate it? Right now it's in
> python, which I understand is an obstacle to integration.

There used to be a dependency on python.  git-merge-recursive for
example, before it was converted to C.

mpmath might be the more annoying dependency - what functions do you
use from it?  Could they trivially be reimplemented?

John Tapsell

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11  9:35     ` John Tapsell
@ 2009-03-11 12:05       ` Johannes Schindelin
  2009-03-11 12:08         ` John Tapsell
  2009-03-11 22:15       ` Ealdwulf Wuffinga
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2009-03-11 12:05 UTC (permalink / raw)
  To: John Tapsell; +Cc: Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1161 bytes --]

Hi,

On Wed, 11 Mar 2009, John Tapsell wrote:

> 2009/3/11 Ealdwulf Wuffinga <ealdwulf@googlemail.com>:
> > On Tue, Mar 10, 2009 at 7:08 AM, Christian Couder
> > <chriscool@tuxfamily.org> wrote:
> >
> >> I will try to have a look at the end of this week.
> >> But do you want it to be integrated with Git or do you want it to be an
> >> independant project that works with many different version control system?
> >
> > Hmm. Whatever works, I guess. On the one hand the code does seem
> > naturally generic. On the other hand, it's good if users don't
> > have to separately obtain an extra package to use it. Supposing that
> > the algorithm proves useful, would the git project  be okay with an
> > extra dependency, or would you want to integrate it? Right now it's in
> > python, which I understand is an obstacle to integration.
> 
> There used to be a dependency on python.  git-merge-recursive for
> example, before it was converted to C.

Not "for example".  It was the only dependency of git.git on Python, and 
the rewrite of merge-recursive was only done to break that dependency, as 
I had a platform where I could not install Python.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11 12:05       ` Johannes Schindelin
@ 2009-03-11 12:08         ` John Tapsell
  2009-03-11 13:04           ` Johannes Schindelin
  0 siblings, 1 reply; 25+ messages in thread
From: John Tapsell @ 2009-03-11 12:08 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

2009/3/11 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
> Hi,
>
> On Wed, 11 Mar 2009, John Tapsell wrote:
>
>> 2009/3/11 Ealdwulf Wuffinga <ealdwulf@googlemail.com>:
>> > On Tue, Mar 10, 2009 at 7:08 AM, Christian Couder
>> > <chriscool@tuxfamily.org> wrote:
>> >
>> >> I will try to have a look at the end of this week.
>> >> But do you want it to be integrated with Git or do you want it to be an
>> >> independant project that works with many different version control system?
>> >
>> > Hmm. Whatever works, I guess. On the one hand the code does seem
>> > naturally generic. On the other hand, it's good if users don't
>> > have to separately obtain an extra package to use it. Supposing that
>> > the algorithm proves useful, would the git project  be okay with an
>> > extra dependency, or would you want to integrate it? Right now it's in
>> > python, which I understand is an obstacle to integration.
>>
>> There used to be a dependency on python.  git-merge-recursive for
>> example, before it was converted to C.
>
> Not "for example".  It was the only dependency of git.git on Python, and
> the rewrite of merge-recursive was only done to break that dependency, as
> I had a platform where I could not install Python.

But installing perl was no problem?  (Just curious)

John

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11 12:08         ` John Tapsell
@ 2009-03-11 13:04           ` Johannes Schindelin
  2009-03-11 13:24             ` John Tapsell
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2009-03-11 13:04 UTC (permalink / raw)
  To: John Tapsell; +Cc: Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1220 bytes --]

Hi,

On Wed, 11 Mar 2009, John Tapsell wrote:

> 2009/3/11 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
>
> > On Wed, 11 Mar 2009, John Tapsell wrote:
> >
> >> There used to be a dependency on python.  git-merge-recursive for 
> >> example, before it was converted to C.
> >
> > Not "for example".  It was the only dependency of git.git on Python, 
> > and the rewrite of merge-recursive was only done to break that 
> > dependency, as I had a platform where I could not install Python.
> 
> But installing perl was no problem?  (Just curious)

Perl was installed, albeit in an ancient version, and compiling Perl 
modules written in C was out.  It just did not work.

But the good part was: after converting rerere to be a builtin, there 
were no Perl scripts left that I wanted/needed to use.

These days, we only have these Perl scripts left: add--interactive, 
archimport, cvsexportcommit, cvsimport, cvsserver, relink, send-email and 
svn.

Ignoring the scripts to interact with other SCMs (which I can do on 
another computer), that leaves add--interactive, relink (which could be 
moved to contrib/ AFAIAC) and send-email.

I use "add -e" instead of "add -i", and stay away from send-email...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11 13:04           ` Johannes Schindelin
@ 2009-03-11 13:24             ` John Tapsell
  2009-03-11 22:14               ` Ealdwulf Wuffinga
  0 siblings, 1 reply; 25+ messages in thread
From: John Tapsell @ 2009-03-11 13:24 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

2009/3/11 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
> Hi,
>
> On Wed, 11 Mar 2009, John Tapsell wrote:
>
>> 2009/3/11 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
>>
>> > On Wed, 11 Mar 2009, John Tapsell wrote:
>> >
>> >> There used to be a dependency on python.  git-merge-recursive for
>> >> example, before it was converted to C.
>> >
>> > Not "for example".  It was the only dependency of git.git on Python,
>> > and the rewrite of merge-recursive was only done to break that
>> > dependency, as I had a platform where I could not install Python.
>>
>> But installing perl was no problem?  (Just curious)
>
> Perl was installed, albeit in an ancient version, and compiling Perl
> modules written in C was out.  It just did not work.

I wonder if it would then be acceptable to have a python script for
this generalised bisect?  Since it's not core functionality.   Not
quite sure how it would fit in though (I'd rather it was called from
"git bisect" rather than adding another separate git command)

John

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11 13:24             ` John Tapsell
@ 2009-03-11 22:14               ` Ealdwulf Wuffinga
  0 siblings, 0 replies; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-11 22:14 UTC (permalink / raw)
  To: John Tapsell; +Cc: Johannes Schindelin, Christian Couder, Git List, Ingo Molnar

[John will get this twice, sorry]

On Wed, Mar 11, 2009 at 1:24 PM, John Tapsell <johnflux@gmail.com> wrote:

>   Not
> quite sure how it would fit in though (I'd rather it was called from
> "git bisect" rather than adding another separate git command)

I guess the most obvious route would be to add an option to 'git bisect start'
 to specify that it should be used instead of the usual algorithm.

Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11  9:35     ` John Tapsell
  2009-03-11 12:05       ` Johannes Schindelin
@ 2009-03-11 22:15       ` Ealdwulf Wuffinga
  2009-03-12  6:45         ` John Tapsell
  1 sibling, 1 reply; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-11 22:15 UTC (permalink / raw)
  To: John Tapsell; +Cc: Christian Couder, Git List, Ingo Molnar

[John will get this twice, sorry; not used to this mail interface yet.]

On Wed, Mar 11, 2009 at 9:35 AM, John Tapsell <johnflux@gmail.com> wrote:

> mpmath might be the more annoying dependency - what functions do you
> use from it?  Could they trivially be reimplemented?

What I use is the multiprecision floating point number class. doubles
don't seem to be long enough.
The reason for using mpmath rather than the more  widespread GMP (and
its python wrapper gmpy) is that the latter only supports
integer powers, whereas BBChop needs fractional powers.

So, it might be possible to switch to gmpy,  or some other widespread
library,  by implementing a pow() which supports fractional powers.
I think I only use the normal arithmetic operators, log, and pow, so
in principle those could be reimplemented, to eliminate the dependency
altogether.
It seems a little bit of a waste of time, though.

Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-11 22:15       ` Ealdwulf Wuffinga
@ 2009-03-12  6:45         ` John Tapsell
  2009-03-12 10:55           ` Johannes Schindelin
  2009-03-13  9:58           ` Ealdwulf Wuffinga
  0 siblings, 2 replies; 25+ messages in thread
From: John Tapsell @ 2009-03-12  6:45 UTC (permalink / raw)
  To: Ealdwulf Wuffinga; +Cc: Christian Couder, Git List, Ingo Molnar

2009/3/11 Ealdwulf Wuffinga <ealdwulf@googlemail.com>:
> [John will get this twice, sorry; not used to this mail interface yet.]
>
> On Wed, Mar 11, 2009 at 9:35 AM, John Tapsell <johnflux@gmail.com> wrote:
>
>> mpmath might be the more annoying dependency - what functions do you
>> use from it?  Could they trivially be reimplemented?
>
> What I use is the multiprecision floating point number class. doubles
> don't seem to be long enough.

Hmm, really really?  Sometimes this sort of thing can be fixed by just
readjusting the formulas.  What formulas are you using that require
more precision than doubles?

> The reason for using mpmath rather than the more  widespread GMP (and
> its python wrapper gmpy) is that the latter only supports
> integer powers, whereas BBChop needs fractional powers.
>
> So, it might be possible to switch to gmpy,  or some other widespread
> library,  by implementing a pow() which supports fractional powers.
> I think I only use the normal arithmetic operators, log, and pow, so
> in principle those could be reimplemented, to eliminate the dependency
> altogether.
> It seems a little bit of a waste of time, though.

A little bit of math trickery helps here :-)

y =  x^b

log(y) = log(x^b) = b * log(x)
e^log(y) = e^(b log(x))

y = exp(b * log(x))

So as long as you have 'exp' and 'log' functions, you can raise x to
the power of b, even if b is fractional.

Just to prove it, square root of 2 is:

$ echo "e(0.5*l(2))" | bc -l
1.41421356237309504878

John

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-12  6:45         ` John Tapsell
@ 2009-03-12 10:55           ` Johannes Schindelin
  2009-03-12 18:02             ` Steven Tweed
  2009-03-13  9:58           ` Ealdwulf Wuffinga
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2009-03-12 10:55 UTC (permalink / raw)
  To: John Tapsell; +Cc: Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

[-- Attachment #1: Type: TEXT/PLAIN, Size: 803 bytes --]

Hi,

On Thu, 12 Mar 2009, John Tapsell wrote:

> 2009/3/11 Ealdwulf Wuffinga <ealdwulf@googlemail.com>:
> > [John will get this twice, sorry; not used to this mail interface yet.]
> >
> > On Wed, Mar 11, 2009 at 9:35 AM, John Tapsell <johnflux@gmail.com> wrote:
> >
> >> mpmath might be the more annoying dependency - what functions do you
> >> use from it?  Could they trivially be reimplemented?
> >
> > What I use is the multiprecision floating point number class. doubles
> > don't seem to be long enough.
> 
> Hmm, really really?  Sometimes this sort of thing can be fixed by just 
> readjusting the formulas.  What formulas are you using that require more 
> precision than doubles?

Maybe you could post the formulae instead of forcing people to deduct them 
from the source code?

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-12 10:55           ` Johannes Schindelin
@ 2009-03-12 18:02             ` Steven Tweed
  2009-03-13 10:00               ` Ealdwulf Wuffinga
  2009-03-13 12:49               ` Ealdwulf Wuffinga
  0 siblings, 2 replies; 25+ messages in thread
From: Steven Tweed @ 2009-03-12 18:02 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: John Tapsell, Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

On Thu, Mar 12, 2009 at 10:55 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> On Thu, 12 Mar 2009, John Tapsell wrote:
> > 2009/3/11 Ealdwulf Wuffinga <ealdwulf@googlemail.com>:
> > > What I use is the multiprecision floating point number class. doubles
> > > don't seem to be long enough.
> >
> > Hmm, really really?  Sometimes this sort of thing can be fixed by just
> > readjusting the formulas.  What formulas are you using that require more
> > precision than doubles?
>
> Maybe you could post the formulae instead of forcing people to deduct them
> from the source code?

I haven't even looked at the source code so a description of the
mathematical algorithm would help, but I'll just point out that
underflow (in the case of working with probabilities) and overflow
(when working with their negated logarithms) is inherent in most
multi-step Bayesian algorithms. The only solution is to rescale things
as you go so that things stay in a "computable" range. (You're almost
never interested in absolute probabilities anyway but rather relative
probabilities or, in extreme cases, just the biggest probability, so
rescaling isn't losing any useful information.)

cheers,
dave tweed

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-12  6:45         ` John Tapsell
  2009-03-12 10:55           ` Johannes Schindelin
@ 2009-03-13  9:58           ` Ealdwulf Wuffinga
  2009-03-13 10:55             ` Johannes Schindelin
  1 sibling, 1 reply; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-13  9:58 UTC (permalink / raw)
  To: John Tapsell; +Cc: Christian Couder, Git List, Ingo Molnar

On Thu, Mar 12, 2009 at 6:45 AM, John Tapsell <johnflux@gmail.com> wrote:
> 2009/3/11 Ealdwulf Wuffinga <ealdwulf@googlemail.com>:
>> On Wed, Mar 11, 2009 at 9:35 AM, John Tapsell <johnflux@gmail.com> wrote:
>> What I use is the multiprecision floating point number class. doubles
>> don't seem to be long enough.
>
> Hmm, really really?  Sometimes this sort of thing can be fixed by just
> readjusting the formulas.  What formulas are you using that require
> more precision than doubles?

I'll have to reply to this later when I have more time. However, there
is a (rather verbose)
file in the  doc directory which describes them - in texmacs format,
but I've just uploaded
a pdf version as well. It is BayesianSearch_Debugging.pdf. The
description of this code starts in
section 2.2 (since I wrote that, I have generalised it to the DAG case
as in git).


> A little bit of math trickery helps here :-)
>
> y =  x^b
>
> log(y) = log(x^b) = b * log(x)
> e^log(y) = e^(b log(x))
>
> y = exp(b * log(x))
>
> So as long as you have 'exp' and 'log' functions, you can raise x to
> the power of b, even if b is fractional.

Sadly gmp does not have log or exp. mpfr does, but it does not have a python
interface.

Alex

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-12 18:02             ` Steven Tweed
@ 2009-03-13 10:00               ` Ealdwulf Wuffinga
  2009-03-13 12:49               ` Ealdwulf Wuffinga
  1 sibling, 0 replies; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-13 10:00 UTC (permalink / raw)
  To: Steven Tweed
  Cc: Johannes Schindelin, John Tapsell, Christian Couder, Git List

On Thu, Mar 12, 2009 at 6:02 PM, Steven Tweed <orthochronous@gmail.com> wrote:

> I haven't even looked at the source code so a description of the
> mathematical algorithm would help, but I'll just point out that
> underflow (in the case of working with probabilities) and overflow
> (when working with their negated logarithms) is inherent in most
> multi-step Bayesian algorithms. The only solution is to rescale things
> as you go so that things stay in a "computable" range. (You're almost
> never interested in absolute probabilities anyway but rather relative
> probabilities or, in extreme cases, just the biggest probability, so
> rescaling isn't losing any useful information.)

Hmm, I'll have to think about that one.

Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-13  9:58           ` Ealdwulf Wuffinga
@ 2009-03-13 10:55             ` Johannes Schindelin
  2009-03-13 12:42               ` John Tapsell
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2009-03-13 10:55 UTC (permalink / raw)
  To: Ealdwulf Wuffinga; +Cc: John Tapsell, Christian Couder, Git List, Ingo Molnar

Hi,

On Fri, 13 Mar 2009, Ealdwulf Wuffinga wrote:

> It is BayesianSearch_Debugging.pdf.

Now I'll only need a bayesian-search-for-original-url-in-mails.py.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-13 10:55             ` Johannes Schindelin
@ 2009-03-13 12:42               ` John Tapsell
  2009-03-13 13:56                 ` Johannes Schindelin
  0 siblings, 1 reply; 25+ messages in thread
From: John Tapsell @ 2009-03-13 12:42 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

2009/3/13 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
> Hi,
>
> On Fri, 13 Mar 2009, Ealdwulf Wuffinga wrote:
>
>> It is BayesianSearch_Debugging.pdf.
>
> Now I'll only need a bayesian-search-for-original-url-in-mails.py.

I managed to work it out:

http://github.com/Ealdwulf/bbchop.git/BBChop/doc/BayesianSearch_Debugging.pdf

John

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-12 18:02             ` Steven Tweed
  2009-03-13 10:00               ` Ealdwulf Wuffinga
@ 2009-03-13 12:49               ` Ealdwulf Wuffinga
  2009-03-13 15:19                 ` Steven Tweed
  1 sibling, 1 reply; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-13 12:49 UTC (permalink / raw)
  To: Steven Tweed
  Cc: Johannes Schindelin, John Tapsell, Christian Couder, Git List

On Thu, Mar 12, 2009 at 6:02 PM, Steven Tweed <orthochronous@gmail.com> wrote:

> I haven't even looked at the source code so a description of the
> mathematical algorithm would help, but I'll just point out that
> underflow (in the case of working with probabilities) and overflow
> (when working with their negated logarithms) is inherent in most
> multi-step Bayesian algorithms. The only solution is to rescale things
> as you go so that things stay in a "computable" range. (You're almost
> never interested in absolute probabilities anyway but rather relative
> probabilities or, in extreme cases, just the biggest probability, so
> rescaling isn't losing any useful information.)

Are you sure you aren't thinking of when you are using fixed point? I
was under the impression
that Bayesian algorithms usually worked okay in floating point.

One issue in BBChop which should be easy to fix, is that I use a dumb
way of calculating Beta functions. These
are ratios of factorials, so the subexpressions get stupidly big very
quickly. But I don't think that is the only problem.


Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-13 12:42               ` John Tapsell
@ 2009-03-13 13:56                 ` Johannes Schindelin
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Schindelin @ 2009-03-13 13:56 UTC (permalink / raw)
  To: John Tapsell; +Cc: Ealdwulf Wuffinga, Christian Couder, Git List, Ingo Molnar

Hi,

On Fri, 13 Mar 2009, John Tapsell wrote:

> 2009/3/13 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
>
> > On Fri, 13 Mar 2009, Ealdwulf Wuffinga wrote:
> >
> >> It is BayesianSearch_Debugging.pdf.
> >
> > Now I'll only need a bayesian-search-for-original-url-in-mails.py.
> 
> I managed to work it out:
> 
> http://github.com/Ealdwulf/bbchop.git/BBChop/doc/BayesianSearch_Debugging.pdf

Thanks, very much appreciated,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-13 12:49               ` Ealdwulf Wuffinga
@ 2009-03-13 15:19                 ` Steven Tweed
  2009-03-15 19:16                   ` Ealdwulf Wuffinga
  0 siblings, 1 reply; 25+ messages in thread
From: Steven Tweed @ 2009-03-13 15:19 UTC (permalink / raw)
  To: Ealdwulf Wuffinga
  Cc: Johannes Schindelin, John Tapsell, Christian Couder, Git List

On Fri, Mar 13, 2009 at 12:49 PM, Ealdwulf Wuffinga
<ealdwulf@googlemail.com> wrote:
> On Thu, Mar 12, 2009 at 6:02 PM, Steven Tweed <orthochronous@gmail.com> wrote:
>> I haven't even looked at the source code so a description of the
>> mathematical algorithm would help, but I'll just point out that
>> underflow (in the case of working with probabilities) and overflow
>> (when working with their negated logarithms) is inherent in most
>> multi-step Bayesian algorithms. The only solution is to rescale things
>> as you go so that things stay in a "computable" range. (You're almost
>> never interested in absolute probabilities anyway but rather relative
>> probabilities or, in extreme cases, just the biggest probability, so
>> rescaling isn't losing any useful information.)
>
> Are you sure you aren't thinking of when you are using fixed point? I
> was under the impression
> that Bayesian algorithms usually worked okay in floating point.

Underflow when using probabilities and lack of precision (rather than
overflow) when using negated logarithms are well known problems in the
kind of probabilistic object tracking, inference in graphical networks
and object identification processes I work with (in computer vision).
I there may well be other areas of Bayesian decision theory where this
doesn't happen, and indeed a _very_ quick scan through your document
suggests that you're adding to tallying information on each timestep
and recalcuating the entire model from those tallys, which is one of
the few cases where you can't really do rescaling. I'll try and have a
more detailled read over the weekend.

> One issue in BBChop which should be easy to fix, is that I use a dumb
> way of calculating Beta functions. These
> are ratios of factorials, so the subexpressions get stupidly big very
> quickly. But I don't think that is the only problem.

Yes, "Numerical Recipes" seems to suggest that computing with
log-factorials and exponentiating works reasonably, although I've
never tried it and NR does occasionally get things completely wrong...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-13 15:19                 ` Steven Tweed
@ 2009-03-15 19:16                   ` Ealdwulf Wuffinga
  2009-03-16 10:29                     ` Steven Tweed
  0 siblings, 1 reply; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-15 19:16 UTC (permalink / raw)
  To: Steven Tweed
  Cc: Johannes Schindelin, John Tapsell, Christian Couder, Git List

On Fri, Mar 13, 2009 at 3:19 PM, Steven Tweed <orthochronous@gmail.com> wrote:

> Underflow when using probabilities and lack of precision (rather than
> overflow) when using negated logarithms are well known problems in the
> kind of probabilistic object tracking, inference in graphical networks
> and object identification processes I work with (in computer vision).
> I there may well be other areas of Bayesian decision theory where this
> doesn't happen, and indeed a _very_ quick scan through your document
> suggests that you're adding to tallying information on each timestep
> and recalcuating the entire model from those tallys, which is one of
> the few cases where you can't really do rescaling. I'll try and have a
> more detailled read over the weekend.

That is useful information, thanks.

It is not obvious how to perform this algorithm incrementally, because
of the need to
marginalise out the fault rate. As I understand it, marginalisation
has to be done after you
have incorporated all your information into the model, which means we
can't use the
usual bayesian updating.

> On Fri, Mar 13, 2009 at 12:49 PM, Ealdwulf Wuffinga
> <ealdwulf@googlemail.com> wrote:
>> One issue in BBChop which should be easy to fix, is that I use a dumb
>> way of calculating Beta functions. These
>> are ratios of factorials, so the subexpressions get stupidly big very
>> quickly. But I don't think that is the only problem.
>
> Yes, "Numerical Recipes" seems to suggest that computing with
> log-factorials and exponentiating works reasonably, although I've
> never tried it and NR does occasionally get things completely wrong...

I have implemented this and it does indeed allow the program to work
in more cases
without underflow, with ordinary floating point. However, I think the
underflow can still occur
in plausible use cases.

The problem is still the Beta function. In bbchop it is always passed
D and T where D is
the sum of the number of detecting observations in some of the
revisions, and T is the
same for nondetecting observations. Beta(x,y) underflows a python float
if both x and y are > ~550, and also in other cases when one is
smaller and the other,
larger. BBChop never looks again at a revision if the bug has been
observed there, but if
there are a large number of revisions, it might look at enough of them
to cause a problem.

Obviously no-one is going to manually do hundreds of observations, but
 I want BBChop
to work in the case where someone runs it on a machine in the corner
for a few days,
or even weeks,  to track down a bug which occurs too infrequently to
bisect manually.

Which means I'm still stuck with mpmath, or some equivalent.

Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-15 19:16                   ` Ealdwulf Wuffinga
@ 2009-03-16 10:29                     ` Steven Tweed
  2009-03-16 10:37                       ` John Tapsell
  2009-03-16 22:08                       ` Ealdwulf Wuffinga
  0 siblings, 2 replies; 25+ messages in thread
From: Steven Tweed @ 2009-03-16 10:29 UTC (permalink / raw)
  To: Ealdwulf Wuffinga
  Cc: Johannes Schindelin, John Tapsell, Christian Couder, Git List

On Sun, Mar 15, 2009 at 7:16 PM, Ealdwulf Wuffinga
<ealdwulf@googlemail.com> wrote:
> On Fri, Mar 13, 2009 at 3:19 PM, Steven Tweed <orthochronous@gmail.com> wrote:
> It is not obvious how to perform this algorithm incrementally, because
> of the need to
> marginalise out the fault rate. As I understand it, marginalisation
> has to be done after you
> have incorporated all your information into the model, which means we
> can't use the
> usual bayesian updating.

I had a look over the weekend, and got a bit sidetracked on one of
your assumptions. You seem to be assuming that the bug is such that
observing a single positive observation of the symptom at a position i
in the linear history _does not_ completely rule out that the guilty
commit occurs after that point. I would have thought the generally
more applicable assumption is that, given that generally you don't
have a bug ridden system where more than one bug causes the same
symptom _within the history of interest_, that a single observation of
the symptom does totally rule out the bug after that point (whilst
intermittency clearly not having observed the bug before that point
doesn't completely rule out the guilty commit being earlier, although
it should increase the liklihood estimate of the bug being later).

I wonder what your thoughts are on this? (I started formulating a
model over the weekend, but work is a bit hectic so I may not get to
write it up in LaTeX very quickly.)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-16 10:29                     ` Steven Tweed
@ 2009-03-16 10:37                       ` John Tapsell
  2009-03-16 22:47                         ` Ealdwulf Wuffinga
  2009-03-16 22:08                       ` Ealdwulf Wuffinga
  1 sibling, 1 reply; 25+ messages in thread
From: John Tapsell @ 2009-03-16 10:37 UTC (permalink / raw)
  To: Steven Tweed
  Cc: Ealdwulf Wuffinga, Johannes Schindelin, Christian Couder, Git List

2009/3/16 Steven Tweed <orthochronous@gmail.com>:
> On Sun, Mar 15, 2009 at 7:16 PM, Ealdwulf Wuffinga
> <ealdwulf@googlemail.com> wrote:
>> On Fri, Mar 13, 2009 at 3:19 PM, Steven Tweed <orthochronous@gmail.com> wrote:
>> It is not obvious how to perform this algorithm incrementally, because
>> of the need to
>> marginalise out the fault rate. As I understand it, marginalisation
>> has to be done after you
>> have incorporated all your information into the model, which means we
>> can't use the
>> usual bayesian updating.
>
> I had a look over the weekend, and got a bit sidetracked on one of
> your assumptions. You seem to be assuming that the bug is such that
> observing a single positive observation of the symptom at a position i
> in the linear history _does not_ completely rule out that the guilty
> commit occurs after that point. I would have thought the generally
> more applicable assumption is that, given that generally you don't
> have a bug ridden system where more than one bug causes the same
> symptom _within the history of interest_, that a single observation of
> the symptom does totally rule out the bug after that point (whilst
> intermittency clearly not having observed the bug before that point
> doesn't completely rule out the guilty commit being earlier, although
> it should increase the liklihood estimate of the bug being later).

I think it's reasonable to expect false-positives as well as
false-negatives.  e.g. you're looking for a commit that slows down the
frame rate.  But on one of the good commits the hard disk hits a bad
sector and takes a bit longer to retrieve data and so you get a
false-positive.

It's a bit contrived, but I'm sure you can think of better example

John

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-16 10:29                     ` Steven Tweed
  2009-03-16 10:37                       ` John Tapsell
@ 2009-03-16 22:08                       ` Ealdwulf Wuffinga
  1 sibling, 0 replies; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-16 22:08 UTC (permalink / raw)
  To: Steven Tweed
  Cc: Johannes Schindelin, John Tapsell, Christian Couder, Git List

On Mon, Mar 16, 2009 at 10:29 AM, Steven Tweed <orthochronous@gmail.com> wrote:

> I had a look over the weekend, and got a bit sidetracked on one of
> your assumptions. You seem to be assuming that the bug is such that
> observing a single positive observation of the symptom at a position i
> in the linear history _does not_ completely rule out that the guilty
> commit occurs after that point. I would have thought the generally
> more applicable assumption is that, given that generally you don't
> have a bug ridden system where more than one bug causes the same
> symptom _within the history of interest_, that a single observation of
> the symptom does totally rule out the bug after that point (whilst
> intermittency clearly not having observed the bug before that point
> doesn't completely rule out the guilty commit being earlier, although
> it should increase the liklihood estimate of the bug being later).

I must have been unclear somewhere, because I do indeed assume that
a single observation of the symptom rules out the origin of the bug being
later than that observation.

 If you have a play with the software in manual mode, you can see that its
behaviour reflects this  - it starts out doing something like an
ordinary binary search, albeit skewed towards older revisions (it
seems to go for
a roughly 1:2 division, rather than the usual 1:1). Then once it has got to the
point where a deterministic search would end, it hammers away at the parent(s)
of the newest revision where the fault was observed, until it is satisfied that
it cannot be found there. All this just emerges from the generic least-entropy
algorithm; I didn't know what it would do until I got it running.

> I wonder what your thoughts are on this? (I started formulating a
> model over the weekend, but work is a bit hectic so I may not get to
> write it up in LaTeX very quickly.)

It will be interesting to see whether your model turns out to be different from
mine. I'm only doing this in my spare time, so I'm in no hurry.

Ealdwulf

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Generalised bisection
  2009-03-16 10:37                       ` John Tapsell
@ 2009-03-16 22:47                         ` Ealdwulf Wuffinga
  0 siblings, 0 replies; 25+ messages in thread
From: Ealdwulf Wuffinga @ 2009-03-16 22:47 UTC (permalink / raw)
  To: John Tapsell
  Cc: Steven Tweed, Johannes Schindelin, Christian Couder, Git List

On Mon, Mar 16, 2009 at 10:37 AM, John Tapsell <johnflux@gmail.com> wrote:

> I think it's reasonable to expect false-positives as well as
> false-negatives.  e.g. you're looking for a commit that slows down the
> frame rate.  But on one of the good commits the hard disk hits a bad
> sector and takes a bit longer to retrieve data and so you get a
> false-positive.

It's true that you could get false positives, as you say. What's less
obvious to me is whether it would be a good idea for the algorithm to try
to deal with them, or just report the earliest revision that failed
and leave it
up to the intelligence of the user to decide whether it is a false positive,
and what to do about it.

In the absence of some user-provided way of  discriminating, the only way
I can see for the algorithm can distinguish between revisions affected by
 the real bug as opposed to ones affected by false positives is to
assume that
the false positives occur at some lower rate. There are two
difficulties with this:
first, presumably it would have to start sampling more times at some locations
in order to figure out what the rate is at them. This sounds like it
would be expensive -
currently the algorithm can usually get away with looking at most locations no
more than once.  Secondly, I'm not sure how to justify this
assumption, or model it.
In short, false positives look like a can of worms to me; I'm hoping
the algorithm is
useful without considering them.

The algorithm actually has one potentially problematic assumptions
already - or rather,
it has two alternative assumptions, neither of which is completely believable.
It can either assume that the bug causes faults at the same rate in
all affected revisions,
or that the affected revisions each have their own completely
independent rate. Originally
I thought that the latter would be the more conservative assumption -
it certainly assumes less.
However, the following argument convinces me that the other one is
actually more conservative:

Suppose that in the latest revision, we observe a fault in one run out
of ten. Under the second
assumption, this observed rate has no effect on our belief about the
fault rate in other affected
revisions, if any. This means that with a uniform prior on the fault
rate, we more or less start out
assuming a fault rate of 50% on any other affected revisions - much
higher, implausably so.
 If any of them  are only affected at a rate of one in ten, the
algorithm is quite likely to terminate without
seeing a fault there, concluding that the bug was introduced later
than it really was.

On the other hand, we know quite well that the fault rate isn't
necessarily going to  be
identical either.  Of the two, I think the assumption of identical
rates is the more practical one, more
likely to actually identify the correct location. It does leave me
wondering whether some intermediate
assumption would more accurately represent our experience of fault
rates, but I haven't thought of a
really convincing one.

Ealdwulf.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2009-03-16 22:49 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-09  1:40 Generalised bisection Ealdwulf Wuffinga
2009-03-10  7:08 ` Christian Couder
2009-03-11  8:59   ` Ealdwulf Wuffinga
2009-03-11  9:35     ` John Tapsell
2009-03-11 12:05       ` Johannes Schindelin
2009-03-11 12:08         ` John Tapsell
2009-03-11 13:04           ` Johannes Schindelin
2009-03-11 13:24             ` John Tapsell
2009-03-11 22:14               ` Ealdwulf Wuffinga
2009-03-11 22:15       ` Ealdwulf Wuffinga
2009-03-12  6:45         ` John Tapsell
2009-03-12 10:55           ` Johannes Schindelin
2009-03-12 18:02             ` Steven Tweed
2009-03-13 10:00               ` Ealdwulf Wuffinga
2009-03-13 12:49               ` Ealdwulf Wuffinga
2009-03-13 15:19                 ` Steven Tweed
2009-03-15 19:16                   ` Ealdwulf Wuffinga
2009-03-16 10:29                     ` Steven Tweed
2009-03-16 10:37                       ` John Tapsell
2009-03-16 22:47                         ` Ealdwulf Wuffinga
2009-03-16 22:08                       ` Ealdwulf Wuffinga
2009-03-13  9:58           ` Ealdwulf Wuffinga
2009-03-13 10:55             ` Johannes Schindelin
2009-03-13 12:42               ` John Tapsell
2009-03-13 13:56                 ` Johannes Schindelin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.