Dear all, I am still testing the new statistics module and I found two cases where the behavior of the module seems suboptimal to me. My most important concern is the module's internal _sum function and its implications; the other is about passing Counter objects to module functions.
As for the first subject: Specifically, I am not happy with the way the function handles different types. Currently _coerce_types gets called for every element in the function's input sequence and type conversion follows quite complicated rules, and - what is worst - makes the outcome of _sum() and thereby mean() dependent on the order of items in the input sequence, e.g.:
mean((1,Fraction(2,3),1.0,Decimal(2.3),2.0, Decimal(5)))
1.9944444444444445
mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))
  File "C:\Python33\statistics.py", line 369, in mean
    return _sum(data)/n
  File "C:\Python33\statistics.py", line 157, in _sum
    T = _coerce_types(T, type(x))
  File "C:\Python33\statistics.py", line 327, in _coerce_types
    raise TypeError('cannot coerce types %r and %r' % (T1, T2))
TypeError: cannot coerce types <class 'fractions.Fraction'> and <class 'decimal.Decimal'>
(This is because, when _sum iterates over the input, Fraction wins over int, then float wins over Fraction and over everything else that follows in the first example. In the second case Fraction wins over int, but then Fraction vs Decimal is undefined and raises an error.)
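The order dependence described above can be reproduced with a toy model. The sketch below is not the module's real _coerce_types; `coerce_pair` and `running_type` are hypothetical names, and the rule set is reduced to the three behaviours mentioned in the explanation (int yields to anything, float absorbs anything, everything else is an error).

```python
from decimal import Decimal
from fractions import Fraction
from functools import reduce


def coerce_pair(t1, t2):
    """Toy pairwise rule set (hypothetical, not the stdlib implementation)."""
    if t1 is t2:
        return t1
    if t1 is int:
        return t2
    if t2 is int:
        return t1
    if float in (t1, t2):
        return float
    raise TypeError('cannot coerce types %r and %r' % (t1, t2))


def running_type(values):
    # Fold the pairwise rule over the item types from left to right,
    # the way a running sum tracks its "current" type.
    return reduce(coerce_pair, map(type, values), int)


first = (1, Fraction(2, 3), 1.0, Decimal(2.3), 2.0, Decimal(5))
second = (1, Fraction(2, 3), Decimal(2.3), 1.0, 2.0, Decimal(5))

print(running_type(first))  # float is reached before the first Decimal
try:
    running_type(second)
except TypeError as exc:
    print(exc)              # Fraction meets Decimal before any float is seen
```

Because the fold is pairwise, the same multiset of types can succeed or fail depending only on where the first float appears.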
Confusing, isn't it? So here's the code of the _sum function:
def _sum(data, start=0):
    """_sum(data [, start]) -> value

    Return a high-precision sum of the given numeric data. If optional
    argument ``start`` is given, it is added to the total. If ``data`` is
    empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------

    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    11.0

    Some sources of round-off error will be avoided:

    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    1000.0

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    Fraction(63, 20)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    Decimal('0.6963')

    """
    n, d = _exact_ratio(start)
    T = type(start)
    partials = {d: n}  # map {denominator: sum of numerators}
    # Micro-optimizations.
    coerce_types = _coerce_types
    exact_ratio = _exact_ratio
    partials_get = partials.get
    # Add numerators for each denominator, and track the "current" type.
    for x in data:
        T = _coerce_types(T, type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total)
Internally, the function uses exact ratios for its calculations (which I think is very nice) and only goes through all the pain of coercing types to return T(total.numerator)/total.denominator where T is the final type resulting from the chain of conversions.
I think a much cleaner (and probably faster) implementation would be to first gather all the types in the input sequence, then decide what to return in a way that does not depend on input order. My tentative implementation:
def _sum2(data, start=None):
    if start is not None:
        t = set((type(start),))
        n, d = _exact_ratio(start)
    else:
        t = set()
        n = 0
        d = 1
    partials = {d: n}  # map {denominator: sum of numerators}

    # Micro-optimizations.
    exact_ratio = _exact_ratio
    partials_get = partials.get

    # Add numerators for each denominator, and build up a set of all types.
    for x in data:
        t.add(type(x))
        n, d = exact_ratio(x)
        partials[d] = partials_get(d, 0) + n
    T = _coerce_types(t)  # decide which type to use based on set of all types
    if None in partials:
        assert issubclass(T, (float, Decimal))
        assert not math.isfinite(partials[None])
        return T(partials[None])
    total = Fraction()
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)
    if issubclass(T, int):
        assert total.denominator == 1
        return T(total.numerator)
    if issubclass(T, Decimal):
        return T(total.numerator)/total.denominator
    return T(total)
This leaves the re-implementation of _coerce_types. Personally, I'd prefer something as simple as possible, maybe even:

def _coerce_types(types):
    if len(types) == 1:
        return next(iter(types))
    return float

but that's just a suggestion.
In this case then:
_sum2((1,Fraction(2,3),1.0,Decimal(2.3),2.0, Decimal(5)))/6
1.9944444444444445
_sum2((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))/6
1.9944444444444445
Let's check the examples from the _sum docstring just to be sure:
_sum2([3, 2.25, 4.5, -0.5, 1.0], 0.75)
11.0
_sum2([1e50, 1, -1e50] * 1000) # Built-in sum returns zero.
1000.0
from fractions import Fraction as F
_sum2([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
Fraction(63, 20)
from decimal import Decimal as D
data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
_sum2(data)
Decimal('0.6963')
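For readers who want to try the idea without the module's private helpers, here is a self-contained sketch of the same approach. It is an assumption-laden stand-in, not the proposed patch: `sum_by_type_set` is a hypothetical name, `Fraction(x)` replaces `_exact_ratio`, and non-finite floats or Decimals are not handled (converting them to Fraction raises).

```python
from decimal import Decimal
from fractions import Fraction


def sum_by_type_set(data):
    """Sum exactly, then choose the result type from the set of input types."""
    types = set()
    total = Fraction(0)
    for x in data:
        types.add(type(x))
        total += Fraction(x)  # exact for int, float, Fraction, finite Decimal
    if not types:
        return 0
    if len(types) > 1:
        return float(total)   # the "mixed input means float" rule from above
    (T,) = types
    if T is Decimal:
        # Single division, so the result is rounded once by the decimal context
        return Decimal(total.numerator) / Decimal(total.denominator)
    if T is int:
        return int(total)
    return T(total)


a = sum_by_type_set((1, Fraction(2, 3), 1.0, Decimal(2.3), 2.0, Decimal(5)))
b = sum_by_type_set((1, Fraction(2, 3), Decimal(2.3), 1.0, 2.0, Decimal(5)))
print(a == b)  # the result no longer depends on input order
```

Since the type decision is made once from a set, permuting the input cannot change either the value or the type of the result.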
Now the second issue: It is maybe more a matter of taste and concerns the effects of passing a Counter() object to various functions in the module. I know this is undocumented and it's probably the user's fault if he tries that, but still:
from collections import Counter
c=Counter((1,1,1,1,2,2,2,2,2,3,3,3,3))
c
Counter({1: 4, 2: 5, 3: 4})
mode(c)
2
Cool, mode knows how to work with Counters (interpreting them as frequency tables).
median(c)
2
Looks good.
mean(c)
2.0
Very well.
But the truth is that only mode really works as you may think and we were just lucky with the other two:
c=Counter((1,1,2))
mean(c)
1.5
oops
median(c)
1.5
hmm
From a quick look at the code you can see that mode actually converts your input to a Counter behind the scenes anyway, so it has no problem. mean and median, on the other hand, are simply iterating over their input, so if that input happens to be a mapping, they'll use just the keys.
I think there are two simple ways to avoid this pitfall: 1) add an explicit warning to the docs explaining this behavior or 2) make mean and median do the same magic with Counters as mode does, i.e. make them check for Counter as the input type and deal with it as if it were a frequency table. I'd favor this behavior because it looks like little extra code, but may be very useful in many situations. I'm not quite sure whether maybe even all mappings should be treated that way?
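Until something like option 2 exists, one workaround that needs no changes to the module is to expand the Counter with its `elements()` method before passing it on. This is only a sketch of that workaround; the variable names are illustrative.

```python
from collections import Counter
from statistics import mean, median

c = Counter((1, 1, 2))

keys_mean = mean(c)                        # iterates over the keys 1 and 2 only
table_mean = mean(list(c.elements()))      # expands to 1, 1, 2 first
table_median = median(list(c.elements()))  # median of 1, 1, 2

print(keys_mean, table_mean, table_median)
```

Expanding costs memory proportional to the total count, which is the price a built-in frequency-table mode would avoid.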
Ok, that's it for now I guess. Opinions anyone? Best, Wolfgang
On 1/27/2014, 12:41 PM, Wolfgang wrote: [snip]
As for the first subject: Specifically, I am not happy with the way the function handles different types. Currently _coerce_types gets called for every element in the function's input sequence and type conversion follows quite complicated rules, and - what is worst - makes the outcome of _sum() and thereby mean() dependent on the order of items in the input sequence, e.g.:
mean((1,Fraction(2,3),1.0,Decimal(2.3),2.0, Decimal(5)))
1.9944444444444445
mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    mean((1,Fraction(2,3),Decimal(2.3),1.0,2.0, Decimal(5)))
  File "C:\Python33\statistics.py", line 369, in mean
    return _sum(data)/n
  File "C:\Python33\statistics.py", line 157, in _sum
    T = _coerce_types(T, type(x))
  File "C:\Python33\statistics.py", line 327, in _coerce_types
    raise TypeError('cannot coerce types %r and %r' % (T1, T2))
TypeError: cannot coerce types <class 'fractions.Fraction'> and <class 'decimal.Decimal'>
FWIW, I find some of the concerns Wolfgang raised quite valid.
Steven, what do you think?
Yury
On 27/01/2014 17:41, Wolfgang wrote:
Ok, that's it for now I guess. Opinions anyone? Best, Wolfgang
So this doesn't get lost I'd be inclined to raise two issues on the bug tracker. It's also much easier for people to follow the issues there and better still, see what the actual outcome is.
On 01/30/2014 03:27 PM, Mark Lawrence wrote:
On 27/01/2014 17:41, Wolfgang wrote:
Ok, that's it for now I guess. Opinions anyone? Best, Wolfgang
So this doesn't get lost I'd be inclined to raise two issues on the bug tracker. It's also much easier for people to follow the issues there and better still, see what the actual outcome is.
Checking first is usually good policy, but now that you've had positive feedback, opening some issues on the bug tracker [1] is definitely a good idea.
-- ~Ethan~
On Mon, Jan 27, 2014 at 09:41:02AM -0800, Wolfgang wrote:
Dear all, I am still testing the new statistics module and I found two cases where the behavior of the module seems suboptimal to me. My most important concern is the module's internal _sum function and its implications; the other is about passing Counter objects to module functions.
As the author of the module, I'm also concerned with the internal _sum function. That's why it's now a private function -- I originally intended for it to be a public function (see PEP 450).
As for the first subject: Specifically, I am not happy with the way the function handles different types. Currently _coerce_types gets called for every element in the function's input sequence and type conversion follows quite complicated rules, and - what is worst - makes the outcome of _sum() and thereby mean() dependent on the order of items in the input sequence, e.g.:
[...]
(this is because when _sum iterates over the input type Fraction wins over int, then float wins over Fraction and over everything else that follows in the first example, but in the second case Fraction wins over int, but then Fraction vs Decimal is undefined and throws an error).
Confusing, isn't it?
I don't think so. The idea is that _sum() ought to reflect the standard, dare I say intuitive, behaviour of repeated application of the __add__ and __radd__ methods, as used by the plus operator. For example, int + <any numeric type> coerces to the other numeric type. What else would you expect?
In mathematics the number 0.4 is the same whether you write it as 0.4, 2/5, 0.4+0j, [0; 2, 2] or any other notation you care to invent. (That last one is a continued fraction.) In Python, the number 0.4 is represented by a value and a type, and managing the coercion rules for the different types can be fiddly and annoying. But they shouldn't be *confusing* -- we have a numeric tower, and if I've written the code correctly, the coercion rules ought to follow the tower as closely as possible.
So here's the code of the _sum function:
[...]
You should expect that to change, if for no other reason than performance. At the moment, _sum is about two orders of magnitude slower than the built-in sum. I think I can get it to about one order of magnitude slower.
I think a much cleaner (and probably faster) implementation would be to gather first all the types in the input sequence, then decide what to return in an input order independent way. My tentative implementation:
[...]
Thanks for this. I will add that to my collection of alternate versions of _sum.
this leaves the re-implementation of _coerce_types. Personally, I'd prefer something as simple as possible, maybe even:
def _coerce_types(types):
    if len(types) == 1:
        return next(iter(types))
    return float
I don't want to coerce everything to float unnecessarily. Floats are, in some ways, the worst choice for numeric values, at least from the perspective of accuracy and correctness. Floats violate several of the fundamental rules of mathematics, e.g. addition is not commutative:
py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False
One of my aims is to avoid raising TypeError unnecessarily. The statistics module is aimed at casual users who may not understand, or care about, the subtleties of numeric coercions, they just want to take the average of two values regardless of what sort of number they are. But having said that, I realise that mixed-type arithmetic is difficult, and I've avoided documenting the fact that the module will work on mixed types.
[...]
Now the second issue: It is maybe more a matter of taste and concerns the effects of passing a Counter() object to various functions in the module.
Interesting. If you think there is a use-case for passing Counters to the statistics functions (weighted data?) then perhaps they can be explicitly supported in 3.5. It's way too late for 3.4 to introduce new functionality.
[...]
From a quick look at the code you can see that mode actually converts your input to a Counter behind the scenes anyway, so it has no problem. mean and median, on the other hand, are simply iterating over their input, so if that input happens to be a mapping, they'll use just the keys.
Well yes :-)
I'm open to the suggestion that Counters should be treated specially. Would you be so kind as to raise an issue in the bug tracker?
Thanks for the feedback,
On Fri, Jan 31, 2014 at 12:07 PM, Steven D'Aprano steve@pearwood.info wrote:
One of my aims is to avoid raising TypeError unnecessarily. The statistics module is aimed at casual users who may not understand, or care about, the subtleties of numeric coercions, they just want to take the average of two values regardless of what sort of number they are. But having said that, I realise that mixed-type arithmetic is difficult, and I've avoided documenting the fact that the module will work on mixed types.
Based on the current docs and common sense, I would expect that Fraction and Decimal should normally be there exclusively, and that the only type coercions would be int->float->complex (because it makes natural sense to write a list of "floats" as [1.4, 2, 3.7], but it doesn't make sense to write a list of Fractions as [Fraction(1,2), 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with the other three types can be answered with "Well, you should be using the same type everywhere". (Though it might be useful to allow int->anything coercion, since that one's easy and safe.)
ChrisA
On Jan 30, 2014, at 17:32, Chris Angelico rosuav@gmail.com wrote:
On Fri, Jan 31, 2014 at 12:07 PM, Steven D'Aprano steve@pearwood.info wrote:
One of my aims is to avoid raising TypeError unnecessarily. The statistics module is aimed at casual users who may not understand, or care about, the subtleties of numeric coercions, they just want to take the average of two values regardless of what sort of number they are. But having said that, I realise that mixed-type arithmetic is difficult, and I've avoided documenting the fact that the module will work on mixed types.
Based on the current docs and common sense, I would expect that Fraction and Decimal should normally be there exclusively, and that the only type coercions would be int->float->complex (because it makes natural sense to write a list of "floats" as [1.4, 2, 3.7], but it doesn't make sense to write a list of Fractions as [Fraction(1,2), 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with the other three types can be answered with "Well, you should be using the same type everywhere". (Though it might be useful to allow int->anything coercion, since that one's easy and safe.)
Except that large enough int values lose information, and even larger ones raise an exception:
>>> float(pow(3, 50)) == pow(3, 50)
False
>>> float(1<<2000)
OverflowError: int too large to convert to float
And that first one is the reason why statistics needs a custom sum in the first place.
When there are only 2 types involved in the sequence, you get the answer you wanted. The only problem raised by the examples in this thread is that with 3 or more types that aren't all mutually coercible but do have a path through them, you can sometimes get imprecise answers and other times get exceptions, and you might come to rely on one or the other.
So, rather than throwing out Stephen's carefully crafted and clearly worded rules and trying to come up with new ones, why not (for 3.4) just say that the order of coercions given values of 3 or more types is not documented and subject to change in the future (maybe even giving the examples from the initial email)?
On Jan 30, 2014, at 19:47, Andrew Barnert abarnert@yahoo.com wrote:
So, rather than throwing out Stephen's carefully crafted and clearly worded rules
Sorry, I meant Steven there.
(At least I hope I did, otherwise this will be doubly embarrassing...)
On Thu, Jan 30, 2014 at 07:47:54PM -0800, Andrew Barnert wrote:
So, rather than throwing out Stephen's carefully crafted and clearly worded rules and trying to come up with new ones, why not (for 3.4) just say that the order of coercions given values of 3 or more types is not documented and subject to change in the future (maybe even giving the examples from the initial email)?
I am happy to have an explicit disclaimer in the docs saying the result of calculations on mixed types are not guaranteed and may be subject to change. Then for 3.5 we can consider this more carefully.
On Fri, Jan 31, 2014 at 2:47 PM, Andrew Barnert abarnert@yahoo.com wrote:
Based on the current docs and common sense, I would expect that Fraction and Decimal should normally be there exclusively, and that the only type coercions would be int->float->complex (because it makes natural sense to write a list of "floats" as [1.4, 2, 3.7], but it doesn't make sense to write a list of Fractions as [Fraction(1,2), 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with the other three types can be answered with "Well, you should be using the same type everywhere". (Though it might be useful to allow int->anything coercion, since that one's easy and safe.)
Except that large enough int values lose information, and even larger ones raise an exception:
>>> float(pow(3, 50)) == pow(3, 50)
False
>>> float(1<<2000)
OverflowError: int too large to convert to float
And that first one is the reason why statistics needs a custom sum in the first place.
I don't think it'd be possible to forbid int -> float coercion - the Python community (and Steven himself) would raise an outcry. But int->float is at least as safe as it's fundamentally possible to be. Adding ".0" to the end of a literal (thus making it a float literal) is, AFAIK, absolutely identical to wrapping it in "float(" and ")". That's NOT true of float -> Fraction or float -> Decimal - going via float will cost precision, but going via int ought to be safe.
float(pow(3,50)) == pow(3.0,50)
True
The difference between int and any other type is going to be pretty much the same whether you convert first or convert last. The only distinction that I can think of is floating-point rounding errors, which are already dealt with:
statistics._sum([pow(2.0,53),1.0,1.0,1.0])
9007199254740996.0
sum([pow(2.0,53),1.0,1.0,1.0])
9007199254740992.0
Since it handles this correctly with all floats, it'll handle it just fine with some ints and some floats:
sum([pow(2,53),1,1,1.0])
9007199254740996.0
statistics._sum([pow(2,53),1,1,1.0])
9007199254740996.0
In this case, the builtin sum() happens to be correct, because it adds the first ones as ints, and then converts to float at the end. Of course, "correct" isn't quite correct - the true value based on real number arithmetic is ...95, as can be seen in Python if they're all ints. But I'm defining "correct" as "the same result that would be obtained by calculating in real numbers and then converting to the data type of the end result". And by that definition, builtin sum() is correct as long as the float is right at the end, and statistics._sum() is correct regardless of the order.
statistics._sum([1.0,pow(2,53),1,1])
9007199254740996.0
sum([1.0,pow(2,53),1,1])
9007199254740992.0
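The definition of "correct" used above (calculate in real numbers, then convert once to the result type) can be written down directly with Fraction. This is a minimal sketch of that definition, not of how statistics._sum is implemented; `exact_then_float` is a hypothetical helper name.

```python
from fractions import Fraction


def exact_then_float(values):
    # Sum exactly as Fractions, then round to float a single time at the end
    return float(sum(map(Fraction, values)))


float_first = [1.0, pow(2, 53), 1, 1]
float_last = [pow(2, 53), 1, 1, 1.0]

print(exact_then_float(float_first))
print(exact_then_float(float_last))
```

Both orderings pass through the same exact total, so the single final rounding gives the same float for each.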
So in that sense, it's "safe" to cast all ints to float if the result is going to be float, unless an individual value is itself too big to convert even though the final result (thanks to negative values) would have been representable. I'm not sure how that's currently handled, but this particular case is working:
statistics._sum([1.0,1<<2000,0-(1<<2000)])
1.0
The biggest problem, then, is cross-casting between float, Fraction, and Decimal. And anyone who's mixing those is asking for trouble already.
ChrisA
On Fri, Jan 31, 2014 at 3:09 PM, Steven D'Aprano steve@pearwood.info wrote:
On Thu, Jan 30, 2014 at 07:47:54PM -0800, Andrew Barnert wrote:
So, rather than throwing out Stephen's carefully crafted and clearly worded rules and trying to come up with new ones, why not (for 3.4) just say that the order of coercions given values of 3 or more types is not documented and subject to change in the future (maybe even giving the examples from the initial email)?
I am happy to have an explicit disclaimer in the docs saying the result of calculations on mixed types are not guaranteed and may be subject to change. Then for 3.5 we can consider this more carefully.
+1.
ChrisA
Steven D'Aprano writes:
Floats violate several of the fundamental rules of mathematics, e.g. addition is not commutative:
AFAIK it is.
py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False
This is a failure of associativity, not commutativity. Associativity is in many ways a more fundamental property.
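The distinction is easy to check with the same three values. The sketch below only restates the example from the quoted message; the variable names are illustrative.

```python
a, b, c = 1e19, -1e19, 0.1

# Swapping the operands of a single addition never changes the result here
commutes = (a + c == c + a)

# Regrouping does: 0.1 is lost when it is first added to -1e19
associates = (a + (b + c) == (a + b) + c)

print(commutes, associates)
```

The left grouping rounds `b + c` back to -1e19 because 0.1 is far below the spacing between adjacent floats at that magnitude.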
On Fri, Jan 31, 2014 at 02:56:39PM +0900, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
Floats violate several of the fundamental rules of mathematics, e.g. addition is not commutative:
AFAIK it is.
py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False
This is a failure of associativity, not commutativity.
Oops, you are correct. I got them mixed up.
http://en.wikipedia.org/wiki/Associativity
However, commutativity of addition can be violated by Python numeric types, although not floats alone. E.g. the example I gave earlier of two int subclasses.
On Jan 30, 2014, at 21:56, "Stephen J. Turnbull" stephen@xemacs.org wrote:
Steven D'Aprano writes:
Floats violate several of the fundamental rules of mathematics, e.g. addition is not commutative:
AFAIK it is.
py> 1e19 + (-1e19 + 0.1) == (1e19 + -1e19) + 0.1
False
This is a failure of associativity, not commutativity. Associativity is in many ways a more fundamental property.
Yeah, the only way commutativity can fail with IEEE floats is if you treat nan as a number and have at least two nans, at least one of them quiet.
But associativity failing isn't really fundamental. This example fails as a consequence of the axiom of (additive) identity not holding. (There is a unique "zero", but it's not true that, for all y, x+y=y implies x is that zero.) The overflow example fails because of closure not holding (unless you count inf and nan as numbers, in which case it again fails because zero fails even more badly).
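Both failures named above can be observed in a couple of lines. This is just an illustration with values chosen for convenience; the specific magnitudes are assumptions, not taken from the thread.

```python
import math

# Identity: x + y == y holds although x is not zero (x is absorbed by y)
x, y = 0.1, 1e19
absorbed = (x + y == y)

# Closure: the sum of two finite floats is not a finite float
overflowed = math.isinf(1e308 + 1e308)

print(absorbed, overflowed)
```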
If you just meant that you lose commutativity before associativity in compositions over fields, then yeah, I guess in that sense associativity is more fundamental.
Chris Angelico <rosuav@...> writes:
On Fri, Jan 31, 2014 at 12:07 PM, Steven D'Aprano <steve@...> wrote:
One of my aims is to avoid raising TypeError unnecessarily. The statistics module is aimed at casual users who may not understand, or care about, the subtleties of numeric coercions, they just want to take the average of two values regardless of what sort of number they are. But having said that, I realise that mixed-type arithmetic is difficult, and I've avoided documenting the fact that the module will work on mixed types.
Based on the current docs and common sense, I would expect that Fraction and Decimal should normally be there exclusively, and that the only type coercions would be int->float->complex (because it makes natural sense to write a list of "floats" as [1.4, 2, 3.7], but it doesn't make sense to write a list of Fractions as [Fraction(1,2), 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with the other three types can be answered with "Well, you should be using the same type everywhere".
Well, that's simple to stick to as long as you are dealing with explicitly typed input data sets, but what about things like:
a = transform_a_series_of_data_somehow(data)
b = transform_this_series_differently(data)
statistics.mean(a+b) # assuming a and b are lists of transformed values
potentially different types are far more difficult to spot here and the fact that the result of the above might not be the same as, e.g.,:
statistics.mean(b+a)
is not making things easier to debug.
(Though it might be useful to allow int->anything coercion, since that one's easy and safe.)
It should be mentioned here that complex numbers are not currently dealt with by statistics._sum .
statistics._sum((complex(1),))
Traceback (most recent call last):
  File "<pyshell#62>", line 1, in <module>
    s._sum((complex(1),))
  File ".\statistics.py", line 158, in _sum
    n, d = exact_ratio(x)
  File ".\statistics.py", line 257, in _exact_ratio
    raise TypeError(msg.format(type(x).__name__)) from None
TypeError: can't convert type 'complex' to numerator/denominator
Best, Wolfgang
On 31 January 2014 03:47, Andrew Barnert abarnert@yahoo.com wrote:
On Jan 30, 2014, at 17:32, Chris Angelico rosuav@gmail.com wrote:
On Fri, Jan 31, 2014 at 12:07 PM, Steven D'Aprano steve@pearwood.info wrote:
One of my aims is to avoid raising TypeError unnecessarily. The statistics module is aimed at casual users who may not understand, or care about, the subtleties of numeric coercions, they just want to take the average of two values regardless of what sort of number they are. But having said that, I realise that mixed-type arithmetic is difficult, and I've avoided documenting the fact that the module will work on mixed types.
Based on the current docs and common sense, I would expect that Fraction and Decimal should normally be there exclusively, and that the only type coercions would be int->float->complex (because it makes natural sense to write a list of "floats" as [1.4, 2, 3.7], but it doesn't make sense to write a list of Fractions as [Fraction(1,2), 7.8, Fraction(12,35)]). Any mishandling of Fraction or Decimal with the other three types can be answered with "Well, you should be using the same type everywhere". (Though it might be useful to allow int->anything coercion, since that one's easy and safe.)
Except that large enough int values lose information, and even larger ones raise an exception:
>>> float(pow(3, 50)) == pow(3, 50)
False
>>> float(1<<2000)
OverflowError: int too large to convert to float
And that first one is the reason why statistics needs a custom sum in the first place.
When there are only 2 types involved in the sequence, you get the answer you wanted. The only problem raised by the examples in this thread is that with 3 or more types that aren't all mutually coercible but do have a path through them, you can sometimes get imprecise answers and other times get exceptions, and you might come to rely on one or the other.
So, rather than throwing out Stephen's carefully crafted and clearly worded rules and trying to come up with new ones, why not (for 3.4) just say that the order of coercions given values of 3 or more types is not documented and subject to change in the future (maybe even giving the examples from the initial email)?
You're making this sound a lot more complicated than it is. The problem is simple: Decimal doesn't integrate with the numeric tower. This is explicit in the PEP that brought in the numeric tower: http://www.python.org/dev/peps/pep-3141/#the-decimal-type
See also this thread (that I started during extensive off-list discussions about the statistics.sum function with Steven): https://mail.python.org/pipermail//python-ideas/2013-August/023034.html
Decimal makes the following concessions for mixing numeric types:
1) It will promote integers in arithmetic.
2) It will compare correctly against all numeric types (as long as FloatOperation isn't trapped).
3) It will coerce int and float in its constructor.
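The three concessions, and the mixing that remains unsupported, can be checked interactively. The sketch below assumes the default decimal context (FloatOperation not trapped); `raises_type_error` is a hypothetical helper used only to keep the example short.

```python
from decimal import Decimal
from fractions import Fraction

promoted = Decimal('1.5') + 1      # 1) ints are promoted in arithmetic
compared = Decimal('0.5') == 0.5   # 2) comparison against a float works
constructed = Decimal(0.5)         # 3) the constructor accepts a float


def raises_type_error(op):
    """Return True if calling op() raises TypeError."""
    try:
        op()
    except TypeError:
        return True
    return False


# Arithmetic that mixes Decimal with float or Fraction is rejected
float_mix = raises_type_error(lambda: Decimal('1.5') + 1.0)
fraction_mix = raises_type_error(lambda: Decimal('1.5') + Fraction(1, 2))

print(promoted, compared, constructed, float_mix, fraction_mix)
```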
The recently added FloatOperation trap suggests that there's more interest in prohibiting the mixing of Decimals with other numeric types than facilitating it. I can imagine getting in that camp myself: speaking as someone who finds uses for both the fractions module and the decimal module I feel qualified to say that there is no good use case for mixing these types. Similarly there's no good use-case for mixing floats with Fractions or Decimals although mixing float/Fraction does work. If you choose to use Decimals then it is precisely because you do need to care about the numeric types you use and the sort of accuracy they provide. If you find yourself mixing Decimals with other numeric types then it's more likely a mistake/bug than a convenience.
In any case the current implementation of statistics._sum (AIUI, I don't have it to hand for testing) will do the right thing for any mix of types in the numeric tower. It will also do the right thing for Decimals: it will compute the exact result and then round once according to the current decimal context. It's also possible to mix int and Decimal but there's no sensible way to handle mixing Decimal with anything else.
If there is to be a documented limitation on mixing types then it should be explicitly about Decimal: The statistics module works very well with Decimal but doesn't really support mixing Decimal with other types. This is a limitation of Python rather than the statistics module itself. That being said I think that guaranteeing an error is better than the current order-dependent behaviour (and agree that that should be considered a bug).
If there is to be a more drastic rearrangement of the _sum function then it should actually be to solve the problem that the current implementation of mean, variance etc. uses Fractions for all the heavy lifting but then rounds in the wrong place (when returning from _sum()) rather than in the mean, variance function itself.
The clever algorithm in the variance function (unless it changed since I last looked) is entirely unnecessary when all of the intensive computation is performed with exact arithmetic. In the absence of rounding error you could compute a perfectly good variance using the computational formula for variance in a single pass. Similarly although the _sum() function is correctly rounded, the mean() function calls _sum() and then rounds again so that the return value from mean() is rounded twice. _sum() computes an exact value as a fraction and then coerces it with
return T(total_numerator) / total_denominator
so that the division causes it to be correctly rounded. However the mean function effectively ends up doing
return (T(total_numerator) / total_denominator) / num_items
which uses 2 divisions and hence rounds twice. It's trivial to rearrange that so that you round once
return T(total_numerator) / (total_denominator * num_items)
except that to do this the _sum function should be changed to return the exact result as a Fraction (and perhaps the type T). Similar changes would need to be made to the sum of squares function (_ss() IIRC). The double rounding in mean() isn't a big deal but the corresponding effect for the variance functions is significant. It was after realising this that the sum function was renamed _sum and made nominally private.
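The effect of the two divisions is easiest to see with Decimal at a deliberately tiny precision. The following is a sketch under assumptions: the function names are hypothetical, the context precision of 2 digits is chosen only to make the difference visible, and the code mirrors the two return expressions above rather than the module's actual source.

```python
from decimal import Decimal, localcontext
from fractions import Fraction


def mean_rounded_twice(data):
    total = sum(map(Fraction, data))  # exact total
    # First division rounds the sum, second division rounds again
    return (Decimal(total.numerator) / total.denominator) / len(data)


def mean_rounded_once(data):
    total = sum(map(Fraction, data))  # exact total
    # Single division, so the exact mean is rounded only once
    return Decimal(total.numerator) / (total.denominator * len(data))


data = [Decimal('0.65'), Decimal('0.65'), Decimal('0.66')]

with localcontext() as ctx:
    ctx.prec = 2
    twice = mean_rounded_twice(data)
    once = mean_rounded_once(data)

print(twice, once)
```

With two digits of precision the exact sum 1.96 is first rounded to 2.0, so the twice-rounded mean ends up above every data point, while the once-rounded mean stays at the correctly rounded value of 0.65333...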
To be clear, statistics.variance(list_of_decimals) is very accurate. However it uses more passes than is necessary and it can be inaccurate in the situation that you have Decimals whose precision exceeds that of the current decimal context e.g.:
import decimal
d = decimal.Decimal('300000000000000000000000000000000000000000')
d
Decimal('300000000000000000000000000000000000000000')
d+1 # Any arithmetic operation loses precision
Decimal('3.000000000000000000000000000E+41')
+d # Use context precision
Decimal('3.000000000000000000000000000E+41')
If you're using Fractions for all of your computation then you can change this since no precision is lost when calling Fraction(Decimal):
import fractions
fractions.Fraction(d)+1
Fraction(300000000000000000000000000000000000000001, 1)
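Returning to the earlier remark that exact arithmetic makes a single pass with the computational formula sufficient, here is a minimal sketch of what that could look like. `exact_variance` is a hypothetical name, it returns a Fraction rather than converting back to the input type, and it is not the module's implementation.

```python
from fractions import Fraction


def exact_variance(data):
    """Sample variance via the computational formula, in exact arithmetic."""
    n = 0
    s = Fraction(0)   # running sum of x
    ss = Fraction(0)  # running sum of x squared
    for x in data:
        x = Fraction(x)  # exact for int, float, Fraction, finite Decimal
        n += 1
        s += x
        ss += x * x
    if n < 2:
        raise ValueError('variance requires at least two data points')
    # No cancellation problem here because nothing has been rounded yet
    return (ss - s * s / n) / (n - 1)


print(exact_variance([1, 2, 3, 4]))
```

The caller would then round the returned Fraction exactly once when converting to float or Decimal.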
Oscar
On 1 February 2014 23:32, Oscar Benjamin oscar.j.benjamin@gmail.com wrote:
You're making this sound a lot more complicated than it is. The problem is simple: Decimal doesn't integrate with the numeric tower. This is explicit in the PEP that brought in the numeric tower: http://www.python.org/dev/peps/pep-3141/#the-decimal-type
http://bugs.python.org/issue20481 now covers the concerns over avoiding making any guarantees that the current type coercion behaviour of the statistics module will be preserved indefinitely (it includes a link back to the archived copy of Oscar's post on mail.python.org).
Cheers, Nick.
Nick Coghlan <ncoghlan@...> writes:
On 1 February 2014 23:32, Oscar Benjamin <oscar.j.benjamin@...> wrote:
You're making this sound a lot more complicated than it is. The problem is simple: Decimal doesn't integrate with the numeric tower. This is explicit in the PEP that brought in the numeric tower: http://www.python.org/dev/peps/pep-3141/#the-decimal-type
http://bugs.python.org/issue20481 now covers the concerns over avoiding making any guarantees that the current type coercion behaviour of the statistics module will be preserved indefinitely (it includes a link back to the archived copy of Oscar's post on mail.python.org).
Cheers, Nick.
Thanks a lot, Nick, for all your efforts in filing the bugs. I just added a possible patch for http://bugs.python.org/issue20481 to the bug tracker. Best, Wolfgang