Why is there a significant difference between the r2_score function in scikit-learn and the formula for the Coefficient of Determination as described in Wikipedia? Which is the correct one?
I’m using Python 3.5 to fit linear and quadratic models, and one of the goodness-of-fit measures I’m trying out is the coefficient of determination (R^2). However, while testing, there’s a marked difference between the r2_score metric in scikit-learn and the calculation provided in Wikipedia.
I’m providing my code here as reference, which computes the example in the Wikipedia page linked above.
    from sklearn.metrics import r2_score
    import numpy

    y = [1, 2, 3, 4, 5]
    f = [1.9, 3.7, 5.8, 8.0, 9.6]

    # Convert to numpy array and ensure double precision to avoid single precision errors
    observed = numpy.array(y, dtype=numpy.float64)
    predicted = numpy.array(f, dtype=numpy.float64)

    scipy_value = r2_score(observed, predicted)
    # scipy_value: -3.8699999999999992
As is evident, the value calculated by scikit-learn (scipy_value) is -3.8699999999999992, while the reference value in Wikipedia is 0.998.
UPDATE: This is different from this question about how R^2 is calculated in scikit-learn. What I’m trying to understand is the discrepancy between the two results. That question states that the formula used in scikit-learn is the same as Wikipedia’s, which should not result in different values.
UPDATE #2: It turns out I misread the Wikipedia article’s example. Answers and comments below point out that the article’s example is the linear least squares fit of the (x, y) values, and for that fit the R^2 value given in Wikipedia, 0.998, is correct. For the R^2 between the two vectors, scikit-learn’s answer is also correct. Thanks a lot for your help!
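To make the distinction in the update concrete, here is a minimal sketch in plain Python. It assumes, as the update describes, that the Wikipedia example fits a least squares line to (x, y) pairs, with [1, 2, 3, 4, 5] as x and the second list as the observed y. The names x, y_obs, y_fit and the helper r_squared are illustrative only and do not come from the article or from scikit-learn.

```python
# Assumption: x = 1..5 and the second list holds the observed y values.
x = [1, 2, 3, 4, 5]
y_obs = [1.9, 3.7, 5.8, 8.0, 9.6]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Closed-form simple linear regression: slope = S_xy / S_xx
x_mean = sum(x) / len(x)
y_mean = sum(y_obs) / len(y_obs)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y_obs))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean
y_fit = [slope * xi + intercept for xi in x]

r2_fit = r_squared(y_obs, y_fit)   # fitted line vs. observations
r2_direct = r_squared(x, y_obs)    # the two raw vectors, as in the question

print(round(r2_fit, 3))     # 0.998
print(round(r2_direct, 2))  # -3.87
```

The first number measures how well a fitted line explains the observations. The second treats one list as the truth and the other as a prediction of it, which is what the code in the question computes.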
The linked question is correct: if you work through the calculation for the residual sum of squares and the total sum of squares, you get the same value as sklearn:
    In [1]: import numpy as np
    In [2]: y = [1, 2, 3, 4, 5]
    In [3]: f = [1.9, 3.7, 5.8, 8.0, 9.6]
    In [4]: SSres = sum(map(lambda x: (x[0]-x[1])**2, zip(y, f)))
    In [5]: SStot = sum([(x-np.mean(y))**2 for x in y])
    In [6]: SSres, SStot
    Out[6]: (48.699999999999996, 10.0)
    In [7]: 1-(SSres/SStot)
    Out[7]: -3.8699999999999992
The idea behind a negative value is that you’d have been closer to the actual values had you just predicted the mean each time (which would correspond to an r2 = 0).
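As a rough illustration of that point, the sketch below compares the predictions f against a baseline that always predicts the mean of y. The helper r2_against_y and the name baseline are hypothetical, introduced only for this example. It uses plain Python so the arithmetic stays visible.

```python
y = [1, 2, 3, 4, 5]
f = [1.9, 3.7, 5.8, 8.0, 9.6]

y_mean = sum(y) / len(y)
ss_tot = sum((yi - y_mean) ** 2 for yi in y)


def r2_against_y(predictions):
    """R^2 of the given predictions against y: 1 - SS_res / SS_tot."""
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    return 1 - ss_res / ss_tot


# Baseline: predict the mean of y every time, so SS_res equals SS_tot.
baseline = [y_mean] * len(y)

r2_baseline = r2_against_y(baseline)
r2_f = r2_against_y(f)

print(r2_baseline)     # 0.0
print(round(r2_f, 2))  # -3.87
```

Because f lies further from y than the constant mean does, its residual sum of squares exceeds the total sum of squares and the score drops below zero.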
Answered By – Randy
Answer Checked By – Cary Denson (BugsFixing Admin)