# Correlation Finding Software?

Does anybody know of a program that can take a spreadsheet with numerical data in each column, and find correlations between columns? This is for something scouting related.

Going back to statistics, just calculate the r2-value of the lists of data to find the correlation coefficient.

You can do this in Excel, and you can also compute it on TI-83+/84 graphing calculators, by entering one set of values into one list (such as L1) and the other set of values into another list (such as L2), and then using the LinReg(ax+b) L1,L2 function (which can be found in the [STAT] then “Calc” menu) to find the r-values, as well as the a and b values of the line of best fit for the data.

Since the r2-value explains the amount of the variance in the data, a perfect correlation of the two lists of data would have a r2-value of 1.0, and a r2-value of 0.228 would indicate the 22.8% of the variance of the latter set of data can be explained by the first set of data. The range of the r2-value is from 0 to 1.

If you want to find whether or not the data is positively or negatively correlated, use the plain r-value. These values will range from -1 to 1, depending on the correlation. (-1 = perfect negative correlation; 0 = no correlation; 1 = perfect positive correlation)

R^2 correlation coefficients are certainly a good start for any statistical analysis of a dataset. However, keep in mind that there are many other approaches to calculating correlation than the Pearson method (a.k.a. R^2 values).

If you’re really serious about it, I would suggest picking up a book on basic statistics. While you might not understand all of the underlying theory and mathematics behind some of the tests (I certainly don’t), at the very least it will give you a basic understanding of them and (because statistics is, after all, math) formulas for each test that can easily be integrated into any piece of software (be it Excel or your own creation).

All that being said, ALWAYS REMEMBER: CORRELATION DOES NOT IMPLY CAUSATION. No number alone can ever give you any useful information about a link between two variables. To be useful, it has to be coupled with an understanding of the mechanics of the situation and a healthy dose of common sense.

Happy number crunching!

I suggest downloading Minitab. It is a really nice stat package that does exactly what you are asking for, and it comes with a free (30 day?) trial.

Once you’ve installed it, copy your data into two columns. Then go wild with all the different regression tests. If the test exists, Minitab has it.

You could also try MATLAB, but that is a more generic math tool kit. Minitab is just stats.

In excel there are functions called SLOPE and INT that will calculate the best fit linear equation values (y=SLOPE*x+INT) from an x & y array of values. There are also formulas for exponential distributions.

Additionally, if you are still looking for the which best fit eqn to use try plotting the arrays on an XY graph. From the graph select a point from the dataset, right click and select the trendline option. This will allow you select from a variety of eqn forms (polynomial, log, exp…) and well as display the equation for the best fit curve and the R^2 value. You can use the equation for secondary processing, but it does introduce a minor amount of manual processing.

That goes back to my favorite statistical fallacy example: buying more ice cream leads to more shark attacks! :yikes:

To second notaPINKtruck, you’re going to need at least a stat book and a good bit of time if you want to get anything useful out of it. The suggestion to use excel’s linear fit is a bit misguided. In that it only means something if you have a linear relationship between two variables. If you’re trying to predict winning percentage from your various scouting data, then you’re not really going to get anywhere with excel. In fact, you’re highly unlikely to get anywhere without some sort of guess at a relationship between all your variables. In technical terms, you need a model to fit to your data. Statistics isn’t going to come up with one for you. At any rate, I’m told that The R Project is a good free statistics package, if possibly a bit more complicated than you’d like. I think when it comes down to it, this whole endeavor is going to be a bit more complicated than, say, dumping scouting data into a program and having a winning percentage or score pop out.

For the record I am not suggesting that excel’s linear fit should be used as I have no idea what data is being worked with. The question was about finding correlation between data ranges in a spreadsheet. I was simply pointing out that excel has built in functions for that… I see no reason to reinvent the wheel if one is just trying to convert the data into information. However, if one is trying to learn about the technique then, yeah, grab a book.

I also found this page that gives the excel calcs for other trendlines… http://j-walk.com/ss/excel/tips/tip101.htm

Yeeesss… But, you see, everything Excel does is still basic single variable linear regression. You’ll notice the formulas you linked to there are just a series of linear regressions on transformed data. If you need to do a logistic regression, you’re flat out of luck. And if reality fundamentally depends on two independent variables, then you’re unlikely to get any useful information out of Excel. Since variations in your data due to the second variable will mostly appear as so much noise if your regression only accounts for one variable. Unless you only look at data points where all other variables are the same, which isn’t terribly useful. On the whole, you’re much more likely to get useful information out of the data by actually, you know, doing statistics on it. as for analysis I second the basic statistics approach… You could use excel to implement it or code it up yourself (it’s suprisingly easy to process formated data like an excel outputted text file)