# Excel Statistics for SEO and Data Analysis

Everybody has probably already realized that there is almost no data that we cannot get. We can get data about our website by using free tools, but we also spend tons of money on paid tools to get even more. Analyzing the competition is just as easy, competitive intelligence tools are everywhere, we often use Compete or Hitwise. Opens Site Explorer is great for getting more data about our and competitors backlink profile. No matter what information we are trying to get, we can, by spending fortunes or no money. My favorite part is that almost every tool has one common feature and that is the “Export” button. This is the most powerful feature of all these tools because by exporting the data into Excel and we can sort it, filter it and model it in any way we want. Most of us use Excel on the regular basis, we are familiar with the basic functions but Excel can do way more than that. In the following article I will try to present the most common statistical techniques and the best part it is that we don’t have to memorize complicated statistical equations, it’s everything built into Excel!

Statistics is all about collecting, analyzing and interpreting data. It comes very handy when decision making faces uncertainty. By using statistics, we can overcome these situations and generate actionable analysis.

Statistics is divided into two major branches, **descriptive **and **inferential**.

**Descriptive statistics** are used when you know all the values in the dataset. For example, you take a survey of 1000 people asking if they like oranges, with two choices (Yes and No). You collect the results and you find out that 900 answered Yes, and 100 answered No. You find the proportion 90% is Yes 10 is No. Pretty simple right?

But what happens when we cannot observe all the data?

When you know only part of your data than you have to use **inferential statistics**. Inferential statistics is used when you know only a sample (a small part) from your data and you make guesses about the entire population (data).

Let’s consider you want to calculate the email open rate for the last 24 months, but you have data only from the last six months. In this case, assuming that from 1000 emails you had 200 people opening the email, which resulted in 800 emails that didn’t convert. This equates to 20% open rate and 80% who did not open. This data is true for the last six months, but it might not be true for 24 months. Inferential statistics helps us understand how close we are to the entire population and how confident we are in this assumption.

The open rate for the sample may be 20% but it may vary a little. Therefore, let’s consider +- 3% in this case the range is from 17% to 23%. This sounds pretty good but how confident are we in these data? Alternatively, what percentage of a random sample taken from the entire population (data set) will fall in the range of 17%-23%?

In statistics, the 95% confidence level is considered to be reliable data. This means 95% of the sample data we take from the entire population will produce an open rate of 17-23%, the other 5% will be either above 23% or below 17%. But we are 95% sure that the open rate is 20% +- 3%

The term **data **stands for any value that describes an object or an event such as visitors, surveys, emails.

The term **data set** has two components, *observation unit*, which is for example visitors and the variables that can represent the demographic characteristics of your visitors such as age, salary or education level. *Population *refers to every member of your group, or in web analytics all the visitors. Let’s assume 10,000 visitors.

A **sample **is only a part of your population, based on a date range, visitors who converted, etc. but in statistics the most valuable sample is considered a random sample.

The **data distribution** is given by the frequency with which the values in the data set occur. By plotting the frequencies on a chart, with the range of the values on the horizontal axis and the frequencies on the vertical axis, we obtain the distribution curve. The most commonly used distribution is the normal distribution or the bell-shaped curve.

An easy way to understand this is by considering the number of visitors a website has. For example the number of visits are on average of 2000/day but it happens to have more visits such as 3000 or less 1000.

Here, **probability theory** comes in handy.

**Probability **stands for the likelihood of an event happening such as having 3,000 visitors/day and is expressed in percentages.

The most common example of probability that probably everybody knows is the coin flip. A coin has two faces, head and tail, what is the probability when flipping a coin to have head? Well there are two possibilities so 100%/2=50%.

Enough with theories and let’s get a little bit more practical.

Excel is an amazing tool that can help us with statistics, it’s not the best but we all know how to use it so let’s dive right into it.

First, install the Analysis ToolPack.

Open Excel, Go to Options -> Add-ins->at the bottom we will find

Hit Go ->select Analysis ToolPack->and click OK.

Now under the Data tab we will find Data Analysis.

The Data Analysis tool can give you supper fancy statistical information but first let’s start with something easier.

**Mean, Median, and Mode**

**Mean **is the statistical meaning of average, for example the mean or average of 4,5,6 is 5 how we calculate in excel the mean? =average(number1,number2,etc)

Mean=AVERAGE(AC16:AC21)

By calculating the mean we know how much we sold on average. This information is valuable when there are no extreme values (or outliers). Why? It looks like we sold on average $3000 worth of products, but actually we were lucky that somebody spent more on September 6. But actually we did pretty poorly during the previous six days, with an average of only $618. Excluding the extreme values from the mean can reflect a more relevant performance rate.

The **median **is the observation situated in the middle of the data set. For example, the median of 224, 298, 304 is 298. In order to calculate the mean for a large set of data we can use the following formula =MEDIAN(224,298,304)

When is the median useful? Well, the median is useful when you have a skewed distribution, for example you are selling candies for $3 up to $15 a bag but you have some very expensive candies for $100 a bag that nobody really purchases on a regular basis. At the end of your month you have to make a report and you will see that you sold mostly cheap candies and only a couple of the $100. In this case calculating median is more beneficial.

The easiest way to determine when to use the median vs. the mean is by creating a histogram. If your histogram is skewed with an extreme, then you know that the best way to go is by calculating the median.

The **mode **is the most common value, for example the mode for: 4,6,7,7,7,7,9,10 is 7

In Excel you can calculate the mode by using the =MODE(4,6,7,7,7,7,9,10) formula.

Although this looks nice keep in mind that in Excel the lowest mode is considered, or in other words, if you have to calculate the mode for the following data set **2,2,2,**4,5,6,**7,7,7**,8,9 you can see that you have two modes, 2 and 7 but Excel will show you only the smallest value: 2.

When can we use the mode function? Calculating the mode is beneficial only for whole numbers such as 1, 2 and 3. It is not useful for fractional numbers such as 1,744; 2.443; 3,323, as the chance to have duplicated numbers, or a mode, is very small.

A great example of calculating the mode, or the most frequent number, will be probably on a survey.

**Histograms**

Let’s say your blog recently received hundreds of guest posts, some of them are very good ones but some of them are just not that good. Maybe you want to see how many of your blog posts received 10 backlinks, 20, 30 and so on, or maybe you are interested in social shares such as tweets or likes, but why not just simply visits.

Here we will categorize them into groups by using *a visual representation called histograms*. In this example I will use visits/articles as an easy example. The way I setup my Google Analytics account is as follows. I have a profile that tracks only my blog, nothing else. If you don’t have such profile setup yet, then you can create a segment on the fly.

How are you doing this? Pretty simple:

Now go to export->CSV

Open the excel spread sheet and delete all the columns except for Landing Page and Visits. Now create the ranges (also called bins) that you want to be categorized into. Let’s say we want to see how many articles generated 100 visits, 300, 500 and so on.

Got to Data -> Data Analysis->Histograms->OK

- Input range will be the visits column
- Bin Range will be the groups
- Output Range, click on the cell where you want your histogram to show up
- Check Chart Output
- Click OK

Now you have a nice histogram that shows you the number of articles categorized by visits. To make it easier to understand this histogram, click on any cell from the Bin and Frequency table and sort the frequency by low to high.

Analyzing now the data is even easier. Now go back and sort all the articles with less or equal to 100 visits (Visit drop down->Number filters->Between…0-100->Ok) in the last month and update them, or promote them.

**Reference**:-http://www.seomoz.org/blog/excel-statistics-for-seo-and-data-analysis

**Ebriks Infotech**:-SEO Company