Machine Learning 101 - Part 1

It’s hot, it’s trendy, it’s powerful. Let’s dive in together using Python.
So, you have a set of data you want to explore. In our case, we’re going to use an example dataset from the UCI Machine Learning Repository. In particular, we’re going to explore Abalone data from:
Warwick J. Nash, Tracy L. Sellers, Simon R. Talbot, Andrew J. Cawthorn and Wes B. Ford (1994). The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait. Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).
What You See Is What You Get
The first thing you want to do is find out what your dataset actually looks like.
We need to import the Pandas package for Python, which we’ll import as pd:
import pandas as pd
Then we need to connect to the data structure, by calling:
pd.read_csv(location_of_your_target_data, names = column_names_in_your_dataset)
Set your target location:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
We can look at the dataset directly, and find out the order and column names, which we’ll set as “columns”:
columns = ['Sex','Length','Diameter','Height','WholeWeight', 'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
Now, we can connect to our data with the proper parameters using the earlier generic model:
data = pd.read_csv(data_url, names=columns)
Done!
Next, we should explore the data more intimately. First, let’s find out how many rows and columns we’re actually dealing with.
Find Rows and Columns
We can view the total number of Rows and Columns by using data.shape:
print(data.shape)
| output | 
|---|
| (4177, 9) | 
4000+ rows across 9 columns!
View Your Data
We can see what a sample of our dataset looks like by using data.head(n), where n is the number of rows to show:
print(data.head(3))
Sex  Length  Diameter  Height  WholeWeight  ShuckedWeight  VisceraWeight  \
0   M   0.455     0.365   0.095       0.5140         0.2245         0.1010   
1   M   0.350     0.265   0.090       0.2255         0.0995         0.0485   
2   F   0.530     0.420   0.135       0.6770         0.2565         0.1415   
   ShellWeight  Rings  
0         0.15     15  
1         0.07      7  
2         0.21      9  
Assay Missing Values
We should also check for missing values. Too many can cause problems, but if the number missing is low enough, we can massage the data by substituting an average for the NA values without introducing an extreme error rate.
print(data.isnull().sum())
| Category | Missing Values | 
|---|---|
| Sex | 0 | 
| Length | 0 | 
| Diameter | 0 | 
| Height | 0 | 
| WholeWeight | 0 | 
| ShuckedWeight | 0 | 
| VisceraWeight | 0 | 
| ShellWeight | 0 | 
| Rings | 0 | 
| dtype: | int64 | 
No values are missing; this is ideal!
But What If Values ARE Missing?
In this case, you have two choices, A and B.
Choice A is simple: use only the rows that are complete, an approach called Complete Case Analysis. Unfortunately, depending on how many elements are missing, you sacrifice sample size (N) and drift even further from sound statistics and fundamental research design.
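In pandas, a complete case analysis is a one-liner (a minimal sketch; the abalone set happens to have no missing values, so this is illustrative only):
#Keep only the rows with no missing values (complete case analysis)
complete_cases = data.dropna()
print(complete_cases.shape)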
Choice B is a pathway of choices, categorized as Available Case Analysis.
I. Substitution via Imputation
1. Mean
import numpy as np
from sklearn.impute import SimpleImputer
SimpleImputer(missing_values=np.nan, strategy='mean')
You can engage in simple substitution as suggested earlier, through imputation of a mean in place of the null position. Once again, you're introducing an error factor here, but with limited missing values, the impact will be lessened.
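As a minimal sketch of how you might apply this to the abalone table (purely illustrative, since this dataset has no gaps; it assumes any missing entries in the numeric columns are encoded as NaN):
import numpy as np
from sklearn.impute import SimpleImputer

numeric_columns = ['Length', 'Diameter', 'Height', 'WholeWeight',
                   'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#Replace each NaN with the mean of its column
data[numeric_columns] = imputer.fit_transform(data[numeric_columns])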
2. Regression
If you’re studying something reasonably well documented, you could substitute values from previous research with similar metrics.
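One common way to put the regression idea into code (fitting on your own complete rows rather than pulling values from prior research) might look like this minimal sketch, with hypothetical gaps in the Height column predicted from Length and Diameter:
from sklearn.linear_model import LinearRegression

#Split into rows where Height is known and rows where it is missing
known = data[data['Height'].notnull()]
missing = data[data['Height'].isnull()]

if len(missing) > 0:
    model = LinearRegression()
    model.fit(known[['Length', 'Diameter']], known['Height'])
    #Fill the gaps with the regression's predictions
    data.loc[missing.index, 'Height'] = model.predict(missing[['Length', 'Diameter']])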
3. Sample Matching
Assuming a large N, you could engage in matching imputation. This is where you find another sample with similar characteristics and use its value in place of the one you're missing.
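A rough sketch of matching imputation (hypothetical gaps again; for each incomplete row we borrow the value from the most similar complete row, judged here by Length and Diameter):
#For each row missing WholeWeight, copy it from the closest complete row
missing = data[data['WholeWeight'].isnull()]
complete = data[data['WholeWeight'].notnull()]

for idx, row in missing.iterrows():
    distances = ((complete['Length'] - row['Length']) ** 2 +
                 (complete['Diameter'] - row['Diameter']) ** 2)
    donor = complete.loc[distances.idxmin()]
    data.loc[idx, 'WholeWeight'] = donor['WholeWeight']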
There are more, but these are the relatively common approaches.
Set Bins
We can also bin our categories by Sex (or any other categorical column) and see a count for each:
print(data.groupby('Sex').size())
| Sex | Count | 
|---|---|
| F | 1307 | 
| I | 1342 | 
| M | 1528 | 
| dtype: | int64 | 
Basic Statistical Summary
You'll probably want to get some quantification of your groups quickly! We can use .describe() to yield a simple statistical summary of your columns. You'll retrieve the count, mean, standard deviation, minimum, maximum, and 25-75% quantiles.
print(data.describe())
            Length     Diameter       Height  WholeWeight  ShuckedWeight  \
count  4177.000000  4177.000000  4177.000000  4177.000000    4177.000000   
mean      0.523992     0.407881     0.139516     0.828742       0.359367   
std       0.120093     0.099240     0.041827     0.490389       0.221963   
min       0.075000     0.055000     0.000000     0.002000       0.001000   
25%       0.450000     0.350000     0.115000     0.441500       0.186000   
50%       0.545000     0.425000     0.140000     0.799500       0.336000   
75%       0.615000     0.480000     0.165000     1.153000       0.502000   
max       0.815000     0.650000     1.130000     2.825500       1.488000   
       VisceraWeight  ShellWeight        Rings  
count    4177.000000  4177.000000  4177.000000  
mean        0.180594     0.238831     9.933684  
std         0.109614     0.139203     3.224169  
min         0.000500     0.001500     1.000000  
25%         0.093500     0.130000     8.000000  
50%         0.171000     0.234000     9.000000  
75%         0.253000     0.329000    11.000000  
max         0.760000     1.005000    29.000000  
R simplicity in Python is a beautiful thing.
Visualizing Your Data
We can also produce visualizations of our groups using matplotlib's pyplot module, which we'll import as mplt.
First up, we’ll produce a histogram of our column groups:
import matplotlib.pyplot as mplt
#Histograms
data.hist()
mplt.show()

Nice to see the distributions, right?
If you're like me, you want the biggest bang for your code buck. In that case, you'll want to use a feature called scatter_matrix(your_dataset).
We'll need to invoke pandas again, and form our request as pd.plotting.scatter_matrix(your_dataset).
Hang on, because this is where things get exciting! We get to spot correlations between our columnar categories and see if our hunches have been correct.
#Correlation plots
pd.plotting.scatter_matrix(data)
mplt.show()

Now you can eyeball relationships you might want to pay particular attention to.
In an earlier post we discussed linearity of correlations; here we can see some clear linear correlations (which nicely fit our intuition as well, such as Length and Diameter). You can also observe some non-linear relationships that we'll explore later.
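If you want to put numbers on those hunches, pandas can also print the pairwise correlation coefficients directly (a small supplement to the scatter matrix; Sex is dropped because it's categorical):
#Pairwise correlation coefficients for the numeric columns
print(data.drop(columns='Sex').corr())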
The great thing is, we can pull all of these steps together in a single call by using the summary script below.
Summary of code:
import pandas as pd
import matplotlib.pyplot as mplt
def surveyData(data_url, columns):
   #Access data, apply titles
   data = pd.read_csv(data_url, names=columns)
   #find rows and columns
   print(data.shape)
   #check data structure
   print(data.head(3))
   #find missing values
   print(data.isnull().sum())
   #categorize
   print(data.groupby('Sex').size())
   #statistics
   print(data.describe())
   #Histograms
   data.hist()
   mplt.show()
   #Correlation plots
   pd.plotting.scatter_matrix(data)
   mplt.show()
#set target
data_url = "https://archive.ics.uci.edu/ml/
           machine-learning-databases/abalone/abalone.data"
#set titles                                                          
columns = ['Sex','Length','Diameter','Height','WholeWeight', 
           'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
#pass target and titles to surveyData and run the program                                     
surveyData(data_url, columns)
