Exploratory Data Analysis
Before you go data spelunking, you need to do your homework!
As a data scientist, you should approach your discipline with a foundation rooted in hard science. One of your primary goals is to keep an open mind: follow the data even when it doesn’t match your intuition. Sometimes your intuition is a great asset! But, odds are, there are patterns and relationships in the data you’re not going to divine by intuition alone. You’ll need some help.
One way to get that help is to engage in Exploratory Data Analysis (EDA).
Exploratory Data Analysis
EDA lets us do exactly what I described above: peek into the layers of information and find relationships, regardless of our intuition or prior knowledge. Best of all, you can do this visually with relative ease. In a previous post, ‘Machine learning 101’, we utilized scatter_matrix(). This will be a frequently used tool in your career for spotting relationships between independent and dependent variables, but there are other options as well, and you’ll want to pick one that fits your dataset.
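As a quick refresher, here is a minimal scatter_matrix() sketch. The DataFrame below is synthetic, purely for illustration; the column names are placeholders, not anything from our dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as mplt
from pandas.plotting import scatter_matrix

# Synthetic numeric data, purely for illustration
df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'b', 'c', 'd'])

# Plot every pairwise scatter; the diagonal shows each variable's distribution
scatter_matrix(df, alpha=0.6, figsize=(8, 8), diagonal='kde')
mplt.show()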
EDA isn’t just useful for finding predictive opportunities; it’s also a fundamental step in understanding your data. We’ve already utilized the approach, for example, when parsing our data to look for missing or null values, view the total number of entries, and assay data types. In short, you use EDA to understand your data.
According to Carnegie Mellon University, the purpose of EDA can be summarized as follows:
EDA Uses |
---|
Detection of mistakes |
Checking of assumptions |
Preliminary selection of appropriate models |
Determining relationships among the explanatory variables |
Assessing the direction and rough size of relationships between independent and dependent variables |
In addition, the primary groups of EDA include:
EDA Types |
---|
Univariate |
Multivariate |
Non-Graphical |
Graphical |
You’ll also want to consider whether your data and question involve qualitative, quantitative, categorical, or continuous features when developing your analytical approach.
For example, let’s say you were working exclusively with categorical data, where each feature takes one of a discrete set of values: yes/no, with/without, higher/lower, etc. Here, you’d want to utilize something like countplot() from the Seaborn package to follow your EDA investigative thread, rather than the scatter_matrix() approach.
ex. countplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, orient=None, color=None, palette=None, saturation=0.75, dodge=True, ax=None, **kwargs)
countplot() shows the counts of observations in each categorical bin using bars. We’ll add it to a modified chunk of code from another of my posts that returns a set of exploratory data, with the scatter_matrix portion removed, since this dataset is largely composed of non-numeric data.
In this case, we’re going to use data about post-operative patients. In particular, we will focus on the self-reported comfort level, recorded as an integer, along with the doctor’s decision as to where patients should be moved during their post-op recovery:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mplt

def surveyData(data_url, columns):
    # Access data, apply column titles
    data = pd.read_csv(data_url, names=columns)
    # Find rows and columns
    print(data.shape)
    # Check data structure
    print(data.head(3))
    # Find missing values
    print(data.isnull().sum())
    # Categorize by discharge decision
    print(data.groupby('Decision').size())
    # Summary statistics
    print(data.describe())
    # Seaborn countplot of decisions, split by reported comfort
    sns.set(style="darkgrid")
    sns.countplot(x="Decision", hue="COMFORT", data=data)
    mplt.show()
Then, we’ll run it via:
# Add column names
columns = ['L-CORE', 'L-SURF', 'L-O2', 'L-BP', 'SURF-STBL',
           'CORE-STBL', 'BP-STBL', 'COMFORT', 'Decision']
# Build the URL to the UCI repository (adjacent string literals concatenate)
data_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
            "postoperative-patient-data/post-operative.data")
# Pass the data URL and column titles to surveyData and run the program
surveyData(data_url, columns)
Dimensions
(90, 9)
90 rows, 9 columns
Head
L-CORE L-SURF L-O2 L-BP SURF-STBL CORE-STBL BP-STBL COMFORT \
0 mid low excellent mid stable stable stable 15
1 mid high excellent high stable stable stable 10
2 high low excellent high stable stable mod-stable 10
Decision
0 A
1 S
2 A
Missing Values
L-CORE 0
L-SURF 0
L-O2 0
L-BP 0
SURF-STBL 0
CORE-STBL 0
BP-STBL 0
COMFORT 0
Decision 0
dtype: int64
There are no missing values; however, we’ll see that this might be misleading once we look at a plot.
Classes
Decision
A 63
A 1
I 2
S 24
dtype: int64
Hmmm, ‘A’ appears twice…curious
Class Key: |
---|
I (patient sent to Intensive Care Unit) |
S (patient prepared to go home) |
A (patient sent to general hospital floor) |
Summary Statistics
L-CORE L-SURF L-O2 L-BP SURF-STBL CORE-STBL BP-STBL COMFORT Decision
count 90 90 90 90 90 90 90 90 90
unique 3 3 2 3 2 3 3 5 4
top mid mid good mid stable stable stable 10 A
freq 58 48 47 57 45 83 46 65 63
So, ‘10’ is the most common post-op comfort response; we should find out what scale the survey uses to make contextual sense of that value!
Plot
Our resulting graph lets us see our categorical data and its associations with patients’ reported comfort levels:
Remember when we noticed there were no missing values? It turns out someone had coded an entry as ‘?’, so while there’s no value missing from the index, it DOES represent a missing value. In addition, there appear to be multiple versions of the same label floating around, as we saw with the duplicate ‘A’ discharge decision, most likely an artifact such as stray whitespace.
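A quick way to confirm and clean up both issues, continuing from the session above. This is a hedged sketch that assumes the duplicate ‘A’ really is a whitespace artifact:

# Re-read the data, telling pandas to treat '?' as a true missing value
data = pd.read_csv(data_url, names=columns, na_values='?')
print(data.isnull().sum())

# Strip stray whitespace from the labels so 'A ' and 'A' collapse into one class
data['Decision'] = data['Decision'].str.strip()
print(data.groupby('Decision').size())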
Based on the results of this plot, self-reported comfort does not seem to be a strong predictor of the doctor’s decision. I suspect the more appropriate predictors lie in the non-numeric groups, which reflect physiological statuses such as core temperature, blood pressure, O2 levels, etc.
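One quick way to test that hunch, using the cleaned data from the sketch above, is to cross-tabulate a physiological feature against the decision. Blood pressure (L-BP) is just one example column from our dataset:

# Counts of each discharge decision within each blood-pressure category
print(pd.crosstab(data['L-BP'], data['Decision']))

# Or visualize the same breakdown with the countplot approach from earlier
sns.countplot(x='Decision', hue='L-BP', data=data)
mplt.show()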
Glad we did some EDA!
Non-Categorical
Other times, you’ll be working with continuous data, and you may just want to look for outliers, compare means and standard deviations, or plot histograms, probability distributions, and box plots.
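Here’s a minimal sketch of that workflow on synthetic continuous data, continuing the same session; since our post-operative set is categorical, the numbers below are made up:

import numpy as np

# Synthetic continuous data, purely for illustration
values = pd.Series(np.random.normal(loc=50, scale=10, size=200))

# Mean, standard deviation, quartiles, and extremes in one call
print(values.describe())

# Distribution shape and potential outliers at a glance
values.plot(kind='hist', bins=20)
mplt.show()
values.plot(kind='box')
mplt.show()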
Summary
For the most part, EDA is a simple procedure that yields tremendous value, both in the information it surfaces and in the time it saves.
So, again, DO YOUR HOMEWORK!