Easy data mining via Python

Note: This essay uses Python for data mining.

Agenda

  • Pandas
  • Pandas-Profiling
  • Statsmodels
  • Missingno
  • Wordcloud

Pandas

Pandas is a Python library for exploring, processing, and modeling data.

Here we take the MIMIC-III dataset as an example.

Basic stats

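# First load df from a file, e.g. with pd.read_csv()
# Replace "a_column" with the name of a numeric column in your data
df.head()                   # first five rows
df.shape                    # (number of rows, number of columns)
df["a_column"].mean()
df["a_column"].std()
df["a_column"].max()
df["a_column"].min()
df["a_column"].quantile()   # default q=0.5, i.e. the median
df.describe()               # brief description
df.isna().any()             # check whether each column has missing values
df.isna().sum()             # count NaN values per column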
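
As a concrete illustration of the calls above, the following sketch runs a few of them on a tiny made-up table. The column names `age` and `los_days` and all values are hypothetical and are not taken from MIMIC-III.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from MIMIC-III); one value is missing on purpose
df = pd.DataFrame({
    "age": [34, 51, 67, 80],
    "los_days": [2.0, np.nan, 5.0, 9.0],
})

mean_age = df["age"].mean()                  # arithmetic mean of one column
q = df["age"].quantile([0.25, 0.5, 0.75])    # several quantiles at once
has_missing = df.isna().any()                # True for columns containing NaN
n_missing = df.isna().sum()                  # number of NaN values per column

print(mean_age)
print(q)
print(n_missing)
```

Passing a list to `quantile` returns a Series indexed by the requested quantiles, which is often handier than calling it once per value.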

Additional tips

Charting a tabular dataset

Supported charts

DataFrame.plot([x, y], kind)

 - kind :

    - 'line': line plot (default)
    - 'bar': vertical bar plot
    - 'barh': horizontal bar plot
    - 'hist': histogram
    - 'box': boxplot
    - 'kde': Kernel Density Estimation plot
    - 'density': same as 'kde'
    - 'area': stacked area plot
    - 'pie': pie plot
    - 'scatter': scatter plot
    - 'hexbin': Hexagonal binning plot
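
To show how `kind` selects the chart type, here is a minimal sketch on made-up data. The column names and values are hypothetical, and the `Agg` backend is chosen only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import pandas as pd

# Hypothetical toy data, not taken from MIMIC-III
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 8]})

ax_line = df.plot(x="x", y="y")                     # kind defaults to 'line'
ax_scatter = df.plot(x="x", y="y", kind="scatter")  # same data as a scatter plot
ax_hist = df.plot(y="y", kind="hist", bins=4)       # histogram of one column
```

Each call returns a matplotlib Axes object, so titles, labels, and `savefig` can be applied afterwards in the usual matplotlib way.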

show


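import pandas as pd
a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = a.columns.str.lower()  # normalize column names to lowercase
# count admissions per marital status and draw them as a pie chart
a.groupby(['marital_status']).count()['row_id'].plot(kind='pie')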
[Figure: pie chart of admissions by marital status]

a.groupby(['religion']).count()['row_id'].plot(kind = 'barh')
[Figure: horizontal bar chart of admissions by religion]

p = pd.read_csv("/path/mimic_demo/patients.csv")
p.columns = p.columns.str.lower()
# join admissions with patients, then count admissions per religion and gender
ap = pd.merge(a, p, on='subject_id', how='inner')
ap.groupby(['religion', 'gender']).size().unstack().plot(kind="barh", stacked=True)
[Figure: stacked horizontal bar chart of admissions by religion and gender]

c = pd.read_csv("/path/mimic_demo/cptevents.csv")
c.columns = c.columns.str.lower()
# join admissions with CPT events, then count events per discharge location and section header
ac = pd.merge(a, c, on='hadm_id', how='inner')
ac.groupby(['discharge_location', 'sectionheader']).size().unstack().plot(kind="barh", stacked=True)
[Figure: stacked horizontal bar chart of CPT events by discharge location and section header]
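
The pattern behind the last two charts (merge, group by two keys, `size()`, `unstack()`) is easier to follow on a small table. The sketch below uses hypothetical admissions and patients rows, not MIMIC-III data, and stops before plotting so that the intermediate table is visible.

```python
import pandas as pd

# Hypothetical stand-ins for the admissions and patients tables
a = pd.DataFrame({
    "subject_id": [1, 2, 3, 3],
    "religion": ["CATHOLIC", "JEWISH", "CATHOLIC", "CATHOLIC"],
})
p = pd.DataFrame({"subject_id": [1, 2, 3], "gender": ["F", "M", "M"]})

# Inner join keeps only rows whose subject_id appears in both tables
ap = pd.merge(a, p, on="subject_id", how="inner")

# size() counts rows per (religion, gender) pair;
# unstack() moves gender into columns, filling absent pairs with 0
counts = ap.groupby(["religion", "gender"]).size().unstack(fill_value=0)
print(counts)
```

Calling `counts.plot(kind="barh", stacked=True)` on this table draws one bar per religion, split by gender.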

Pandas-Profiling

Pandas-Profiling is a Python library for exploratory data analysis.

A quick example

import pandas as pd
import pandas_profiling
a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = a.columns.str.lower()
# ignore the time columns when profiling since they are uninteresting
cols = [c for c in a.columns if not c.endswith('time')]
# keep the report in a variable so it can be saved afterwards
profile = pandas_profiling.ProfileReport(a[cols], explorative=True)

Save the generated profile to an HTML file.

profile.to_file("/path/data_profile.html")

Statsmodels

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Basic stats

For simplicity, we use statsmodels' describe function for quick stats. (Note: the Describe class has been deprecated in favor of Description and its simplified functional version, describe. Describe will be removed after 0.13.) The selectable statistics include:

  • “nobs” - Number of observations

  • “missing” - Number of missing observations

  • “mean” - Mean

  • “std_err” - Standard Error of the mean assuming no correlation

  • “ci” - Confidence interval with coverage (1 - alpha) using the normal or t. This option creates two entries in any tables: lower_ci and upper_ci.

  • “std” - Standard Deviation

  • “iqr” - Interquartile range

  • “iqr_normal” - Interquartile range relative to a Normal

  • “mad” - Mean absolute deviation

  • “mad_normal” - Mean absolute deviation relative to a Normal

  • “coef_var” - Coefficient of variation

  • “range” - Range between the maximum and the minimum

  • “max” - The maximum

  • “min” - The minimum

  • “skew” - The skewness defined as the standardized 3rd central moment

  • “kurtosis” - The kurtosis defined as the standardized 4th central moment

  • “jarque_bera” - The Jarque-Bera test statistic for normality based on the skewness and kurtosis. This option creates two entries, jarque_bera and jarque_bera_pval.

  • “mode” - The mode of the data. This option creates two entries in all tables, mode and mode_freq which is the empirical frequency of the modal value.

  • “median” - The median of the data.

  • “percentiles” - The percentiles. Values included depend on the input value of percentiles.

  • “distinct” - The number of distinct categories in a categorical.

  • “top” - The most common categories. Labeled top_n for n in 1, 2, …, ntop.

  • “freq” - The frequency of the common categories. Labeled freq_n for n in 1, 2, …, ntop.

import pandas as pd
import statsmodels.stats.descriptivestats as dst

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

a = pd.read_csv("/path/xx.csv") # load your file via pd.read_xx()
de = dst.describe(a)

# Save both descriptions to Excel (pandas also supports other formats)
a.describe().to_excel("./pd_des.xlsx")
de.to_excel("./sm_des.xlsx")
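
To compute only some of the statistics listed above, `describe` accepts a `stats` argument. The following is a minimal sketch on made-up data. The column names and values are hypothetical, and it assumes a statsmodels version that provides the functional `describe`.

```python
import numpy as np
import pandas as pd
import statsmodels.stats.descriptivestats as dst

# Hypothetical toy data, not taken from MIMIC-III
df = pd.DataFrame({
    "age": [34.0, 51.0, 67.0, 80.0, 45.0],
    "los_days": [2.0, np.nan, 5.0, 9.0, 3.0],
})

# Request a subset of the statistics by name
summary = dst.describe(df, stats=["nobs", "missing", "mean", "std"])
print(summary)  # one row per statistic, one column per variable
```

Restricting `stats` keeps the resulting table short, which helps when it is exported for a report.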

Missingno

Missingno offers a visual summary of the completeness of a dataset. This example suggests two observations about the ADMISSIONS table:

  • Not every patient is admitted through the emergency department, as edregtime and edouttime have many missing values.
  • Patient language is now a mandatory field, but it was not always one.

import pandas as pd
import missingno as msno
a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = a.columns.str.lower()  # normalize column names to lowercase
msno.matrix(a)
[Figure: missingno matrix of the ADMISSIONS table]

Missingno also supports bar charts, heatmaps, and dendrograms; see its GitHub repository for details.
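
The matrix is a picture of where values are present. As a rough numeric counterpart, the share of non-missing values per column can be computed with pandas alone. This sketch uses hypothetical rows; only the column names echo the ADMISSIONS table discussed above.

```python
import pandas as pd

# Hypothetical rows; only the column names echo the ADMISSIONS table
a = pd.DataFrame({
    "hadm_id": [100, 101, 102, 103],
    "edregtime": ["2130-01-01 10:00", None, None, "2131-05-02 08:30"],
    "language": [None, None, "ENGL", "ENGL"],
})

# Fraction of non-missing values per column (1.0 means fully populated)
completeness = a.notna().mean().sort_values()
print(completeness)
```

Columns with the lowest values are the ones that show up as mostly white stripes in the missingno matrix.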

Wordcloud

Wordcloud visualizes a given text in a word-cloud format.

This example shows that sepsis is one of the most frequent terms in the patients' diagnoses.

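from wordcloud import WordCloud
# `a` is the admissions DataFrame loaded earlier, with lowercase column names
# Prepare the input text: join all non-missing diagnoses into one string
text = " ".join(a['diagnosis'].dropna().astype(str))
wordcloud = WordCloud().generate(text)  # Generate a word cloud from the input text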

# Plot the word-cloud 
import matplotlib.pyplot as plt
plt.figure(figsize = (10,10))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()
[Figure: word cloud of admission diagnoses]
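
A word cloud sizes words by how often they occur. To check the counts behind the picture, a plain frequency table can be built with the standard library. The diagnosis strings below are hypothetical, and the simple split on non-letters is only an approximation of WordCloud's own tokenization, which also removes stop words by default.

```python
import re
from collections import Counter

# Hypothetical diagnosis strings, not taken from MIMIC-III
diagnoses = ["SEPSIS", "PNEUMONIA", "SEPSIS;PNEUMONIA", "UROSEPSIS", "SEPSIS"]

# Join into one text, split into lowercase words, and count them
text = " ".join(diagnoses).lower()
words = re.findall(r"[a-z]+", text)
freq = Counter(words)

print(freq.most_common(3))
```

A table like this makes it possible to state how often a term appears, which the word cloud alone only hints at.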
