Easy data mining via Python
Note: This essay uses Python for data mining.
Agenda
- Pandas
- Pandas-Profiling
- Statsmodels
- Missingno
- Wordcloud
Pandas
Pandas is a Python library for exploring, processing, and modeling data.
Here we take the MIMIC-III dataset as an example.
Basic stats
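# First load df from a file, e.g. df = pd.read_csv("/path/xx.csv")
df.head()  # first five rows
df.shape  # (number of rows, number of columns)
df["a_column"].mean()  # replace "a_column" with a real column name
df["a_column"].std()
df["a_column"].max()
df["a_column"].min()
df["a_column"].quantile()  # defaults to q=0.5, the median
df.describe()  # brief description
df.isna().any()  # check whether each column has missing values
df.isna().sum()  # count NaN values per column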
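As a minimal sketch of the calls above, the following builds a small hypothetical DataFrame and runs the same statistics on it. The column names `age` and `los_days` and all values are invented for illustration; they are not taken from MIMIC-III.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "age": [34, 51, 67, np.nan, 45],
    "los_days": [2.0, 5.5, 3.0, 8.0, 1.5],
})

mean_age = df["age"].mean()             # NaN values are skipped by default
median_los = df["los_days"].quantile()  # q defaults to 0.5, i.e. the median
has_missing = df.isna().any()           # one boolean per column
missing_counts = df.isna().sum()        # number of NaN values per column

print(df.shape)
print(mean_age, median_los)
print(missing_counts)
```

Note that most pandas reductions skip NaN values unless `skipna=False` is passed, so a mean can look reasonable even when a column is partly empty. Checking `isna()` first is a cheap safeguard.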
Additional tips
Charting a tabular dataset
Supported charts
DataFrame.plot([x, y], kind)
- kind :
- 'line': line plot (default)
- 'bar': vertical bar plot
- 'barh': horizontal bar plot
- 'hist': histogram
- 'box': boxplot
- 'kde': Kernel Density Estimation plot
- 'density': same as 'kde'
- 'area': stacked area plot
- 'pie': pie plot
- 'scatter': scatter plot
- 'hexbin': Hexagonal binning plot
import pandas as pd
a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = map(str.lower, a.columns)
a.groupby(['marital_status']).count()['row_id'].plot(kind='pie')
data:image/s3,"s3://crabby-images/d0dbe/d0dbe1c1246ec59bb57d6b249fd83dcf1eb950f1" alt=""
a.groupby(['religion']).count()['row_id'].plot(kind = 'barh')
data:image/s3,"s3://crabby-images/457f9/457f91a8be72e0923ea1ee8fa697cdcd8ceeb903" alt=""
p = pd.read_csv("/path/mimic_demo/patients.csv")
p.columns = map(str.lower, p.columns)
ap = pd.merge(a, p, on = 'subject_id' , how = 'inner')
ap.groupby(['religion','gender']).size().unstack().plot(kind="barh", stacked=True)
data:image/s3,"s3://crabby-images/251f2/251f24f7c031d361cd3ae34d1404f8928c1679cb" alt=""
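To see what the stacked bar chart is actually drawing, it can help to inspect the table produced by `groupby(...).size().unstack()` before plotting. The sketch below uses a tiny hypothetical stand-in for the merged table (rows invented for illustration) and adds `fill_value=0`, which the original call does not use, so that absent combinations show up as 0 instead of NaN.

```python
import pandas as pd

# Hypothetical stand-in for the merged admissions/patients table.
ap = pd.DataFrame({
    "religion": ["CATHOLIC", "CATHOLIC", "JEWISH", "CATHOLIC", "JEWISH"],
    "gender": ["F", "M", "F", "F", "F"],
})

# size() counts rows per (religion, gender) pair; unstack() pivots the
# inner index level (gender) into columns, one column per stacked segment.
counts = ap.groupby(["religion", "gender"]).size().unstack(fill_value=0)
print(counts)

# counts.plot(kind="barh", stacked=True) would then draw one bar per religion,
# split into one segment per gender column.
```

Each row of `counts` becomes one bar and each column becomes one colored segment, which is why the inner grouping key ends up in the legend.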
c = pd.read_csv("/path/mimic_demo/cptevents.csv")
c.columns = map(str.lower, c.columns)
ac = pd.merge(a, c, on = 'hadm_id' , how = 'inner')
ac.groupby(['discharge_location','sectionheader']).size().unstack().plot(kind="barh", stacked=True)
data:image/s3,"s3://crabby-images/c95d5/c95d5e29a7df93bc108f8a6673113b20456cf97c" alt=""
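The inner merge on `hadm_id` deserves a quick sanity check, because one admission can match several event rows. The following sketch uses two hypothetical miniature tables (all rows invented for illustration) to show which keys survive the join and how rows multiply.

```python
import pandas as pd

# Hypothetical miniature tables; the rows are invented for illustration.
a = pd.DataFrame({
    "hadm_id": [100, 101, 102],
    "discharge_location": ["HOME", "SNF", "HOME"],
})
c = pd.DataFrame({
    "hadm_id": [100, 100, 101, 999],
    "sectionheader": ["Medicine", "Surgery", "Medicine", "Radiology"],
})

# Inner join: keep only hadm_id values that appear in both tables.
ac = pd.merge(a, c, on="hadm_id", how="inner")
print(ac)

# Admission 100 appears twice (two events); 102 and 999 are dropped.
rows_for_100 = int((ac["hadm_id"] == 100).sum())
```

Because an admission with several events contributes several rows, the counts in a chart built from such a merge are counts of events, not of admissions.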
Pandas-Profiling
Pandas-Profiling is a Python library for exploratory data analysis.
A quick example
import pandas as pd
import pandas_profiling
a = pd.read_csv("/path/mimic_demo/admissions.csv")
a.columns = map(str.lower, a.columns)
# ignore the times when profiling since they are uninteresting
cols = [c for c in a.columns if not c.endswith('time')]
profile = pandas_profiling.ProfileReport(a[cols], explorative=True)
Save the generated profile to an HTML file.
profile.to_file("/path/data_profile.html")
Statsmodels
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
Basic stats
For simplicity, we use statsmodels' describe for quick stats. (Note: Describe has been deprecated in favor of Description and its simplified functional version, describe. Describe will be removed after 0.13.) The selectable statistics include:
“nobs” - Number of observations
“missing” - Number of missing observations
“mean” - Mean
“std_err” - Standard Error of the mean assuming no correlation
“ci” - Confidence interval with coverage (1 - alpha) using the normal or t. This option creates two entries in any tables: lower_ci and upper_ci.
“std” - Standard Deviation
“iqr” - Interquartile range
“iqr_normal” - Interquartile range relative to a Normal
“mad” - Mean absolute deviation
“mad_normal” - Mean absolute deviation relative to a Normal
“coef_var” - Coefficient of variation
“range” - Range between the maximum and the minimum
“max” - The maximum
“min” - The minimum
“skew” - The skewness defined as the standardized 3rd central moment
“kurtosis” - The kurtosis defined as the standardized 4th central moment
“jarque_bera” - The Jarque-Bera test statistic for normality based on the skewness and kurtosis. This option creates two entries, jarque_bera and jarque_bera_pval.
“mode” - The mode of the data. This option creates two entries in all tables, mode and mode_freq which is the empirical frequency of the modal value.
“median” - The median of the data.
“percentiles” - The percentiles. Values included depend on the input value of percentiles.
“distinct” - The number of distinct categories in a categorical.
“top” - The most common categories. Labeled top_n for n in 1, 2, …, ntop.
“freq” - The frequency of the common categories. Labeled freq_n for n in 1, 2, …, ntop.
import pandas as pd
import statsmodels.stats.descriptivestats as dst
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
a = pd.read_csv("/path/xx.csv") # load your file via pd.read_xx()
de = dst.describe(a)
# Save the descriptions to Excel (pandas also supports other formats)
a.describe().to_excel("./pd_des.xlsx")
de.to_excel("./sm_des.xlsx")
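To make a few of the listed statistics concrete, the sketch below recomputes them by hand with plain pandas on a hypothetical series (values invented for illustration). It is meant only to illustrate the definitions; statsmodels' own conventions, for example in how missing values or quantile interpolation are handled, may differ in detail.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration.
x = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0])

n = x.count()                              # number of non-missing values
mean = x.mean()
std = x.std()                              # sample standard deviation (ddof=1)
std_err = std / np.sqrt(n)                 # standard error of the mean
coef_var = std / mean                      # coefficient of variation
iqr = x.quantile(0.75) - x.quantile(0.25)  # interquartile range
value_range = x.max() - x.min()            # range between maximum and minimum

print(mean, std, std_err, coef_var, iqr, value_range)
```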
Missingno
Missingno offers a visual summary of the completeness of a dataset. This example suggests a few intuitive observations about the ADMISSIONS table:
- Not every patient is admitted to the emergency department, as there are many missing values in edregtime and edouttime.
- Language data is now a mandatory field for patients, but it used not to be.
import missingno as msno
import pandas as pd
a = pd.read_csv("/path/mimic_demo/admissions.csv")
msno.matrix(a)
data:image/s3,"s3://crabby-images/4d79d/4d79d2eb4fc09b950fce48886ac2a81043371cb2" alt=""
Missingno also supports bar charts, heatmaps, and dendrograms; see its GitHub repository for details.
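A numeric counterpart to the matrix plot can be obtained with plain pandas by computing the share of missing values per column. The sketch below uses a hypothetical slice of an admissions-like table (all values invented for illustration).

```python
import pandas as pd

# Hypothetical slice of an admissions-like table; values are invented.
a = pd.DataFrame({
    "hadm_id": [1, 2, 3, 4],
    "edregtime": ["2130-01-01 10:00", None, None, "2130-02-03 08:30"],
    "language": [None, "ENGL", "ENGL", "ENGL"],
})

# isna() gives a boolean table; the column-wise mean of booleans is the
# fraction of missing values. Sort so the most incomplete column comes first.
missing_share = a.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The plot shows where in the table the gaps occur, while these fractions show how large they are, so the two views complement each other.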
Wordcloud
Wordcloud visualizes a given text in a word-cloud format.
This example illustrates that the majority of patients suffered from sepsis.
from wordcloud import WordCloud
text = " ".join(a['diagnosis'].dropna().astype(str))  # Join the diagnoses into one input string
wordcloud = WordCloud().generate(text)  # Generate a word-cloud from the input text
# Plot the word-cloud
import matplotlib.pyplot as plt
plt.figure(figsize = (10,10))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show()
data:image/s3,"s3://crabby-images/76cf7/76cf7687eb6ccd9d85f3add12a47f4dfffafe518" alt=""
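A word cloud gives a visual impression only, so it can be worth backing it up with explicit word counts. The sketch below counts words in a few hypothetical diagnosis strings (invented for illustration) using the standard library. It uses a simple letters-only tokenizer, which is an assumption of this sketch and not necessarily how WordCloud splits and filters text.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical diagnosis strings, invented for illustration.
diagnosis = pd.Series([
    "SEPSIS",
    "PNEUMONIA;SEPSIS",
    None,
    "CONGESTIVE HEART FAILURE",
    "SEPSIS",
])

# Join the non-missing entries, then split on anything that is not a letter.
text = " ".join(diagnosis.dropna().astype(str))
words = re.findall(r"[A-Za-z]+", text.upper())

word_counts = Counter(words)
top_word, top_count = word_counts.most_common(1)[0]
print(word_counts.most_common(3))
```

Counting words this way also makes it easier to tell whether a term dominates because many patients share it or because it is repeated within a few long entries.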