
Project 2: Brain Stroke

  • Writer: Varshita Yarabadi
  • Oct 12, 2022
  • 4 min read

Introduction

A stroke is a medical condition in which poor blood flow to the brain causes cell death. There are two main types of stroke: ischemic, due to lack of blood flow, and hemorrhagic, due to bleeding. Both cause parts of the brain to stop functioning properly. Signs and symptoms of a stroke may include an inability to move or feel on one side of the body, problems understanding or speaking, dizziness, or loss of vision to one side. Signs and symptoms often appear soon after the stroke has occurred. If symptoms last less than one or two hours, the stroke is a transient ischemic attack (TIA), also called a mini-stroke. A hemorrhagic stroke may also be associated with a severe headache. The symptoms of a stroke can be permanent. Long-term complications may include pneumonia and loss of bladder control.


Introduction to the Problem

Classification is an important aspect of data mining, and there are many different methods that can be used to categorize or predict the class of certain data values. For this project, I will be exploring several algorithms to analyze the risk factors associated with brain stroke using different features from the dataset. The following classification algorithms will help address the problem of brain stroke risk factors:

  1. Naive Bayes

  2. Support Vector Machine

  3. Random Forest Classifier


Introduction to Data

The data used for this project, found on Kaggle, provides information on the two different types of brain stroke: ischemic, due to lack of blood flow, and hemorrhagic, due to bleeding. There is one file in the dataset, called “brain_stroke.csv”, which contains attributes such as gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, smoking_status, and stroke.


Pre-processing the Data

The first step I took was to check the source behind the data; I found that the dataset was created for the purpose of learning classification methods. Then I went through the .csv file and checked whether any data types needed to be converted. After that, I checked for null values and for missing values, and none were found.
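As a minimal sketch, these checks can be done with pandas (assuming the Kaggle file sits in the working directory):

```python
import pandas as pd

# Load the dataset (file name from the post; the path is an assumption)
df = pd.read_csv("brain_stroke.csv")

# Data types of each column, to see whether any need converting
print(df.dtypes)

# Count null/missing values per column (the post found none)
print(df.isnull().sum())
```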




Next, I used df.describe() to output summary statistics for all the numerical columns, which I can later use to compare values and work toward a solution to the risk factor problem. Lastly, I checked which features were stored as the object data type, since these are the categorical variables I will compare against each other during the visualization process.
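Continuing the sketch above, the summary statistics and the object-dtype columns can be pulled like this:

```python
# Summary statistics for the numerical columns
print(df.describe())

# Columns stored as the object dtype, i.e. the categorical features
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(categorical_cols)
```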


Data Understanding/Visualization


I used seaborn to generate a count plot for each attribute that has object as its data type. This helps me compare and contrast each attribute's categorical values and better understand whether certain groups of patients have a higher or lower chance of having a stroke.
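A minimal sketch of that loop with seaborn (assuming df from the loading step above and that the target column is named stroke):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One count plot per object-dtype attribute, split by the stroke target
for col in df.select_dtypes(include="object").columns:
    sns.countplot(data=df, x=col, hue="stroke")
    plt.title(f"Count of {col} by stroke outcome")
    plt.show()
```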


This visualization clearly shows that the target class has an uneven distribution of observations. It let me see how balanced or unbalanced the dataset is, with 1 meaning the patient had a stroke and 0 meaning the patient did not.
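One way to quantify that imbalance is with value_counts() on the target column (a sketch, continuing from above):

```python
# Target distribution: 1 = stroke, 0 = no stroke
print(df["stroke"].value_counts())
print(df["stroke"].value_counts(normalize=True))  # as proportions
```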


For the gender attribute count plot, we can observe that females show a higher number of stroke cases than males.


For the hypertension attribute count plot, we can see that people with hypertension are more likely to have a stroke.


For the heart_disease attribute count plot, people with heart disease show a higher proportion of stroke cases.


For the ever_married attribute count plot, married people had drastically more stroke cases than non-married people.



For the work_type attribute count plot, people working in the private sector have a higher number of stroke cases than people working in other sectors.



For the residence_type count plot, living in an urban or a rural area shows no clear connection to having a stroke.


For the smoking_status count plot, people who never smoked show a lower proportion of stroke cases.


Modeling/Evaluation

Before I started modeling, I first had to identify the target (dependent) variable. In this case, I am interested in seeing how well the risk factors can predict whether a patient will have a stroke or not. I then separated the data into independent (X) and dependent (y) sets. Next, I split the data into a training and a testing set. This is essential so that we can understand how well our models perform on unseen data.
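A sketch of that split with scikit-learn; one-hot encoding the object columns via pd.get_dummies and the 80/20 split size are my assumptions, since the post does not state them:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("brain_stroke.csv")

# One-hot encode the categorical features (assumption about how the
# object columns were handled) and separate features from the target
X = pd.get_dummies(df.drop(columns=["stroke"]))
y = df["stroke"]

# Hold out 20% of the data for testing (split sizes are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```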


I will now explore several classifiers that assign the records in this dataset to different classes based on their features.


Naive Bayes

The first model that I used to classify whether a patient had a stroke or not is the Naive Bayes classifier. The Naive Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
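A minimal sketch with scikit-learn, using the Gaussian variant (my assumption; the post does not name which Naive Bayes variant was used):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Gaussian Naive Bayes, fit on the train/test split made above
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))
```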


Support Vector Machine

Second, I looked into the support vector machine, which is another supervised learning algorithm. It can be used for both classification and regression problems. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point into the correct category in the future.
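A corresponding sketch with scikit-learn's SVC (the RBF kernel is the library default; the post does not specify one):

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Support vector classifier on the same split as above
svm = SVC(kernel="rbf")
svm.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))
```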



Random Forest

The last model that I implemented was the random forest. This is a supervised machine learning algorithm that is used for classification and regression problems. The algorithm produces “more accurate and stable results by relying on a multitude of trees rather than a single decision tree”. The main advantage of this model is that it mitigates the over-fitting that single decision trees suffer from. It also works well with larger datasets.
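And a sketch of the random forest with scikit-learn (the hyperparameters shown are assumptions; 100 trees is the library default):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Ensemble of decision trees, fit on the same split as above
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```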


Storytelling/Impact

This project helped me gain an understanding of using a dataset to make predictions. Even though my dataset wasn't the best to work with, it helped me learn how to prepare such a complex dataset for modeling. I also gained good insight into the decision tree algorithm and how the metrics result in different accuracy scores. I already suspected, while I was in the data understanding/visualization stage, that the model accuracy would not be great. After visualizing a large portion of the data, I have a valuable understanding of the common attributes that relate to brain stroke.

Code: Here is my code


References

[4]https://scikit-learn.org/stable/modules/naive_bayes.html

[6]https://www.datacamp.com/tutorial/random-forests-classifier-python

