Data Analysis of the No-Show Appointment Dataset
This dataset has the medical appointment records of various patients in Brazil. I got it from Kaggle, and I am about to give you a walkthrough of the data analysis (no machine learning here). The dataset contains many variables such as:
So, what questions are we meant to ask here? The main goal is to analyze the data and find reasons why some patients do not show up. Beyond that, there are many questions this dataset can answer, some of which are:
There are definitely many other questions that can arise from this dataset, but we will focus on these for now. Watch this space for more updates.
First came the introduction (which I have just done). Then the data was wrangled (cleaned). Most datasets have some errors due to various factors.
I then checked how many rows have an Age value less than 0. Thankfully, it is only one row, so I dropped it from the dataset.
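As a rough sketch of this step, assuming the analysis is done with pandas: the small table below is a made-up stand-in for the real data, and only the Age column name comes from the dataset itself.

```python
import pandas as pd

# Toy stand-in for the appointments table (values are invented for illustration).
df = pd.DataFrame({
    "PatientId": [1, 2, 3, 4],
    "Age": [34, -1, 0, 62],
})

# Flag rows with an impossible (negative) age and count them.
negative_age = df["Age"] < 0
print("Rows with negative age:", negative_age.sum())

# Keep only the rows with a valid age.
df = df[~negative_age].reset_index(drop=True)
print("Rows left:", len(df))
```

An age of 0 is kept on purpose, since it can simply mean a baby under one year old.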
In the Handcap column, every value that is not 0 or 1 is changed to 1. (The person with a value of 4 probably had 4 different disabilities; either way, that person still has a disability.)
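One way to do this recoding, again assuming pandas and using invented sample values:

```python
import pandas as pd

# Toy Handcap values; anything other than 0 or 1 should become 1.
df = pd.DataFrame({"Handcap": [0, 1, 2, 4, 0]})

# Select the rows whose value is neither 0 nor 1 and overwrite them with 1.
not_binary = ~df["Handcap"].isin([0, 1])
df.loc[not_binary, "Handcap"] = 1

print(df["Handcap"].tolist())
```

This turns the column into a simple yes/no flag, at the cost of losing the count of disabilities.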
The above is the bulk of the cleaning for this dataset (there should be more).
Now that we have cleaned the data, we move on to the main part of the analysis.
Question 1: What is the ratio of honoured appointments to the ones that were not honoured?
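A minimal sketch of how this ratio could be computed with pandas. It assumes the No-show column still holds its original Yes/No labels, where No means the patient showed up; the sample rows are made up.

```python
import pandas as pd

# Toy No-show labels: "No" = appointment honoured, "Yes" = appointment missed.
df = pd.DataFrame({"No-show": ["No", "No", "Yes", "No", "Yes"]})

counts = df["No-show"].value_counts()
honoured = counts.get("No", 0)
missed = counts.get("Yes", 0)

# Ratio of honoured to missed appointments, plus each label's share of the total.
ratio = honoured / missed
shares = df["No-show"].value_counts(normalize=True)

print("Honoured:", honoured, "Missed:", missed, "Ratio:", ratio)
print(shares)
```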
Question 2: How many male and female patients were taken into consideration?
Actually, there are 62,298 patients recorded here, not 110,526 as most people might think. The latter number is the number of appointments, while the former is the number of individuals. Many people booked more than one appointment.
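The difference between appointments and individuals can be illustrated with a small pandas sketch. The PatientId and Gender column names are assumptions about how the data is labelled, and the rows are invented.

```python
import pandas as pd

# Toy data: six appointments booked by three different patients.
df = pd.DataFrame({
    "PatientId": [101, 101, 102, 103, 103, 103],
    "Gender": ["F", "F", "M", "F", "F", "F"],
})

n_appointments = len(df)                # one row per appointment
n_patients = df["PatientId"].nunique()  # distinct individuals

# Keep one row per patient so each person is counted once per gender.
patients = df.drop_duplicates(subset="PatientId")
patients_by_gender = patients["Gender"].value_counts()

print("Appointments:", n_appointments, "Patients:", n_patients)
print(patients_by_gender)
```

Counting genders on the raw rows would count appointments rather than people, which is why the duplicates are dropped first.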
Question 3: Between males and females, who tends to have more appointments?
We know that there are more women than men, but do the men in this dataset have more appointments, or is it the opposite?
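Because there are more women than men, comparing raw appointment totals can be misleading. One possible approach, sketched here with pandas on invented rows and assumed column names (PatientId, Gender), is to compare appointments per patient:

```python
import pandas as pd

# Toy data: appointments with the patient who booked them and that patient's gender.
df = pd.DataFrame({
    "PatientId": [101, 101, 102, 103, 103, 103, 104],
    "Gender": ["F", "F", "M", "F", "F", "F", "M"],
})

# For each gender: total appointments and number of distinct patients.
summary = df.groupby("Gender").agg(
    appointments=("PatientId", "size"),
    patients=("PatientId", "nunique"),
)

# Average number of appointments booked by one patient of that gender.
summary["appointments_per_patient"] = summary["appointments"] / summary["patients"]
print(summary)
```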
Question 4: Which age category misses its appointments the most?
The ages were divided into the following categories: 0–17 years old are children, 18–35 are youths, 36–60 are adults, and those above 60 are elders.
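The grouping could be sketched with pandas as follows. The AgeGroup column name is made up for this example, the rows are invented, No-show is assumed to be already encoded as 1 for a missed appointment, and age 60 is placed with the adults, which is my reading of the boundaries.

```python
import pandas as pd

# Toy data: ages with a 0/1 No-show flag (1 = missed the appointment).
df = pd.DataFrame({
    "Age": [5, 17, 18, 35, 36, 60, 61, 90],
    "No-show": [0, 1, 1, 1, 0, 0, 0, 0],
})

# Left-closed bins: [0, 18), [18, 36), [36, 61), [61, inf).
bins = [0, 18, 36, 61, float("inf")]
labels = ["children", "youths", "adults", "elders"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

# Share of missed appointments within each age group.
miss_rate = df.groupby("AgeGroup", observed=True)["No-show"].mean()
print(miss_rate)
print("Highest miss rate:", miss_rate.idxmax())
```

Using the rate within each group, rather than the raw count, keeps a large group from looking worse just because it has more appointments.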
Question 5: Is sending SMS really effective?
From the above plot, we can see that most people did not receive reminder texts (the left part is way smaller than the right part). You will also notice that a very large percentage of the people who did not receive a reminder text still showed up (2nd blue bar), compared to the percentage of people who received reminder texts.
The two groups of people who did not show up (the 2 orange bars) are even almost the same size. So, in this dataset, the SMS reminders do not appear to be effective; if anything, they seem to work in reverse. Why? That is a question for another day. lol
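Since the two groups differ a lot in size, it can help to compare no-show rates within each group instead of bar heights. Here is a hedged sketch with pandas; SMS_received is an assumed column name, No-show is assumed to be encoded as 1 for a missed appointment, and the rows are invented, so the numbers say nothing about the real data.

```python
import pandas as pd

# Toy data: whether a reminder was sent (1) and whether the appointment was missed (1).
df = pd.DataFrame({
    "SMS_received": [0, 0, 0, 0, 1, 1, 1, 1],
    "No-show": [0, 0, 0, 1, 0, 0, 1, 1],
})

# Share of missed appointments among those without and with a reminder.
no_show_rate = df.groupby("SMS_received")["No-show"].mean()

# Same comparison as a table: each row sums to 1.
table = pd.crosstab(df["SMS_received"], df["No-show"], normalize="index")

print(no_show_rate)
print(table)
```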
There are some checks that I did in the background that I did not include here (for the sake of length):
Before I removed the individual with the -1 age, I used the patient ID to check whether the individual appeared elsewhere in the dataset. She did appear multiple times, which is why I dropped the row.
I replaced Yes and No in the No-show column with ones and zeroes respectively.
There is more I can do to analyze this dataset more thoroughly. For example, I did not use the scheduling and appointment days at all. This article will be updated in the future to cover that.
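Both background checks could look roughly like this in pandas. The PatientId column name is an assumption and the rows are made up.

```python
import pandas as pd

# Toy data: patient 7 has one row with an invalid age and one valid row.
df = pd.DataFrame({
    "PatientId": [7, 7, 8, 9],
    "Age": [-1, 30, 45, 12],
    "No-show": ["No", "Yes", "No", "No"],
})

# Check 1: does the patient with the negative age appear elsewhere with a valid age?
bad_ids = df.loc[df["Age"] < 0, "PatientId"]
other_rows = df[df["PatientId"].isin(bad_ids) & (df["Age"] >= 0)]
print("Other rows for that patient:", len(other_rows))

# Check 2: encode the labels as numbers, Yes -> 1 (missed) and No -> 0 (showed up).
df["No-show"] = df["No-show"].map({"Yes": 1, "No": 0})
print(df["No-show"].tolist())
```

With the column encoded as 0 and 1, the mean of No-show directly gives the share of missed appointments.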