Data Analysis of the No-Show Appointment Dataset

Jun 13, 2022

This dataset has the medical records of various patients in Brazil. I got this dataset from Kaggle and I am about to give you a walkthrough of the data analysis (no machine learning here). The dataset contains many variables, such as:

The different columns with descriptions in the dataset

So, what questions are we meant to ask here? The main goal is to find out why some patients do not show up. Beyond that, there are many other questions this dataset can answer, some of which are:

There are definitely many other questions that can arise from this dataset, but we will focus on these for now. Watch this space for more updates.

First came the introduction (which I have just done). Then the data was wrangled (cleaned). Most datasets have some errors due to various factors.

Importing appropriate libraries and viewing first 5 heads

There are some issues with the Age and Handcap columns. The Age column has negative figures and Handcap has value(s) greater than 1.

I then checked how many rows have an Age of less than 0. Thankfully, it is only one row, so I dropped it from the dataset.
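A minimal sketch of this step, assuming the analysis is done with pandas. The small table below is made-up illustration data, not the real dataset, and `PatientId` is an assumed column name.

```python
import pandas as pd

# Toy stand-in for the appointments table (made-up values, not the real data).
df = pd.DataFrame({
    "PatientId": [1, 2, 3, 4],   # assumed column name
    "Age": [34, -1, 0, 62],
})

# Flag and count the rows with an impossible (negative) age.
negative_age = df["Age"] < 0
n_negative = int(negative_age.sum())

# Keep only the rows whose age is zero or more.
df_clean = df[~negative_age].reset_index(drop=True)
```

Counting first and dropping second makes it easy to confirm that only the expected number of rows is removed.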

In the Handcap column, every value that is neither 0 nor 1 was changed to 1. (The person with 4 as the value probably had 4 different disabilities; either way, that person is still disabled.)
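One way to do this recoding, sketched with pandas on made-up values. It assumes the column holds only non-negative integers, so capping at 1 is the same as changing every value other than 0 or 1 to 1.

```python
import pandas as pd

# Toy values only; the real column is assumed to hold non-negative integers.
df = pd.DataFrame({"Handcap": [0, 1, 2, 4, 0]})

# Cap every value at 1, so 2, 3 and 4 all become 1 while 0 and 1 stay as they are.
df["Handcap"] = df["Handcap"].clip(upper=1)
```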

The above is the bulk of the cleaning for this dataset (there should be more).

Now that we have cleaned the data, this is the main part of the analysis.

Question 1: What is the ratio of honoured appointments to the ones that were not honoured?

As we can see, the blue portion shows that the vast majority of appointments were honoured, while 20% of appointments did not see the light of day.
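The shares behind a chart like this can be computed directly. This is a sketch on five made-up appointments, assuming pandas and assuming the No-show column still holds the strings Yes and No, where No means the patient showed up.

```python
import pandas as pd

# Made-up appointments: "No" means the patient showed up, "Yes" means a no-show.
no_show = pd.Series(["No", "No", "No", "No", "Yes"], name="No-show")

# Fraction of appointments in each category (the fractions sum to 1).
shares = no_show.value_counts(normalize=True)
```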

Question 2: How many male and female patients were taken into consideration?

Actually, there are 62,298 patients recorded here, not 110,526 as most people might think. The latter number is the number of appointments, while the former is the number of individuals. Many people booked more than one appointment.
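The difference between appointments and individuals can be sketched as follows, assuming pandas and assuming columns named `PatientId` and `Gender`. The rows are made up for illustration.

```python
import pandas as pd

# Made-up appointments: patients 1 and 3 each booked more than once.
df = pd.DataFrame({
    "PatientId": [1, 1, 2, 3, 3, 3],          # assumed column name
    "Gender": ["F", "F", "M", "F", "F", "F"],  # assumed column name
})

# One row per appointment.
n_appointments = len(df)

# One row per individual: keep the first appointment of each patient.
patients = df.drop_duplicates(subset="PatientId")
n_patients = len(patients)

# Gender counts per patient, not per appointment.
gender_counts = patients["Gender"].value_counts()
```

Counting gender on the deduplicated table avoids counting a patient once for every appointment they booked.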

Bar plot of the gender of each patient, not the gender of each appointment

Question 3: Between males and females, who tends to have more appointments?

We know that there are more women than men, but do the men in this dataset have more appointments each, or is it the opposite?

Male and female folk tend to make appointments at the same rate
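One possible way to compare the rates is to count appointments per patient and then average those counts within each gender. This is a sketch on made-up rows, with `PatientId` and `Gender` as assumed column names. It is not necessarily how the plot above was produced.

```python
import pandas as pd

# Made-up appointments for three patients.
df = pd.DataFrame({
    "PatientId": [1, 1, 1, 2, 2, 3],
    "Gender": ["F", "F", "F", "M", "M", "F"],
})

# Number of appointments booked by each patient.
per_patient = df.groupby(["Gender", "PatientId"]).size()

# Average number of appointments per patient within each gender.
mean_per_gender = per_patient.groupby(level="Gender").mean()
```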

Question 4: What age category misses their appointments the most?

The ages were divided into the following categories: 0–17 years old are children, 18–35 are youths, 36–60 are adults, and 60 and above are elders.

From this plot, we can see that the youths are the most likely to miss their appointments, while the elderly are likely to keep theirs.
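A sketch of the binning with `pd.cut`, on made-up ages. Since 60 appears in two of the categories above, this example assumes that age 60 counts as adult and anything above it as elder. It also assumes No-show has already been recoded so that 1 means a missed appointment.

```python
import pandas as pd

# Made-up rows; No-show is 1 for a missed appointment and 0 otherwise.
df = pd.DataFrame({
    "Age": [5, 17, 18, 30, 35, 36, 60, 61, 80],
    "No-show": [0, 1, 1, 1, 0, 0, 1, 0, 0],
})

# Right-inclusive bins: (-1, 17], (17, 35], (35, 60], (60, 200].
# The upper edge of 200 is an arbitrary cap chosen for this sketch.
bins = [-1, 17, 35, 60, 200]
labels = ["children", "youths", "adults", "elders"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Share of missed appointments within each age group.
no_show_rate = df.groupby("AgeGroup", observed=True)["No-show"].mean()
```

Because No-show is coded as 0 and 1, the mean of the column within a group is that group's no-show rate.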

Question 5: Is sending SMS really effective?

From the above plot, we can see that most people did not receive reminder texts (the left part is way smaller than the right part), but you will also notice that a very large percentage of the people who did not receive a reminder text still showed up (2nd blue bar), compared to the percentage of people who received reminder texts.

The two groups of people that did not show up (the 2 orange bars) are almost the same in size, even though far fewer people received a text, which means a larger share of the people who got a text missed their appointments. Therefore, sending SMS does not appear to be effective here; if anything, it seems to be working in reverse. Why? That is a question for another day. lol
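Because the two groups differ in size, it can help to compare no-show rates within each group rather than bar heights. This is a sketch on made-up rows, with `SMS_received` as an assumed column name and No-show coded as 1 for a missed appointment.

```python
import pandas as pd

# Made-up rows: four appointments without a reminder text, two with one.
df = pd.DataFrame({
    "SMS_received": [0, 0, 0, 0, 1, 1],   # assumed column name
    "No-show": [0, 0, 0, 1, 0, 1],
})

# Share of missed appointments among those without (0) and with (1) a text.
rate_by_sms = df.groupby("SMS_received")["No-show"].mean()
```

In this toy data the same number of people miss their appointment in each group, yet the rate is twice as high among those who received a text, which is the pattern described above.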

There are some checks that I did in the background that I did not put here (for the sake of length):

Before I removed the individual with the -1 age, I used the patient ID to check whether the individual appeared elsewhere in the dataset. She did appear multiple times, which is why I dropped the row.

I replaced Yes and No in the No-show column with ones and zeroes respectively.
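A minimal sketch of that replacement, assuming pandas and using made-up values:

```python
import pandas as pd

# Made-up values for the No-show column.
df = pd.DataFrame({"No-show": ["No", "Yes", "No"]})

# Yes (missed the appointment) becomes 1 and No (showed up) becomes 0.
df["No-show"] = df["No-show"].map({"Yes": 1, "No": 0})
```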

There are more things I can do to analyze this dataset more thoroughly; for example, I did not use the scheduling and appointment days at all. This article will be updated in the future in that regard.