WeRateDogs Data Wrangling

4 min read · Jul 2, 2022

This is a data wrangling project from Udacity on the dog ratings posted by the Twitter account WeRateDogs. The account posts pictures of dogs along with their ratings, which are usually given out of 10. The project asks the data analyst (you and I) to wrangle the data, then analyze, visualize and report the findings. Data wrangling means gathering, assessing and cleaning data.

This is Olly. Took a snooze through a rain storm and created a rain angel. Is very pleased with his ability to make new friends even in his sleep. 13/10 for both — Meet Olly, the rain angel dog

Introduction

This project focuses on data wrangling and is one of the projects to be completed in partial fulfillment of the Udacity Data Analysis Nanodegree.

How do we start?

Gathering Data

The first dataset was a csv file that was sent to Udacity by Twitter; it had already been worked on, though not fully. The gathering process here was straightforward: I used the pandas function read_csv to load the file into a Jupyter notebook.
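A minimal sketch of this loading step. The rows below are made up and stand in for the archive file so the snippet runs on its own; the text column is an assumed name. With the real file you would pass its path to read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up stand-in for the archive csv; in the project this would be a file path.
sample_csv = io.StringIO(
    "tweet_id,timestamp,text,rating_numerator,rating_denominator,name\n"
    "1001,2017-08-01 16:23:56 +0000,This is Olly. 13/10,13,10,Olly\n"
    "1002,2017-08-01 00:17:27 +0000,Here is a good dog. 12/10,12,10,a\n"
)

# read_csv accepts a path or any file-like object.
archive = pd.read_csv(sample_csv)
print(archive.head())
```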

The second dataset, the image predictions, was downloaded using the Requests library. Since it was a tsv file, it was read into a dataframe using ‘\t’ as the separator.
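The download-then-parse step could look roughly like this. The helper names download_text and read_predictions are hypothetical, the sample row is made up, and the p1_dog column is an assumed name; the real URL comes from the project instructions and is not reproduced here.

```python
import io

import pandas as pd


def download_text(url: str) -> str:
    """Fetch a URL with Requests and return the response body as text."""
    import requests  # imported here so the parsing helper works without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def read_predictions(tsv_text: str) -> pd.DataFrame:
    """Parse tab-separated text into a dataframe."""
    return pd.read_csv(io.StringIO(tsv_text), sep="\t")


# Made-up sample used in place of download_text(<project URL>).
sample_tsv = (
    "tweet_id\tjpg_url\tp1\tp1_dog\n"
    "1001\thttps://example.com/a.jpg\tgolden_retriever\tTrue\n"
)
predictions = read_predictions(sample_tsv)
print(predictions)
```

Keeping the download and the parsing in separate functions makes it easy to save the raw file to disk once and re-read it later without hitting the network again.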

The third dataset came from querying Twitter’s API using the Tweepy library. I first had to sign up for a Twitter account and apply for a developer account. Once Twitter had reviewed my reasons for applying, it granted me access and gave me some secret keys. I used the tweet_id values from the first dataset to query the Twitter API, writing the content into a text file which was converted to a json file and read into a dataframe using the read_json pandas function. This dataset is required because the first dataset did not have 2 important variables, retweet count and favorite count.
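The API call itself needs credentials, so it is not shown here. The sketch below only covers the second half of the step: writing one JSON object per line and reading the result back with read_json. The records are made up, and the in-memory buffer stands in for the text file on disk.

```python
import io
import json

import pandas as pd

# Made-up records standing in for the tweet JSON returned by the API.
tweets = [
    {"id": 1001, "retweet_count": 8000, "favorite_count": 35000},
    {"id": 1002, "retweet_count": 6000, "favorite_count": 30000},
]

# Write one JSON object per line, as you would to a text file on disk.
buffer = io.StringIO()
for tweet in tweets:
    buffer.write(json.dumps(tweet) + "\n")
buffer.seek(0)

# lines=True tells pandas that each line is a separate JSON object.
api_df = pd.read_json(buffer, lines=True)

# Keep only the columns needed from this dataset.
api_df = api_df[["id", "retweet_count", "favorite_count"]]
print(api_df)
```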

Assessing the Data

Assessing the datasets was pretty straightforward. Visual assessment was done in Microsoft Excel, and pandas functions such as describe, info and head were used to understand the structure of the data. Doing this brought some quality and tidiness issues in the datasets to light. Some of the ones I saw were:

· Quality Issues

1. Dataframe 1: Timestamp column is not of the correct datatype.

2. Dataframe 1: Some rating denominators have wrong values.

3. Dataframe 1: Stop words are being wrongly interpreted as dog names.

4. Dataframe 1: Some dogs have 2 dog stages.

5. Dataframe 1: Some observations are retweets and replies. Only tweets are needed.

6. Dataframe 2: Some images are not pictures of dogs.

7. Dataframe 2: There is a lower/upper case consistency issue concerning the p1, p2 and p3 columns.

8. Dataframe 3: The id column should have the same name as the tweet_id column in the other dataframes.

· Tidiness issues

1. Dataframe 1: The doggo, floofer, pupper and puppo columns should be in one column.

2. Dataframe 2: The p1, p2 and p3 columns should be merged to get the most likely breed of dog.
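A few of the issues above can also be surfaced programmatically. This is a sketch on made-up rows, assuming column names such as rating_denominator and name; the checks are illustrative rather than the exact ones used in the project.

```python
import pandas as pd

# Made-up rows standing in for the first dataframe.
archive = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "rating_denominator": [10, 10, 11],
    "name": ["Olly", "a", "Lucy"],
})

archive.info()             # dtypes and non-null counts; timestamp is text, not datetime
print(archive.describe())  # summary statistics for the numeric columns
print(archive.head())      # first rows, for a quick visual check

# Quality issue 2: denominators other than 10.
odd_denominators = archive[archive["rating_denominator"] != 10]

# Quality issue 3: all-lowercase "names" are likely ordinary words, not names.
lowercase_names = archive[archive["name"].str.islower()]

print(odd_denominators)
print(lowercase_names)
```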

Cleaning

Copies of the original datasets were created, and the copies were cleaned with respect to the quality and tidiness issues raised.
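As a sketch of working on copies, the snippet below fixes three of the quality issues (timestamp datatype, retweets and replies, and the id column name). The data is made up, and the retweeted_status_id and in_reply_to_status_id column names are assumptions about how retweets and replies are marked.

```python
import pandas as pd

# Made-up originals.
archive = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "retweeted_status_id": [None, 555.0, None],    # assumed column name
    "in_reply_to_status_id": [None, None, 777.0],  # assumed column name
})
api_df = pd.DataFrame(
    {"id": [1001], "retweet_count": [8000], "favorite_count": [35000]}
)

# Work on copies so the originals stay untouched.
archive_clean = archive.copy()
api_clean = api_df.copy()

# Quality issue 1: convert timestamp from text to datetime.
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Quality issue 5: keep only original tweets (no retweets, no replies).
is_original = (
    archive_clean["retweeted_status_id"].isna()
    & archive_clean["in_reply_to_status_id"].isna()
)
archive_clean = archive_clean[is_original].drop(
    columns=["retweeted_status_id", "in_reply_to_status_id"]
)

# Quality issue 8: give the id column the same name as in the other dataframes.
api_clean = api_clean.rename(columns={"id": "tweet_id"})
```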

For example, the first tidiness issue (one variable spread across several columns) was solved by melting the stage columns into a single column.

Melting the doggo, pupper, puppo and floofer stages into one column
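One way to do this melt, sketched on made-up rows. It assumes a missing stage is stored as the string "None" and joins multiple stages with a comma; the dog_stage column name is hypothetical.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; tweet 1004 has two stages, tweet 1003 has none.
archive = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003, 1004],
    "doggo": ["doggo", "None", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["None", "pupper", "None", "pupper"],
    "puppo": ["None", "None", "None", "None"],
})

# Wide to long: one row per (tweet, stage column).
long_form = archive.melt(
    id_vars="tweet_id", value_vars=stage_cols, value_name="stage"
)

# Drop the placeholder rows, then collapse back to one row per tweet.
long_form = long_form[long_form["stage"] != "None"]
stages = (
    long_form.groupby("tweet_id")["stage"]
    .agg(", ".join)
    .rename("dog_stage")
    .reset_index()
)

# Replace the four stage columns with the single dog_stage column.
tidy = archive.drop(columns=stage_cols).merge(stages, on="tweet_id", how="left")
print(tidy)
```

The left merge keeps tweets with no stage at all; their dog_stage is left missing rather than being dropped.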

Next came the 6th quality issue: some of the jpg_url entries in the image prediction dataset are not pictures of dogs. We have to remove those observations.
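A possible filter for this, assuming each prediction column has a boolean companion (p1_dog, p2_dog, p3_dog) saying whether the prediction is a dog breed. Those column names and the rows below are assumptions for illustration.

```python
import pandas as pd

# Made-up predictions; tweet 1002 has no dog among its three predictions.
preds = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_dog": [True, False, False],
    "p2": ["labrador_retriever", "toilet_tissue", "pug"],
    "p2_dog": [True, False, True],
    "p3": ["kuvasz", "bath_towel", "bull_mastiff"],
    "p3_dog": [True, False, True],
})

# Keep a row if at least one of the three predictions is a dog.
is_dog = preds[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
dogs_only = preds[is_dog].reset_index(drop=True)
print(dogs_only)
```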

Then we get the most likely breeds of the dogs. As seen below, the golden_retriever is the most popular dog breed.
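A sketch of picking the most likely breed: take p1 if it is a dog, otherwise p2, otherwise p3, then lowercase the result to deal with the case inconsistency. The p1_dog, p2_dog and p3_dog columns and the breed column are assumed names, and the data is made up.

```python
import pandas as pd

# Made-up predictions with inconsistent letter case.
preds = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "p1": ["golden_retriever", "seat_belt", "pug"],
    "p1_dog": [True, False, True],
    "p2": ["Labrador_retriever", "Golden_retriever", "bull_mastiff"],
    "p2_dog": [True, True, True],
    "p3": ["kuvasz", "toy_poodle", "French_bulldog"],
    "p3_dog": [True, True, True],
})

# First prediction that is flagged as a dog; missing if none is.
breed = preds["p1"].where(
    preds["p1_dog"],
    preds["p2"].where(preds["p2_dog"], preds["p3"].where(preds["p3_dog"])),
)

# Normalise case so "Golden_retriever" and "golden_retriever" count together.
preds["breed"] = breed.str.lower()

breed_counts = preds["breed"].value_counts()
print(breed_counts)
```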

Checking the names column of the first dataframe, I noticed that some dogs had wrong names. Some of the wrong names were [a, an, quite, some]. These are obviously not real names and had to be cleaned. After cleaning, Lucy and Charlie are the most popular dog names.
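One way to clean these, sketched on made-up names. It relies on the assumption that every wrongly extracted name starts with a lowercase letter, which holds for the examples above but should be checked against the real data.

```python
import pandas as pd

# Made-up name column mixing real names and wrongly extracted words.
names = pd.Series(["Lucy", "a", "Charlie", "an", "quite", "Lucy", "some"])

# Flag entries that start with a lowercase letter; na=False treats missing as "keep".
is_not_a_name = names.str.match(r"[a-z]", na=False)

# mask() replaces the flagged entries with missing values.
clean_names = names.mask(is_not_a_name)

# value_counts() ignores missing values by default.
name_counts = clean_names.value_counts()
print(name_counts)
```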

Storing

After cleaning, the 3 datasets were merged using tweet_id as the common column and the merged file was saved in a csv file.
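The merge and save could be sketched as follows. The three small frames are made up, the inner join is an assumption (it keeps only tweets present in all three datasets), and the in-memory buffer stands in for a csv path.

```python
import io
from functools import reduce

import pandas as pd

# Made-up cleaned frames sharing the tweet_id column.
archive_clean = pd.DataFrame(
    {"tweet_id": [1001, 1002, 1003], "rating_numerator": [13, 12, 11]}
)
preds_clean = pd.DataFrame(
    {"tweet_id": [1001, 1002], "breed": ["golden_retriever", "pug"]}
)
api_clean = pd.DataFrame({
    "tweet_id": [1001, 1002, 1003],
    "retweet_count": [8000, 6000, 4000],
    "favorite_count": [35000, 30000, 20000],
})

# Merge the frames one after another on the common column.
master = reduce(
    lambda left, right: left.merge(right, on="tweet_id", how="inner"),
    [archive_clean, preds_clean, api_clean],
)

# Save without the index; pass a file path here in a real project.
out = io.StringIO()
master.to_csv(out, index=False)
print(out.getvalue())
```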

Analyzing and Visualization

The correlation between retweet count and favorite count is positive, which is not surprising. The more likes a Twitter post has, the more retweets it tends to have.

On the other hand, perhaps surprisingly, the rating numerator did not correlate strongly with retweet count.
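Both correlations can be computed with pandas. The numbers below are made up purely to show the calls and do not reproduce the article's results.

```python
import pandas as pd

# Made-up values; only the method calls matter here.
master = pd.DataFrame({
    "retweet_count": [100, 200, 300, 400, 500],
    "favorite_count": [400, 900, 1100, 1700, 2000],
    "rating_numerator": [12, 10, 13, 11, 12],
})

# Pearson correlation between two columns.
corr_rt_fav = master["retweet_count"].corr(master["favorite_count"])
corr_rating_rt = master["rating_numerator"].corr(master["retweet_count"])

# Full pairwise correlation matrix for all numeric columns.
print(master.corr())
```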

These and many more insights were drawn from the dataset. One of them was the breed of dog that tends to get the most likes on average. Without plotting the data, one might guess the golden retriever, but the plot reveals that it is instead the bedlington_terrier.
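The per-breed averages behind such a plot can be computed with a groupby. The rows are made up and the breed column name is an assumption; the values were chosen only so the example has a clear ordering.

```python
import pandas as pd

# Made-up merged data.
master = pd.DataFrame({
    "breed": ["golden_retriever", "golden_retriever", "bedlington_terrier", "pug"],
    "retweet_count": [3000, 3500, 8000, 1500],
    "favorite_count": [10000, 12000, 25000, 5000],
})

# Average retweets and likes per breed.
breed_means = master.groupby("breed")[["retweet_count", "favorite_count"]].mean()

# Ten breeds with the highest average likes.
top10 = breed_means.sort_values("favorite_count", ascending=False).head(10)
print(top10)
```

If matplotlib is installed, top10["favorite_count"].plot(kind="barh") draws a horizontal bar chart from this result. A popular breed with many tweets can still have a lower average than a rare breed with a few well-liked tweets, which is why the average and the count tell different stories.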

The average of each breed’s retweets and likes

Top 10 average likes for each dog in WeRateDogs

Written by Obi-Okonkwo Chisom
