library(tidyverse)   # loads dplyr, ggplot2, tidyr, and friends
library(magrittr)    # for the %$% exposition pipe used below
library(skimr)
library(readxl)
library(rpart)
library(rpart.plot)
library(formattable)
library(rattle)
library(knitr)
jobs <- read_excel("jobs_2021.xlsx", col_names = TRUE)

Overview

This project serves as a guide to exploratory data analysis (EDA) of a dataset of data-related careers. It explores my own career interests, with the intent of showing readers how to run a similar analysis based on their personal career preferences.

Preliminary Data Prep

The dataset used for this project was collected by Kaggle user Nikhil Bhathi, and was scraped from Glassdoor.com.

  • The original dataset from Kaggle had 742 observations.
  • 4 rows were removed because they did not provide enough information about the company in the listing.
  • Unnecessary or redundant columns, like company age, were removed.
  • Some listings were missing the year the company was founded, its industry, and its sector. These entries were manually corrected in Excel.
  • Added a column called timezone using Excel, which maps state names to timezones via a method given here.
  • Added a column called value_of_dollar that gives the value of a dollar based on the state of the listing.
  • Added a column called median_rent that gives the median monthly rent based on the state of the listing.
  • The last two columns were added in Excel using VLOOKUP and a table provided here; an equivalent R approach is sketched below.
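For readers who prefer to stay in R, the VLOOKUP step could be replaced with a join. A minimal sketch, assuming a hypothetical workbook state_lookup.xlsx whose columns are state, value_of_dollar, and median_rent:

# Sketch: VLOOKUP-style enrichment via a join (state_lookup.xlsx is hypothetical)
state_lookup <- read_excel("state_lookup.xlsx")  # columns: state, value_of_dollar, median_rent
jobs <- left_join(jobs, state_lookup, by = "state")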

Column Titles Explained

index: An index for ease of reference
job_title: Job title provided with the listing
role_type: Simplified job title
company_name: Name of the company
rating: Rating of the company at time of scraping
timezone: Timezone of state of listing (majority ruling for states with multiple timezones)
city: City of job listing
state: State of job listing
median_rent: Median rent in the state of the job listing
value_of_dollar: Value of a dollar based on state of listing
location: City, State of listing
hq_location: City, State of headquarters of company of listing
size: Size of the company of the listing
founded: Year the company of the listing was founded
ownership: Ownership type of the company of the listing
industry: Industry of the company of the listing
sector: Sector of the industry of the listing
revenue: Total revenue of the company per year
lower_salary: Lower end of the Glassdoor estimate of salary for position
upper_salary: Upper end of the Glassdoor estimate of salary for position
avg_salary: Average of upper and lower salary for position
seniority_by_title: Level of seniority determined by job title
degree_req: Degree requirement by listing
python: 1 if Python skill is required, 0 otherwise
spark: 1 if Spark skill is required, 0 otherwise
aws: 1 if AWS skill is required, 0 otherwise
excel: 1 if Excel skill is required, 0 otherwise
sql: 1 if SQL skill is required, 0 otherwise
sas: 1 if SAS skill is required, 0 otherwise
keras: 1 if Keras skill is required, 0 otherwise
pytorch: 1 if PyTorch skill is required, 0 otherwise
scikit: 1 if scikit-learn skill is required, 0 otherwise
tensor: 1 if TensorFlow skill is required, 0 otherwise
hadoop: 1 if Hadoop skill is required, 0 otherwise
tableau: 1 if Tableau skill is required, 0 otherwise
bi: 1 if Power BI skill is required, 0 otherwise
flink: 1 if Flink skill is required, 0 otherwise
mongo: 1 if MongoDB skill is required, 0 otherwise
google_an: 1 if Google Analytics skill is required, 0 otherwise

Skim

Let’s skim through our data to see which data types the columns were imported as.

skim(jobs)
Data summary
Name jobs
Number of rows 738
Number of columns 39
_______________________
Column type frequency:
character 15
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
job_title 0 1 9 98 0 263 0
role_type 0 1 7 16 0 9 0
company_name 0 1 2 51 0 340 0
timezone 0 1 7 8 0 4 0
city 0 1 4 29 0 195 0
state 0 1 2 2 0 37 0
location 0 1 8 33 0 200 0
hq_location 0 1 8 26 0 196 0
size 0 1 6 12 0 8 0
ownership 0 1 8 30 0 9 0
industry 0 1 6 40 0 59 0
sector 0 1 5 34 0 24 0
revenue 0 1 18 32 0 13 0
seniority_by_title 0 1 2 2 0 3 0
degree_req 0 1 1 2 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1 369.50 213.19 1.00 185.25 369.50 553.75 738.00 ▇▇▇▇▇
rating 0 1 3.63 0.76 -1.00 3.30 3.70 4.00 5.00 ▁▁▁▇▅
median_rent 0 1 1768.20 548.27 1047.00 1253.00 1590.00 2252.00 2518.00 ▇▃▂▁▇
value_of_dollar 0 1 0.99 0.10 0.85 0.91 0.99 1.10 1.14 ▆▇▂▆▇
founded 0 1 1969.71 54.28 1744.00 1958.00 1992.00 2007.00 2019.00 ▁▁▁▂▇
lower_salary 0 1 74.61 30.94 15.00 52.00 69.00 91.00 202.00 ▅▇▃▁▁
upper_salary 0 1 128.16 45.20 16.00 96.00 124.00 154.75 306.00 ▂▇▆▂▁
avg_salary 0 1 101.39 37.53 15.50 73.50 97.25 122.50 254.00 ▂▇▅▁▁
python 0 1 0.53 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
spark 0 1 0.22 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
aws 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
excel 0 1 0.52 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
sql 0 1 0.51 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
sas 0 1 0.09 0.29 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
keras 0 1 0.04 0.19 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
pytorch 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
scikit 0 1 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
tensor 0 1 0.10 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hadoop 0 1 0.17 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
tableau 0 1 0.20 0.40 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
bi 0 1 0.08 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
flink 0 1 0.01 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
mongo 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
google_an 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁

There are a number of character columns that should be changed to factors.

job_title has 263 unique values, which is too many to make the column a factor. However, role_type has only 9 unique values, a reasonable number for a factor. The columns we should change to factors are:

  • role_type
  • state
  • size
  • ownership
  • industry
  • sector
  • revenue
  • seniority_by_title
  • degree_req
# Change the columns listed above from character to factor type
# (size and revenue are ordinal, so they are converted to ordered factors below)
jobs$role_type <- as.factor(jobs$role_type)
jobs$state <- as.factor(jobs$state)
jobs$ownership <- as.factor(jobs$ownership)
jobs$industry <- as.factor(jobs$industry)
jobs$sector <- as.factor(jobs$sector)
jobs$seniority_by_title <- as.factor(jobs$seniority_by_title)
jobs$degree_req <- as.factor(jobs$degree_req)
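The same conversion can be written more compactly; a sketch of the equivalent using dplyr’s across():

# Sketch: identical conversion in a single mutate()
jobs <- jobs %>%
  mutate(across(c(role_type, state, ownership, industry,
                  sector, seniority_by_title, degree_req), as.factor))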

Since size and revenue are ordinal variables, we should also make sure to set them with the correct ordering.

Let’s check what their levels are:

unique(jobs$size)
## [1] "501 - 1000"   "10000+"       "1001 - 5000"  "51 - 200"     "201 - 500"   
## [6] "5001 - 10000" "unknown"      "1 to 50"
unique(jobs$revenue)
##  [1] "$50 to $100 million (USD)"        "$2 to $5 billion (USD)"          
##  [3] "$100 to $500 million (USD)"       "$500 million to $1 billion (USD)"
##  [5] "Unknown / Non-Applicable"         "$1 to $2 billion (USD)"          
##  [7] "$25 to $50 million (USD)"         "$10+ billion (USD)"              
##  [9] "$1 to $5 million (USD)"           "$10 to $25 million (USD)"        
## [11] "$5 to $10 billion (USD)"          "Less than $1 million (USD)"      
## [13] "$5 to $10 million (USD)"

We also want to order timezone from west to east. Its levels are:

unique(jobs$timezone)
## [1] "Mountain" "Eastern"  "Pacific"  "Central"
jobs$size <- factor(jobs$size,
                    levels = c("unknown", "1 to 50", "51 - 200", "201 - 500", "501 - 1000",
                               "1001 - 5000", "5001 - 10000", "10000+"),
                    ordered = TRUE)
jobs$revenue <- factor(jobs$revenue,
                       levels = c("Unknown / Non-Applicable", "Less than $1 million (USD)",
                                  "$1 to $5 million (USD)", "$5 to $10 million (USD)",
                                  "$10 to $25 million (USD)", "$25 to $50 million (USD)",
                                  "$50 to $100 million (USD)", "$100 to $500 million (USD)",
                                  "$500 million to $1 billion (USD)", "$1 to $2 billion (USD)",
                                  "$2 to $5 billion (USD)", "$5 to $10 billion (USD)",
                                  "$10+ billion (USD)"),
                       ordered = TRUE)
jobs$timezone <- factor(jobs$timezone,
                        levels = c("Pacific", "Mountain", "Central", "Eastern"),
                        ordered = TRUE)
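Setting ordered = TRUE matters because comparisons and sorting now follow the level order we set rather than alphabetical order. A quick sanity check:

# Ordered factors compare by level position, not alphabetically
levels(jobs$timezone)
## [1] "Pacific"  "Mountain" "Central"  "Eastern"
factor("Pacific", levels = levels(jobs$timezone), ordered = TRUE) < "Eastern"
## [1] TRUE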

Now that we have made these adjustments, it’s time to start digging into some exploratory data analysis!

EDA

We want to answer some questions about the dataset. These questions include:

  1. What is the overall distribution of salaries within the dataset?
  2. Can we standardize salaries for comparison of value?
  3. What are the distributions of salaries by role_type, timezone, state, and sector?
  4. What are skill requirements by role_type overall and by individual timezone and state?

Overall distribution of salaries

What is the overall distribution of the salaries?

(avg_sal_hist <- ggplot(jobs, aes(x = avg_salary)) 
  + geom_histogram(bins = 50, color = 1, fill = "olivedrab3") 
  + scale_x_continuous(labels = scales::comma)
  + labs(x = "Average Salary", y = "Number of Listings"))

The distribution of salaries is right skewed, with outliers toward the high end of the salary spectrum. The values range from about $15k to $254k, and the median falls just under $100k.

Here are some summary statistics for avg_salary.

summary(jobs$avg_salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.50   73.50   97.25  101.39  122.50  254.00

Analysis by Role Type

Faceted histograms and boxplots can be used to show the distribution of avg_salary by role_type.

Histograms of Salaries by Role

# Histograms of average salary faceted by role type
histos_salary_title <- ggplot(jobs, aes(x = avg_salary, fill = role_type)) + 
                              geom_histogram(binwidth = 10, color = 2) + 
                              labs(x = "Avg Salary (k)", y = "Listings", 
                                   title = "Listings by Role") + 
                              facet_wrap(~role_type, nrow = 3) + 
                              theme(legend.position="none")
histos_salary_title

Some role types have far fewer listings than others. From the histograms, it appears that data scientist roles have both the highest number of listings and the highest upper-end salaries. Director and data modeler roles have the fewest listings.

Listing Count by Role

How many listings are there for each position type?

groupby_role <- group_by(jobs,role_type) 

groupby_role_count <- groupby_role %>% 
                        summarize(Count=n(), Proportion = round(n()/nrow(jobs),2)) %>% 
                        arrange(desc(Count))
kable(groupby_role_count,"simple")
role_type Count Proportion
data scientist 310 0.42
other scientist 143 0.19
data engineer 120 0.16
analyst 109 0.15
ml engineer 22 0.03
ds proj. manager 16 0.02
other data prof. 8 0.01
data modeler 5 0.01
director 5 0.01

At 310 listings, data scientist roles account for about 42% of the dataset, while directors, data modelers, and other data professionals each make up only ~1% of the set.

Average Salary by Role

groupby_role_avg_salary <- groupby_role %>% 
                        summarize(Avg_SalaryK=round(mean(avg_salary),2)) %>% 
                        arrange(desc(Avg_SalaryK))
kable(groupby_role_avg_salary, "simple")
role_type Avg_SalaryK
director 130.80
ml engineer 126.43
data scientist 118.18
data engineer 105.35
other data prof. 91.06
other scientist 88.06
data modeler 77.50
ds proj. manager 73.22
analyst 66.34

Boxplots of Salaries by Role

Using boxplots we can visualize the 25th, 50th, and 75th percentiles of our data for each role type, along with their outliers.

ggplot(data=jobs, aes(fill=role_type)) +
  geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
                y=avg_salary)) + 
                coord_flip() +
                labs(x = "Role Type", 
                      y = "Average Salary", 
                      title = "Boxplots of Average Salary by Role")+
                              theme(legend.position="none")

Now we are starting to see some points of interest in our data.

Director roles have the highest mean salary, but also the fewest data points (along with data modelers).

Roles with more data points tend to have more outliers, with the highest avg_salary outliers falling in the data scientist category. Data scientists also have an outlier with the lowest salary.

I’m curious about what that outlier is, so let’s find it.

Locating an Outlier in the Data

The outlier is the only observation with avg_salary under 25k, so we can use a filter to locate this observation.

outlier <-groupby_role %>% filter(avg_salary<25)
kable(outlier[,c(1,2,4,8,19,20,21)],"simple")
index job_title company_name state lower_salary upper_salary avg_salary
239 Data Scientist NPD NY 15 16 15.5

Ah, so this is a data scientist position at NPD. Looking at Glassdoor.com, I found that this outlier must be an error: a data scientist position at NPD is listed at $128-204k per year. The scraper may have mistakenly grabbed the cash bonus values instead of the salary values.

Let’s fix this value:

jobs[239, "lower_salary"] = 128
jobs[239, "upper_salary"] = 204
# Take the average of both these values for the "avg_salary" column
jobs[239, "avg_salary"] = (jobs[239, "lower_salary"] + jobs[239, "upper_salary"])/2
kable(jobs[239, c(1,2,4,8,19,20,21)],"simple")
index job_title company_name state lower_salary upper_salary avg_salary
239 Data Scientist NPD NY 128 204 166
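As an aside, the same fix could be written as a single keyed update; a sketch using dplyr’s rows_update() (available in dplyr 1.0+):

# Sketch: equivalent fix as a keyed update on the index column
jobs <- rows_update(jobs,
                    tibble(index = 239, lower_salary = 128, upper_salary = 204,
                           avg_salary = (128 + 204) / 2),
                    by = "index")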

Skills Most Required by Role

From the skim output above, I immediately noticed that certain skills are required more often than others. Let’s visualize this with more clarity.

For each role type, we would like to look at the percentage of job listings that require certain technical skills. Our dataset covers the requirement of 16 different technical tools. In the ungrouped data, requirement is denoted by an indicator variable, where 1 means the skill is required and 0 means it is not.

By taking the mean of each skill column, we get the proportion of listings that require that skill. Let’s create a table that shows which skills are most required for each simplified job title.
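The key fact we rely on: the mean of a 0/1 indicator equals the proportion of 1s. A one-line check:

mean(c(1, 0, 1, 1))
## [1] 0.75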

We already grouped the jobs dataframe by role_type above (groupby_role), so we can summarize that grouped tbl by the mean of each tech skill column.

Role Skill Requirements
skill_importance <- groupby_role %>% 
                      summarize(Python = round(mean(python), 2),
                                Spark = round(mean(spark),2),
                                AWS = round(mean(aws),2),
                                Excel = round(mean(excel),2),
                                SQL = round(mean(sql),2),
                                SAS = round(mean(sas),2),
                                Keras = round(mean(keras),2),
                                Pytorch = round(mean(pytorch),2),
                                Scikit = round(mean(scikit),2),
                                Tensor = round(mean(tensor),2),
                                Hadoop = round(mean(hadoop),2),
                                Tableau = round(mean(tableau),2),
                                BI = round(mean(bi),2),
                                Flink = round(mean(flink),2),
                                Mongo = round(mean(mongo),2),
                                GoogleAn = round(mean(google_an),2))
kable(skill_importance, "simple")
role_type Python Spark AWS Excel SQL SAS Keras Pytorch Scikit Tensor Hadoop Tableau BI Flink Mongo GoogleAn
analyst 0.33 0.06 0.09 0.74 0.72 0.12 0.00 0.00 0.00 0.00 0.03 0.72 0.17 0.00 0.03 0.10
data engineer 0.65 0.57 0.50 0.45 0.73 0.01 0.00 0.00 0.00 0.02 0.42 0.73 0.03 0.05 0.11 0.00
data modeler 0.40 0.20 0.40 0.80 0.80 0.00 0.00 0.00 0.00 0.00 0.20 0.80 0.40 0.00 0.00 0.00
data scientist 0.76 0.26 0.25 0.50 0.56 0.16 0.09 0.11 0.15 0.19 0.19 0.56 0.09 0.01 0.06 0.01
director 0.00 0.00 0.20 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ds proj. manager 0.25 0.00 0.12 0.69 0.69 0.00 0.00 0.00 0.00 0.00 0.00 0.69 0.12 0.00 0.00 0.00
ml engineer 0.82 0.18 0.27 0.32 0.59 0.00 0.00 0.23 0.32 0.41 0.18 0.59 0.00 0.00 0.00 0.00
other data prof. 0.25 0.50 0.25 1.00 0.62 0.00 0.00 0.00 0.00 0.00 0.62 0.62 0.25 0.00 0.25 0.00
other scientist 0.08 0.00 0.10 0.45 0.02 0.01 0.00 0.01 0.00 0.01 0.00 0.02 0.00 0.00 0.00 0.00
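As an aside, the sixteen near-identical summarize lines could be collapsed with across(); a sketch, with the caveat that the column names stay lowercase rather than the display names used above:

# Sketch: the same per-role skill proportions via across()
skill_importance_alt <- groupby_role %>%
  summarize(across(c(python, spark, aws, excel, sql, sas, keras, pytorch, scikit,
                     tensor, hadoop, tableau, bi, flink, mongo, google_an),
                   ~ round(mean(.x), 2)))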

Great, but we want to make it easier to see at a glance which skills matter most across roles. We can rearrange the table so that the skill columns appear in descending order of their column means.

First we need to get the means of the second through 17th columns.

skill_means <- colMeans(skill_importance[2:17])
skill_means
##      Python       Spark         AWS       Excel         SQL         SAS 
## 0.393333333 0.196666667 0.242222222 0.594444444 0.525555556 0.033333333 
##       Keras     Pytorch      Scikit      Tensor      Hadoop     Tableau 
## 0.010000000 0.038888889 0.052222222 0.070000000 0.182222222 0.525555556 
##          BI       Flink       Mongo    GoogleAn 
## 0.117777778 0.006666667 0.050000000 0.012222222

Now we find the order of these means from highest to lowest and keep track of it so that we can rearrange our tibble. Because the ordering excludes the first column, we add 1 to each position returned by order() and then prepend 1 so that the role_type column keeps its original position.

skill_order <- rev(order(skill_means)+1)
skill_order <- append(1, skill_order)
skill_order
##  [1]  1  5 13  6  2  4  3 12 14 11 10 16  9  7 17  8 15
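To see why this works, try it on a toy table whose first column is role_type and whose skill columns a, b, c have means 0.2, 0.9, and 0.5:

# Toy check of the reordering logic (table columns: 1 = role_type, 2 = a, 3 = b, 4 = c)
m <- c(a = 0.2, b = 0.9, c = 0.5)
rev(order(m) + 1)
## [1] 3 4 2
append(1, rev(order(m) + 1))
## [1] 1 3 4 2

Column 3 (b, the highest mean) comes first, and role_type stays in front. The same result comes from order(m, decreasing = TRUE) + 1, which skips the rev().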

Now use these order values to rearrange the order of our columns by their column means.

Role Skill Requirements (Ordered)
skill_importance2 <- skill_importance[skill_order]
kable(skill_importance2, "simple")
role_type Excel Tableau SQL Python AWS Spark Hadoop BI Tensor Scikit Mongo Pytorch SAS GoogleAn Keras Flink
analyst 0.74 0.72 0.72 0.33 0.09 0.06 0.03 0.17 0.00 0.00 0.03 0.00 0.12 0.10 0.00 0.00
data engineer 0.45 0.73 0.73 0.65 0.50 0.57 0.42 0.03 0.02 0.00 0.11 0.00 0.01 0.00 0.00 0.05
data modeler 0.80 0.80 0.80 0.40 0.40 0.20 0.20 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
data scientist 0.50 0.56 0.56 0.76 0.25 0.26 0.19 0.09 0.19 0.15 0.06 0.11 0.16 0.01 0.09 0.01
director 0.40 0.00 0.00 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ds proj. manager 0.69 0.69 0.69 0.25 0.12 0.00 0.00 0.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ml engineer 0.32 0.59 0.59 0.82 0.27 0.18 0.18 0.00 0.41 0.32 0.00 0.23 0.00 0.00 0.00 0.00
other data prof. 1.00 0.62 0.62 0.25 0.25 0.50 0.62 0.25 0.00 0.00 0.25 0.00 0.00 0.00 0.00 0.00
other scientist 0.45 0.02 0.02 0.08 0.10 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.01 0.00 0.00 0.00

Great! Now we can clearly see that for most role types, Excel is required most often, followed by Tableau, SQL, and Python.

It’s important to remember that these orderings may differ from each skill’s overall importance: each role type has a different listing count, so a mean of per-role means is not the same as the overall mean. Still, the tibble is easier to read when ordered this way.

But let’s also chart each skill’s requirement rate within each role to compare.

skill_count <- groupby_role %>% 
                              summarise(
                                Excel = mean(excel),
                                Tableau = mean(tableau),
                                SQL = mean(sql),
                                Python = mean(python),
                                AWS = mean(aws),
                                Spark = mean(spark),
                                Hadoop = mean(hadoop),
                                BI = mean(bi),
                                Tensor = mean(tensor),
                                Scikit = mean(scikit),
                                Mongo = mean(mongo),
                                Pytorch = mean(pytorch),
                                SAS = mean(sas),
                                GoogleAn = mean(google_an),
                                Keras = mean(keras),
                                Flink = mean(flink)) %>% 
                          gather( "category", "mean", -role_type)

ggplot(skill_count, aes(category, mean)) + 
  geom_col(aes(fill = category), position = "dodge") + facet_wrap(~role_type, ncol = 3) + 
                              theme(legend.position="none") + 
                          labs(x = "Skill", 
                                y = "Mean Required by Role", 
                                title = "Skill Requirements by Role") + 
                                theme(axis.text.x = element_text(angle = 45, 
                                                                 hjust = 1, size = 8))

  • Python is required most for data scientist and machine learning engineer positions.
  • Excel is required most for other data professionals, data modelers, and analysts.
  • SQL is required most for data modelers, data engineers, and analysts.

By Timezone

Examining our dataset by timezone can help us uncover locational differences in roles and salaries.

by_timezone <- group_by(jobs, timezone)

Listing Count

listings_by_timezone_bar <- ggplot(jobs, aes(x = timezone, fill=timezone)) + 
                              geom_bar() +
                              labs(x = "Timezone", 
                                   y = "Listing Count", 
                                   title = "Listing Count by Timezone")
listings_by_timezone_bar

Boxplots of Salaries by Timezone

salary_by_timezone_box <- ggplot(jobs, aes(x = timezone, y = avg_salary, fill= timezone)) + geom_boxplot() +
                                 labs(x = "Timezone", 
                                   y = "Average Salary", 
                                   title = "Boxplots of Average Salary by Timezone")+
                              theme(legend.position="none")
salary_by_timezone_box

Histograms of Salaries by Timezone

# Histograms of salary faceted by timezone
histos_salary_timezone <- ggplot(jobs, aes(x = avg_salary, fill = timezone)) + 
                              geom_histogram(binwidth = 10, color = 1) + 
                              labs(x = "Average Salary") + facet_wrap(~timezone, nrow = 3) +
                              theme(legend.position="none")
histos_salary_timezone

Eastern Timezone

I’m interested in looking at jobs in the Eastern timezone, so I’m going to filter the dataset by timezone == "Eastern".

# Filter by_timezone by timezone="Eastern"
eastern_timezone_filter <- filter(jobs, timezone=="Eastern")

Listing Count by State in Eastern Timezone

Which states in the Eastern timezone have the highest listing counts?

count_by_state_eastern <- filter(group_by(jobs, state), timezone=="Eastern") %>% 
                    summarize(Count=n()) %>% 
                    arrange(desc(Count))
kable(count_by_state_eastern, "simple")
state Count
MA 103
NY 69
VA 41
MD 35
PA 33
NC 21
NJ 17
FL 16
OH 14
DC 11
IN 10
DE 6
GA 6
KY 6
MI 6
CT 5
RI 1
SC 1

And let’s again look at the boxplots of salaries by roles within the Eastern timezone.

eastern_salary_by_role_box <- ggplot(data=eastern_timezone_filter, aes(fill=role_type)) +
                          geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
                          y=avg_salary)) + coord_flip() + 
                          labs(x = "Role Type", 
                                y = "Average Salary", 
                                title = "Boxplots of Average Salary by Role in Eastern Timezone") +
                              theme(legend.position="none")
eastern_salary_by_role_box

Overall, the trend of these boxplots is the same as in “Boxplots of Average Salary by Role.”

Standardized Salaries

We have the variable value_of_dollar, so we can use it to standardize the salaries in the dataset and get a better idea of how far a salary goes in the listing’s state. We will make a new column called std_salary that multiplies avg_salary by value_of_dollar; for example, a $100k salary in a state where a dollar is worth $1.10 standardizes to $110k.

jobs$std_salary <- jobs$avg_salary*jobs$value_of_dollar
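A quick spot check that the new column behaves as expected (the selected columns are just for display):

# Peek at a few rows of the standardized salaries
head(jobs[, c("state", "avg_salary", "value_of_dollar", "std_salary")], 3)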

Looking at boxplots of std_salary by timezone:

std_salary_by_timezone_box <- ggplot(jobs, aes(x = timezone, y = std_salary, fill=timezone)) + geom_boxplot() +
                                 labs(x = "Timezone", 
                                   y = "Standardized Salary", 
                                   title = "Boxplots of Standardized Salary by Timezone")+
                              theme(legend.position="none")
std_salary_by_timezone_box

Here we see that the median salaries between the timezones are much closer together, which makes sense since they have all been standardized.

boxplots_std_sal_job_title <- ggplot(data=jobs, aes(fill=role_type)) +
                geom_boxplot(mapping = aes(x= reorder(role_type, std_salary, FUN= median),
                y=std_salary)) + 
                coord_flip() +
                labs(x = "Role Type", y = "Standardized Salary", 
                     title = "Boxplots of Standardized Salary by Role")+
                              theme(legend.position="none")
boxplots_std_sal_job_title

Standardized Salaries By State

Here’s a table of the top 10 states by un-standardized avg_salary:

# Same table as std_sal_by_state (below) but with unstandardized salaries.
by_state <- group_by(jobs, state)
avg_sal_by_state <- summarize(by_state, 
          avg_salary=round(mean(avg_salary),2))
avg_sal_by_state <- avg_sal_by_state %>%
                      arrange(desc(avg_salary))
kable(head(avg_sal_by_state, 10),"simple")
state avg_salary
CA 123.75
IL 116.66
DC 110.18
MA 107.50
NJ 104.56
MI 100.25
RI 100.00
NY 99.46
NC 98.45
MD 97.36

Compare that with the table of the top standardized salaries by state.

std_sal_by_state <- summarize(by_state, 
          std_salary=round(mean(std_salary),2))
std_sal_by_state <- std_sal_by_state %>%
                      arrange(desc(std_salary))
kable(head(std_sal_by_state, 10), "simple")
state std_salary
IL 130.66
NJ 117.11
KY 110.58
NY 109.41
MI 108.27
CA 107.66
NC 105.34
DC 103.57
IN 101.14
RI 101.00

Interesting! California slides down from 1st place to 6th. Kentucky moves up from 11th to 3rd. These, crudely put, are the top 10 states for getting the best value out of your salary.

count_by_state <- by_state %>% 
                    summarize(Count=n()) %>% 
                    arrange(desc(Count))
kable(count_by_state, "simple")
state Count
CA 151
MA 103
NY 69
VA 41
IL 40
MD 35
PA 33
TX 28
NC 21
WA 21
NJ 17
FL 16
OH 14
TN 13
CO 11
DC 11
IN 10
UT 10
WI 10
AZ 9
MO 9
AL 8
DE 6
GA 6
KY 6
MI 6
CT 5
IA 5
LA 4
NE 4
OR 4
KS 3
NM 3
ID 2
MN 2
RI 1
SC 1

North Carolina Jobs

I have heard that the job market is hot in North Carolina, so let’s check out what kind of jobs are listed there.

Listings

# Filter jobs by state="NC"
filter_by_state_nc <- filter(jobs, state=="NC")
kable(filter_by_state_nc[,c(1,2,4,21)])
index job_title company_name avg_salary
104 Data Engineer CapTech 95.5
118 Senior Data Engineer Red Ventures 134.0
250 Data Scientist Genworth 102.0
318 Senior Data Scientist Brighthouse Financial 124.5
356 Scientist - Analytical Services Reynolds American 99.5
362 Senior Scientist - Regulatory Submissions Reynolds American 117.5
377 Senior Data Engineer Genesys 112.5
411 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
413 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5
419 Senior Scientist - Biostatistician Reynolds American 80.5
448 Senior Data Scientist Brighthouse Financial 124.5
507 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
531 Scientist - Analytical Services Reynolds American 99.5
540 Senior Scientist - Regulatory Submissions Reynolds American 117.5
567 Senior Data Engineer Genesys 112.5
619 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
621 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5
633 Senior Scientist - Biostatistician Reynolds American 80.5
684 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
712 Scientist - Analytical Services Reynolds American 99.5
722 Senior Scientist - Regulatory Submissions Reynolds American 117.5

Hottest Industry in NC

What’s the ‘hottest’ industry in North Carolina based on our data?

filter_by_state_nc %$% industry %>% table %>% which.max %>% names
## [1] "Consumer Products Manufacturing"

Hottest Sector in NC

What’s the ‘hottest’ sector in North Carolina based on our data?

filter_by_state_nc %$% sector %>% table %>% which.max %>% names
## [1] "Manufacturing"

Hottest Role in NC

What’s the ‘hottest’ data role in North Carolina?

filter_by_state_nc %$% role_type %>% table %>% which.max %>% names
## [1] "other scientist"

Salary Distribution by Sector

What do the avg_salary distributions look like by sector in North Carolina?

nc_salary_by_sector_box <- ggplot(data=filter_by_state_nc, aes(fill=sector)) +
                          geom_boxplot(mapping = aes(x= reorder(sector, avg_salary, 
                                                                FUN= median),
                          y=avg_salary)) + coord_flip() + 
                          labs(x = "Sector", 
                                y = "Average Salary", 
                                title = "Boxplots of Average Salary by Sector in NC")+
                              theme(legend.position="none")
nc_salary_by_sector_box

Salary Distribution by Role

What do the avg_salary distributions look like by role_type in North Carolina?

nc_salary_by_role_box <- ggplot(data=filter_by_state_nc, aes(fill=role_type)) +
                          geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
                          y=avg_salary)) + coord_flip() + 
                          labs(x = "Role Type", 
                                y = "Average Salary", 
                                title = "Boxplots of Average Salary by Role in NC")+
                              theme(legend.position="none")
nc_salary_by_role_box

Skills Required

# What skills are required most in the roles in North Carolina?
skill_count_nc <- filter_by_state_nc %>% 
                  group_by(role_type) %>% 
                                    summarise(
                                      Excel = mean(excel),
                                      Tableau = mean(tableau),
                                      SQL = mean(sql),
                                      Python = mean(python),
                                      AWS = mean(aws),
                                      Spark = mean(spark),
                                      Hadoop = mean(hadoop),
                                      BI = mean(bi),
                                      Tensor = mean(tensor),
                                      Scikit = mean(scikit),
                                      Mongo = mean(mongo),
                                      Pytorch = mean(pytorch),
                                      SAS = mean(sas),
                                      GoogleAn = mean(google_an),
                                      Keras = mean(keras),
                                      Flink = mean(flink)) %>% 
                                    gather( "category", "mean", -role_type)

ggplot(skill_count_nc, aes(category, mean)) + 
  geom_col(aes(fill = category), position = "dodge") + facet_grid(.~role_type) + coord_flip() + 
                              theme(legend.position="none") + 
                          labs(x = "Skill", 
                                y = "Mean Required by Role", 
                                title = "Skill Requirements by Role in NC")

Between data engineers, data scientists and other scientists in NC:

  • A higher proportion of data scientist positions require knowledge of Python, Spark, and Excel than other roles.
  • A higher proportion of data engineering positions require knowledge of SQL, Hadoop, and AWS than other positions.
  • Of the skills tracked, other scientist listings only require Excel.

What are the other scientist jobs in North Carolina?

other_scientist_nc <- filter(jobs, state == "NC" & role_type=="other scientist") %>% 
                    arrange(desc(std_salary))
kable(other_scientist_nc[,c(1,2,4,21)], "simple")
index job_title company_name avg_salary
362 Senior Scientist - Regulatory Submissions Reynolds American 117.5
540 Senior Scientist - Regulatory Submissions Reynolds American 117.5
722 Senior Scientist - Regulatory Submissions Reynolds American 117.5
356 Scientist - Analytical Services Reynolds American 99.5
531 Scientist - Analytical Services Reynolds American 99.5
712 Scientist - Analytical Services Reynolds American 99.5
419 Senior Scientist - Biostatistician Reynolds American 80.5
633 Senior Scientist - Biostatistician Reynolds American 80.5
411 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
619 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
413 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5
621 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5

Oh wow! Most of these jobs are for a company called Reynolds American, which is a tobacco company. Positions for toxicologists do not seem like they belong in this dataset.

These jobs don’t look appealing to me. But there are two other role types that do look interesting: data engineer and data modeler. I want to look at these positions in the Eastern timezone and sort them by std_salary.

Eastern Timezone Data Engineers and Modelers

engineer_modeler_eastern <- filter(jobs, timezone == "Eastern" & 
                                     role_type %in% c("data engineer", "data modeler")) %>% 
  arrange(desc(std_salary))

kable(engineer_modeler_eastern[,c(1,2,4,21)],"simple")
index job_title company_name avg_salary
15 Data Engineer I Audible 146.0
118 Senior Data Engineer Red Ventures 134.0
340 Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence Takeda Pharmaceuticals 154.5
500 Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence Takeda Pharmaceuticals 154.5
673 Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence Takeda Pharmaceuticals 154.5
358 Sr. Data Engineer Moser Consulting 122.5
534 Sr. Data Engineer Moser Consulting 122.5
715 Sr. Data Engineer Moser Consulting 122.5
137 Data Engineer Gridiron IT 132.5
584 Big Data Engineer Peraton 118.5
460 Data Engineer GNY Insurance Companies 110.5
377 Senior Data Engineer Genesys 112.5
567 Senior Data Engineer Genesys 112.5
39 Data Engineer Lancer Insurance 106.0
121 Data Engineer Upside Business Travel 122.0
315 Data Engineer SpringML 109.0
443 Data Engineer SpringML 109.0
595 Data Engineer Maxar Technologies 105.5
104 Data Engineer CapTech 95.5
299 Big Data Engineer SpringML 90.5
566 Senior Data Engineer Novetta 97.0
351 Data Engineer MetroStar Systems 109.0
523 Data Engineer MetroStar Systems 109.0
703 Data Engineer MetroStar Systems 109.0
93 Data Modeler IT Concepts 106.5
295 Data Engineer ManTech 92.0
692 Data Architect / Data Modeler Medidata Solutions 86.5
436 Data Engineer Numeric, LLC 91.5
288 Data Engineer, Data Engineering and Artifical Intelligence Takeda Pharmaceuticals 102.5
507 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
684 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
84 Data Engineer Excella Consulting 87.0
41 Data Engineer Persivia 98.5
383 Data Engineer BRMi 85.0
580 Data Engineer BRMi 85.0
110 Data Engineer Sapphire Digital 78.0
625 Data Engineer Carilion Clinic 81.0
307 Data Engineer MassMutual 92.0
423 Data Engineer MassMutual 92.0
568 Data Modeler (Analytical Systems) MassMutual 91.5
367 IT - Data Engineer II Arbella Insurance 90.0
547 IT - Data Engineer II Arbella Insurance 90.0
728 IT - Data Engineer II Arbella Insurance 90.0
144 Data Engineer AVANADE 87.0
151 Data Engineer Cogo Labs 84.5
213 Data Engineer Cogo Labs 84.5
338 Data Engineer Guidepoint 64.5
497 Data Engineer Guidepoint 64.5
169 Data Engineer IZEA 71.5
256 Data Engineer IZEA 71.5
317 Associate Data Engineer EAB 65.5
442 Associate Data Engineer EAB 65.5
379 Data Modeler - Data Solutions Engineer Liberty Mutual Insurance 51.5
569 Data Modeler - Data Solutions Engineer Liberty Mutual Insurance 51.5

Now these jobs look interesting.

What kind of skills do they require?

Skills Required

# What skills are required most by data engineers and data modelers in the Eastern timezone?
filterby_eastern_tz <- filter(jobs, timezone == "Eastern")
skill_count_eastern_mod_eng <- filterby_eastern_tz %>% 
                                group_by(role_type) %>% 
                                filter(role_type %in% c("data modeler", "data engineer")) %>% 
                                    summarise(
                                      Excel = mean(excel),
                                      Tableau = mean(tableau),
                                      SQL = mean(sql),
                                      Python = mean(python),
                                      AWS = mean(aws),
                                      Spark = mean(spark),
                                      Hadoop = mean(hadoop),
                                      BI = mean(bi),
                                      Tensor = mean(tensor),
                                      Scikit = mean(scikit),
                                      Mongo = mean(mongo),
                                      Pytorch = mean(pytorch),
                                      SAS = mean(sas),
                                      GoogleAn = mean(google_an),
                                      Keras = mean(keras),
                                      Flink = mean(flink)) %>% 
                                    gather( "category", "mean", -role_type)

ggplot(skill_count_eastern_mod_eng, aes(category, mean)) + 
  geom_col(aes(fill = category), position = "dodge") + facet_grid(.~role_type) + coord_flip() + 
                              theme(legend.position="none") + 
                          labs(x = "Skill", 
                                y = "Mean Required by Role", 
                                title = "Skill Requirements of Eastern Data Engineers and Modelers")

Between data engineers and data modelers in the Eastern timezone:

  • A higher proportion of data engineering positions require knowledge of Python than data modeling positions.
  • A higher proportion of data modeling positions require knowledge of SQL and Excel than data engineering positions.

Closing Thoughts

It’s amazing how powerful a tool EDA is for aiding decision-making. We were able to ask questions of the dataset and get concrete, data-backed answers. Overall, we were able to:

  1. Determine the overall distribution of salaries within the dataset.
  2. Standardize salaries for comparison of value.
  3. Determine distributions of salaries by role_type, timezone, state, and sector.
  4. Determine skill requirements by role_type overall and by timezone and state.