library(tidyverse)   # loads dplyr, ggplot2, tidyr, and friends
library(magrittr)    # for the %$% exposition pipe used below
library(skimr)
library(readxl)
library(rpart)
library(rpart.plot)
library(formattable)
library(rattle)
library(knitr)
jobs <- read_excel("jobs_2021.xlsx", col_names = TRUE)

Overview

This project serves as a guide to exploratory data analysis (EDA) of a dataset of data-related careers. It explores my own career interests, with the intent of showing readers how to run a similar analysis based on their personal career preferences.

Preliminary Data Prep

The dataset used for this project was collected by Kaggle user Nikhil Bhathi, and was scraped from Glassdoor.com.

  • The original dataset from Kaggle had 742 observations.
  • 4 rows were removed because they did not provide enough information about the company in the listing.
  • Unnecessary or redundant columns, like company age, were removed.
  • Some listings were missing the year the company was founded, its industry, and its sector. These entries were manually corrected in Excel.
  • Added a column called timezone using Excel, which maps state names to timezones via a method given here.
  • Added a column called value_of_dollar that gives the value of a dollar based on the state of the listing.
  • Added a column called median_rent that gives the median monthly rent based on the state of the listing.
  • The last two columns were added in Excel using VLOOKUP and a table provided here; an equivalent R approach is sketched below.
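For readers who prefer to stay in R, the VLOOKUP step could be replaced with a join. A minimal sketch, assuming a hypothetical workbook state_lookup.xlsx whose columns are state, value_of_dollar, and median_rent:

# Sketch: VLOOKUP-style enrichment via a join (state_lookup.xlsx is hypothetical)
state_lookup <- read_excel("state_lookup.xlsx")  # columns: state, value_of_dollar, median_rent
jobs <- left_join(jobs, state_lookup, by = "state")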

Column Titles Explained

index: An index for ease of reference
job_title: Job title provided with the listing
role_type: Simplified job title
company_name: Name of the company
rating: Rating of the company at time of scraping
timezone: Timezone of state of listing (majority ruling for states with multiple timezones)
city: City of job listing
state: State of job listing
median_rent: Median rent in the state of the job listing
value_of_dollar: Value of a dollar based on state of listing
location: City, State of listing
hq_location: City, State of headquarters of company of listing
size: Size of the company of the listing
founded: Year the company of the listing was founded
ownership: Ownership type of the company of the listing
industry: Industry of the company of the listing
sector: Sector of the industry of the listing
revenue: Total revenue of the company per year
lower_salary: Lower end of the Glassdoor estimate of salary for position
upper_salary: Upper end of the Glassdoor estimate of salary for position
avg_salary: Average of upper and lower salary for position
seniority_by_title: Level of seniority determined by job title
degree_req: Degree requirement by listing
python: 1 if Python skill is required, 0 otherwise
spark: 1 if Spark skill is required, 0 otherwise
aws: 1 if AWS skill is required, 0 otherwise
excel: 1 if Excel skill is required, 0 otherwise
sql: 1 if SQL skill is required, 0 otherwise
sas: 1 if SAS skill is required, 0 otherwise
keras: 1 if Keras skill is required, 0 otherwise
pytorch: 1 if PyTorch skill is required, 0 otherwise
scikit: 1 if scikit-learn skill is required, 0 otherwise
tensor: 1 if TensorFlow skill is required, 0 otherwise
hadoop: 1 if Hadoop skill is required, 0 otherwise
tableau: 1 if Tableau skill is required, 0 otherwise
bi: 1 if Power BI skill is required, 0 otherwise
flink: 1 if Flink skill is required, 0 otherwise
mongo: 1 if MongoDB skill is required, 0 otherwise
google_an: 1 if Google Analytics skill is required, 0 otherwise

Skim

Let’s skim through our data to see which data types the columns were imported as.

skim(jobs)
Data summary
Name jobs
Number of rows 738
Number of columns 39
_______________________
Column type frequency:
character 15
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
job_title 0 1 9 98 0 263 0
role_type 0 1 7 16 0 9 0
company_name 0 1 2 51 0 340 0
timezone 0 1 7 8 0 4 0
city 0 1 4 29 0 195 0
state 0 1 2 2 0 37 0
location 0 1 8 33 0 200 0
hq_location 0 1 8 26 0 196 0
size 0 1 6 12 0 8 0
ownership 0 1 8 30 0 9 0
industry 0 1 6 40 0 59 0
sector 0 1 5 34 0 24 0
revenue 0 1 18 32 0 13 0
seniority_by_title 0 1 2 2 0 3 0
degree_req 0 1 1 2 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
index 0 1 369.50 213.19 1.00 185.25 369.50 553.75 738.00 ▇▇▇▇▇
rating 0 1 3.63 0.76 -1.00 3.30 3.70 4.00 5.00 ▁▁▁▇▅
median_rent 0 1 1768.20 548.27 1047.00 1253.00 1590.00 2252.00 2518.00 ▇▃▂▁▇
value_of_dollar 0 1 0.99 0.10 0.85 0.91 0.99 1.10 1.14 ▆▇▂▆▇
founded 0 1 1969.71 54.28 1744.00 1958.00 1992.00 2007.00 2019.00 ▁▁▁▂▇
lower_salary 0 1 74.61 30.94 15.00 52.00 69.00 91.00 202.00 ▅▇▃▁▁
upper_salary 0 1 128.16 45.20 16.00 96.00 124.00 154.75 306.00 ▂▇▆▂▁
avg_salary 0 1 101.39 37.53 15.50 73.50 97.25 122.50 254.00 ▂▇▅▁▁
python 0 1 0.53 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
spark 0 1 0.22 0.42 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
aws 0 1 0.24 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
excel 0 1 0.52 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
sql 0 1 0.51 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
sas 0 1 0.09 0.29 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
keras 0 1 0.04 0.19 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
pytorch 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
scikit 0 1 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
tensor 0 1 0.10 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
hadoop 0 1 0.17 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
tableau 0 1 0.20 0.40 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
bi 0 1 0.08 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
flink 0 1 0.01 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
mongo 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
google_an 0 1 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁

There are a number of character columns that should be changed to factors.

job_title has 263 unique values, which is too many to make the column a factor. However, role_type has only 9 unique values, a reasonable number for a factor. The columns we should change to factors are:

  • role_type
  • state
  • size
  • ownership
  • industry
  • sector
  • revenue
  • seniority_by_title
  • degree_req
# Change the columns listed above from character to factor type
# (size and revenue are ordinal, so they are converted to ordered factors below)
jobs$role_type <- as.factor(jobs$role_type)
jobs$state <- as.factor(jobs$state)
jobs$ownership <- as.factor(jobs$ownership)
jobs$industry <- as.factor(jobs$industry)
jobs$sector <- as.factor(jobs$sector)
jobs$seniority_by_title <- as.factor(jobs$seniority_by_title)
jobs$degree_req <- as.factor(jobs$degree_req)
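The same conversion can be written more compactly; a sketch of the equivalent using dplyr’s across():

# Sketch: identical conversion in a single mutate()
jobs <- jobs %>%
  mutate(across(c(role_type, state, ownership, industry,
                  sector, seniority_by_title, degree_req), as.factor))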

Since size and revenue are ordinal variables, we should also make sure to set them with the correct ordering.

Let’s check what their levels are:

unique(jobs$size)
## [1] "501 - 1000"   "10000+"       "1001 - 5000"  "51 - 200"     "201 - 500"   
## [6] "5001 - 10000" "unknown"      "1 to 50"
unique(jobs$revenue)
##  [1] "$50 to $100 million (USD)"        "$2 to $5 billion (USD)"          
##  [3] "$100 to $500 million (USD)"       "$500 million to $1 billion (USD)"
##  [5] "Unknown / Non-Applicable"         "$1 to $2 billion (USD)"          
##  [7] "$25 to $50 million (USD)"         "$10+ billion (USD)"              
##  [9] "$1 to $5 million (USD)"           "$10 to $25 million (USD)"        
## [11] "$5 to $10 billion (USD)"          "Less than $1 million (USD)"      
## [13] "$5 to $10 million (USD)"

We also want to order timezone from west to east. Its levels are:

unique(jobs$timezone)
## [1] "Mountain" "Eastern"  "Pacific"  "Central"
jobs$size <- factor(jobs$size,
                    levels = c("unknown", "1 to 50", "51 - 200", "201 - 500", "501 - 1000",
                               "1001 - 5000", "5001 - 10000", "10000+"),
                    ordered = TRUE)
jobs$revenue <- factor(jobs$revenue,
                       levels = c("Unknown / Non-Applicable", "Less than $1 million (USD)",
                                  "$1 to $5 million (USD)", "$5 to $10 million (USD)",
                                  "$10 to $25 million (USD)", "$25 to $50 million (USD)",
                                  "$50 to $100 million (USD)", "$100 to $500 million (USD)",
                                  "$500 million to $1 billion (USD)", "$1 to $2 billion (USD)",
                                  "$2 to $5 billion (USD)", "$5 to $10 billion (USD)",
                                  "$10+ billion (USD)"),
                       ordered = TRUE)
jobs$timezone <- factor(jobs$timezone,
                        levels = c("Pacific", "Mountain", "Central", "Eastern"),
                        ordered = TRUE)
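Setting ordered = TRUE matters because comparisons and sorting now follow the level order we set rather than alphabetical order. A quick sanity check:

# Ordered factors compare by level position, not alphabetically
levels(jobs$timezone)
## [1] "Pacific"  "Mountain" "Central"  "Eastern"
factor("Pacific", levels = levels(jobs$timezone), ordered = TRUE) < "Eastern"
## [1] TRUE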

Now that we have made these adjustments, it’s time to start digging into some exploratory data analysis!

EDA

We want to answer some questions about the dataset. These questions include:

  1. What is the overall distribution of salaries within the dataset?
  2. Can we standardize salaries for comparison of value?
  3. What are the distributions of salaries by role_type, timezone, state, and sector?
  4. What are skill requirements by role_type overall and by individual timezone and state?

Overall distribution of salaries

What is the overall distribution of the salaries?

(avg_sal_hist <- ggplot(jobs, aes(x = avg_salary)) 
  + geom_histogram(bins = 50, color = 1, fill = "olivedrab3") 
  + scale_x_continuous(labels = scales::comma)
  + labs(x = "Average Salary", y = "Number of Listings"))

The distribution of salaries is right skewed, with outliers toward the high end of the salary spectrum. The values range from about $15k to $254k, and the median falls just under $100k.

Here are some summary statistics for avg_salary.

summary(jobs$avg_salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.50   73.50   97.25  101.39  122.50  254.00

Analysis by Role Type

Faceted histograms and boxplots can be used to show the distribution of avg_salary by role_type.

Histograms of Salaries by Role

# Histograms of average salary faceted by role type
histos_salary_title <- ggplot(jobs, aes(x = avg_salary, fill = role_type)) + 
                              geom_histogram(binwidth = 10, color = 2) + 
                              labs(x = "Avg Salary (k)", y = "Listings", 
                                   title = "Listings by Role") + 
                              facet_wrap(~role_type, nrow = 3) + 
                              theme(legend.position="none")
histos_salary_title

Some role types have far fewer listings than others. From the histograms, it appears that data scientist roles have both the highest number of listings and the highest upper-end salaries. Director and data modeler roles have the fewest listings.

Listing Count by Role

How many listings are there for each position type?

groupby_role <- group_by(jobs,role_type) 

groupby_role_count <- groupby_role %>% 
                        summarize(Count=n(), Proportion = round(n()/nrow(jobs),2)) %>% 
                        arrange(desc(Count))
kable(groupby_role_count,"simple")
role_type Count Proportion
data scientist 310 0.42
other scientist 143 0.19
data engineer 120 0.16
analyst 109 0.15
ml engineer 22 0.03
ds proj. manager 16 0.02
other data prof. 8 0.01
data modeler 5 0.01
director 5 0.01

At 310 listings, data scientist roles account for about 42% of the dataset, while directors, data modelers, and other data professionals each make up only ~1% of the set.

Average Salary by Role

groupby_role_avg_salary <- groupby_role %>% 
                        summarize(Avg_SalaryK=round(mean(avg_salary),2)) %>% 
                        arrange(desc(Avg_SalaryK))
kable(groupby_role_avg_salary, "simple")
role_type Avg_SalaryK
director 130.80
ml engineer 126.43
data scientist 118.18
data engineer 105.35
other data prof. 91.06
other scientist 88.06
data modeler 77.50
ds proj. manager 73.22
analyst 66.34

Boxplots of Salaries by Role

Using boxplots we can visualize the 25th, 50th, and 75th percentiles of our data for each role type, along with their outliers.

ggplot(data=jobs, aes(fill=role_type)) +
  geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
                y=avg_salary)) + 
                coord_flip() +
                labs(x = "Role Type", 
                      y = "Average Salary", 
                      title = "Boxplots of Average Salary by Role")+
                              theme(legend.position="none")

Now we are starting to see some points of interest in our data.

Director roles have the highest mean salary, but also the fewest data points (along with data modelers).

Roles with more data points tend to have more outliers, with the highest avg_salary outliers falling in the data scientist category. Data scientists also have an outlier with the lowest salary.

I’m curious about what that outlier is, so let’s find it.

Locating an Outlier in the Data

The outlier is the only observation with avg_salary under 25k, so we can use a filter to locate this observation.

outlier <-groupby_role %>% filter(avg_salary<25)
kable(outlier[,c(1,2,4,8,19,20,21)],"simple")
index job_title company_name state lower_salary upper_salary avg_salary
239 Data Scientist NPD NY 15 16 15.5

Ah, so this is a data scientist position at NPD. Looking at Glassdoor.com, I found that this outlier must be an error: a data scientist position at NPD is listed at $128-204k per year. The scraper may have mistakenly grabbed the cash bonus values instead of the salary values.

Let’s fix this value:

jobs[239, "lower_salary"] = 128
jobs[239, "upper_salary"] = 204
# Take the average of both these values for the "avg_salary" column
jobs[239, "avg_salary"] = (jobs[239, "lower_salary"] + jobs[239, "upper_salary"])/2
kable(jobs[239, c(1,2,4,8,19,20,21)],"simple")
index job_title company_name state lower_salary upper_salary avg_salary
239 Data Scientist NPD NY 128 204 166
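As an aside, the same fix could be written as a single keyed update; a sketch using dplyr’s rows_update() (available in dplyr 1.0+):

# Sketch: equivalent fix as a keyed update on the index column
jobs <- rows_update(jobs,
                    tibble(index = 239, lower_salary = 128, upper_salary = 204,
                           avg_salary = (128 + 204) / 2),
                    by = "index")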

Skills Most Required by Role

From the skim output above, I immediately noticed that certain skills are required more often than others. Let’s visualize this with more clarity.

For each role type, we would like to look at the percentage of job listings that require certain technical skills. Our dataset covers the requirement of 16 different technical tools. In the ungrouped data, requirement is denoted by an indicator variable, where 1 means the skill is required and 0 means it is not.

By taking the mean of each skill column, we get the proportion of listings that require that skill. Let’s create a table that shows which skills are most required for each simplified job title.
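The key fact we rely on: the mean of a 0/1 indicator equals the proportion of 1s. A one-line check:

mean(c(1, 0, 1, 1))
## [1] 0.75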

We already grouped the jobs dataframe by role_type above (groupby_role), so we can summarize that grouped tbl by the mean of each tech skill column.

Role Skill Requirements
skill_importance <- groupby_role %>% 
                      summarize(Python = round(mean(python), 2),
                                Spark = round(mean(spark),2),
                                AWS = round(mean(aws),2),
                                Excel = round(mean(excel),2),
                                SQL = round(mean(sql),2),
                                SAS = round(mean(sas),2),
                                Keras = round(mean(keras),2),
                                Pytorch = round(mean(pytorch),2),
                                Scikit = round(mean(scikit),2),
                                Tensor = round(mean(tensor),2),
                                Hadoop = round(mean(hadoop),2),
                                Tableau = round(mean(tableau),2),
                                BI = round(mean(bi),2),
                                Flink = round(mean(flink),2),
                                Mongo = round(mean(mongo),2),
                                GoogleAn = round(mean(google_an),2))
kable(skill_importance, "simple")
role_type Python Spark AWS Excel SQL SAS Keras Pytorch Scikit Tensor Hadoop Tableau BI Flink Mongo GoogleAn
analyst 0.33 0.06 0.09 0.74 0.72 0.12 0.00 0.00 0.00 0.00 0.03 0.72 0.17 0.00 0.03 0.10
data engineer 0.65 0.57 0.50 0.45 0.73 0.01 0.00 0.00 0.00 0.02 0.42 0.73 0.03 0.05 0.11 0.00
data modeler 0.40 0.20 0.40 0.80 0.80 0.00 0.00 0.00 0.00 0.00 0.20 0.80 0.40 0.00 0.00 0.00
data scientist 0.76 0.26 0.25 0.50 0.56 0.16 0.09 0.11 0.15 0.19 0.19 0.56 0.09 0.01 0.06 0.01
director 0.00 0.00 0.20 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ds proj. manager 0.25 0.00 0.12 0.69 0.69 0.00 0.00 0.00 0.00 0.00 0.00 0.69 0.12 0.00 0.00 0.00
ml engineer 0.82 0.18 0.27 0.32 0.59 0.00 0.00 0.23 0.32 0.41 0.18 0.59 0.00 0.00 0.00 0.00
other data prof. 0.25 0.50 0.25 1.00 0.62 0.00 0.00 0.00 0.00 0.00 0.62 0.62 0.25 0.00 0.25 0.00
other scientist 0.08 0.00 0.10 0.45 0.02 0.01 0.00 0.01 0.00 0.01 0.00 0.02 0.00 0.00 0.00 0.00
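As an aside, the sixteen near-identical summarize lines could be collapsed with across(); a sketch, with the caveat that the column names stay lowercase rather than the display names used above:

# Sketch: the same per-role skill proportions via across()
skill_importance_alt <- groupby_role %>%
  summarize(across(c(python, spark, aws, excel, sql, sas, keras, pytorch, scikit,
                     tensor, hadoop, tableau, bi, flink, mongo, google_an),
                   ~ round(mean(.x), 2)))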

Great, but we want to make it easier to see at a glance which skills matter most across roles. We can rearrange the table so that the skill columns appear in descending order of their column means.

First we need to get the means of the second through 17th columns.

skill_means <- colMeans(skill_importance[2:17])
skill_means
##      Python       Spark         AWS       Excel         SQL         SAS 
## 0.393333333 0.196666667 0.242222222 0.594444444 0.525555556 0.033333333 
##       Keras     Pytorch      Scikit      Tensor      Hadoop     Tableau 
## 0.010000000 0.038888889 0.052222222 0.070000000 0.182222222 0.525555556 
##          BI       Flink       Mongo    GoogleAn 
## 0.117777778 0.006666667 0.050000000 0.012222222

Now we find the order of these means from highest to lowest and keep track of it so that we can rearrange our tibble. Because the ordering excludes the first column, we add 1 to each position returned by order() and then prepend 1 so that the role_type column keeps its original position.

skill_order <- rev(order(skill_means)+1)
skill_order <- append(1, skill_order)
skill_order
##  [1]  1  5 13  6  2  4  3 12 14 11 10 16  9  7 17  8 15
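To see why this works, try it on a toy table whose first column is role_type and whose skill columns a, b, c have means 0.2, 0.9, and 0.5:

# Toy check of the reordering logic (table columns: 1 = role_type, 2 = a, 3 = b, 4 = c)
m <- c(a = 0.2, b = 0.9, c = 0.5)
rev(order(m) + 1)
## [1] 3 4 2
append(1, rev(order(m) + 1))
## [1] 1 3 4 2

Column 3 (b, the highest mean) comes first, and role_type stays in front. The same result comes from order(m, decreasing = TRUE) + 1, which skips the rev().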

Now use these order values to rearrange the order of our columns by their column means.

Role Skill Requirements (Ordered)
skill_importance2 <- skill_importance[skill_order]
kable(skill_importance2, "simple")
role_type Excel Tableau SQL Python AWS Spark Hadoop BI Tensor Scikit Mongo Pytorch SAS GoogleAn Keras Flink
analyst 0.74 0.72 0.72 0.33 0.09 0.06 0.03 0.17 0.00 0.00 0.03 0.00 0.12 0.10 0.00 0.00
data engineer 0.45 0.73 0.73 0.65 0.50 0.57 0.42 0.03 0.02 0.00 0.11 0.00 0.01 0.00 0.00 0.05
data modeler 0.80 0.80 0.80 0.40 0.40 0.20 0.20 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
data scientist 0.50 0.56 0.56 0.76 0.25 0.26 0.19 0.09 0.19 0.15 0.06 0.11 0.16 0.01 0.09 0.01
director 0.40 0.00 0.00 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ds proj. manager 0.69 0.69 0.69 0.25 0.12 0.00 0.00 0.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ml engineer 0.32 0.59 0.59 0.82 0.27 0.18 0.18 0.00 0.41 0.32 0.00 0.23 0.00 0.00 0.00 0.00
other data prof. 1.00 0.62 0.62 0.25 0.25 0.50 0.62 0.25 0.00 0.00 0.25 0.00 0.00 0.00 0.00 0.00
other scientist 0.45 0.02 0.02 0.08 0.10 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.01 0.00 0.00 0.00

Great! Now we can clearly see that for most role types, Excel is required most often, followed by Tableau, SQL, and Python.

It’s important to remember that these orderings may differ from each skill’s overall importance: each role type has a different listing count, so a mean of per-role means is not the same as the overall mean. Still, the tibble is easier to read when ordered this way.

But let’s also chart each skill’s requirement rate within each role to compare.

skill_count <- groupby_role %>% 
                              summarise(
                                Excel = mean(excel),
                                Tableau = mean(tableau),
                                SQL = mean(sql),
                                Python = mean(python),
                                AWS = mean(aws),
                                Spark = mean(spark),
                                Hadoop = mean(hadoop),
                                BI = mean(bi),
                                Tensor = mean(tensor),
                                Scikit = mean(scikit),
                                Mongo = mean(mongo),
                                Pytorch = mean(pytorch),
                                SAS = mean(sas),
                                GoogleAn = mean(google_an),
                                Keras = mean(keras),
                                Flink = mean(flink)) %>% 
                          gather( "category", "mean", -role_type)

ggplot(skill_count, aes(category, mean)) + 
  geom_col(aes(fill = category), position = "dodge") + facet_wrap(~role_type, ncol = 3) + 
                              theme(legend.position="none") + 
                          labs(x = "Skill", 
                                y = "Mean Required by Role", 
                                title = "Skill Requirements by Role") + 
                                theme(axis.text.x = element_text(angle = 45, 
                                                                 hjust = 1, size = 8))

  • Python is required most for data scientist and machine learning engineer positions.
  • Excel is required most for other data professionals, data modelers, and analysts.
  • SQL is required most for data modelers, data engineers, and analysts.

By Timezone

Examining our dataset by timezone can help us uncover locational differences in roles and salaries.

by_timezone <- group_by(jobs, timezone)

Listing Count

listings_by_timezone_bar <- ggplot(jobs, aes(x = timezone, fill=timezone)) + 
                              geom_bar() +
                              labs(x = "Timezone", 
                                   y = "Listing Count", 
                                   title = "Listing Count by Timezone")
listings_by_timezone_bar

Boxplots of Salaries by Timezone

salary_by_timezone_box <- ggplot(jobs, aes(x = timezone, y = avg_salary, fill= timezone)) + geom_boxplot() +
                                 labs(x = "Timezone", 
                                   y = "Average Salary", 
                                   title = "Boxplots of Average Salary by Timezone")+
                              theme(legend.position="none")
salary_by_timezone_box

Histograms of Salaries by Timezone

# Histograms of salary faceted by timezone
histos_salary_timezone <- ggplot(jobs, aes(x = avg_salary, fill = timezone)) + 
                              geom_histogram(binwidth = 10, color = 1) + 
                              labs(x = "Average Salary") + facet_wrap(~timezone, nrow = 3) +
                              theme(legend.position="none")
histos_salary_timezone

Eastern Timezone

I’m interested in looking at jobs in the Eastern timezone, so I’m going to filter the dataset by timezone == "Eastern".

# Filter by_timezone by timezone="Eastern"
eastern_timezone_filter <- filter(jobs, timezone=="Eastern")

Listing Count by State in Eastern Timezone

Which states in the Eastern timezone have the highest listing counts?

count_by_state_eastern <- filter(group_by(jobs, state), timezone=="Eastern") %>% 
                    summarize(Count=n()) %>% 
                    arrange(desc(Count))
kable(count_by_state_eastern, "simple")
state Count
MA 103
NY 69
VA 41
MD 35
PA 33
NC 21
NJ 17
FL 16
OH 14
DC 11
IN 10
DE 6
GA 6
KY 6
MI 6
CT 5
RI 1
SC 1

And let’s again look at the boxplots of salaries by roles within the Eastern timezone.

eastern_salary_by_role_box <- ggplot(data=eastern_timezone_filter, aes(fill=role_type)) +
                          geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
                          y=avg_salary)) + coord_flip() + 
                          labs(x = "Role Type", 
                                y = "Average Salary", 
                                title = "Boxplots of Average Salary by Role in Eastern Timezone") +
                              theme(legend.position="none")
eastern_salary_by_role_box

Overall, the trend of these boxplots is the same as in “Boxplots of Average Salary by Role.”

Standardized Salaries

We have the variable value_of_dollar, so we can use it to standardize the salaries in the dataset and get a better idea of how far a salary goes in the listing’s state. We will make a new column called std_salary that multiplies avg_salary by value_of_dollar; for example, a $100k salary in a state where a dollar is worth $1.10 standardizes to $110k.

jobs$std_salary <- jobs$avg_salary*jobs$value_of_dollar
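A quick spot check that the new column behaves as expected (the selected columns are just for display):

# Peek at a few rows of the standardized salaries
head(jobs[, c("state", "avg_salary", "value_of_dollar", "std_salary")], 3)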

Looking at boxplots of std_salary by timezone:

std_salary_by_timezone_box <- ggplot(jobs, aes(x = timezone, y = std_salary, fill=timezone)) + geom_boxplot() +
                                 labs(x = "Timezone", 
                                   y = "Standardized Salary", 
                                   title = "Boxplots of Standardized Salary by Timezone")+
                              theme(legend.position="none")
std_salary_by_timezone_box

Here we see that the median salaries between the timezones are much closer together, which makes sense since they have all been standardized.

boxplots_std_sal_job_title <- ggplot(data=jobs, aes(fill=role_type)) +
                geom_boxplot(mapping = aes(x= reorder(role_type, std_salary, FUN= median),
                y=std_salary)) + 
                coord_flip() +
                labs(x = "Role Type", y = "Standardized Salary", 
                     title = "Boxplots of Standardized Salary by Role")+
                              theme(legend.position="none")
boxplots_std_sal_job_title

Standardized Salaries By State

Here’s a table of the top 10 states by un-standardized avg_salary:

# Same table as std_sal_by_state (below) but with unstandardized salaries.
by_state <- group_by(jobs, state)
avg_sal_by_state <- summarize(by_state, 
          avg_salary=round(mean(avg_salary),2))
avg_sal_by_state <- avg_sal_by_state %>%
                      arrange(desc(avg_salary))
kable(head(avg_sal_by_state, 10),"simple")
state avg_salary
CA 123.75
IL 116.66
DC 110.18
MA 107.50
NJ 104.56
MI 100.25
RI 100.00
NY 99.46
NC 98.45
MD 97.36

Compare that with the table of the top standardized salaries by state.

std_sal_by_state <- summarize(by_state, 
          std_salary=round(mean(std_salary),2))
std_sal_by_state <- std_sal_by_state %>%
                      arrange(desc(std_salary))
kable(head(std_sal_by_state, 10), "simple")
state std_salary
IL 130.66
NJ 117.11
KY 110.58
NY 109.41
MI 108.27
CA 107.66
NC 105.34
DC 103.57
IN 101.14
RI 101.00

Interesting! California slides down from 1st place to 6th. Kentucky moves up from 11th to 3rd. These, crudely put, are the top 10 states for getting the best value out of your salary.

count_by_state <- by_state %>% 
                    summarize(Count=n()) %>% 
                    arrange(desc(Count))
kable(count_by_state, "simple")
state Count
CA 151
MA 103
NY 69
VA 41
IL 40
MD 35
PA 33
TX 28
NC 21
WA 21
NJ 17
FL 16
OH 14
TN 13
CO 11
DC 11
IN 10
UT 10
WI 10
AZ 9
MO 9
AL 8
DE 6
GA 6
KY 6
MI 6
CT 5
IA 5
LA 4
NE 4
OR 4
KS 3
NM 3
ID 2
MN 2
RI 1
SC 1

North Carolina Jobs

I have heard that the job market is hot in North Carolina, so let’s check out what kind of jobs are listed there.

Listings

# Filter jobs by state="NC"
filter_by_state_nc <- filter(jobs, state=="NC")
kable(filter_by_state_nc[,c(1,2,4,21)])
index job_title company_name avg_salary
104 Data Engineer CapTech 95.5
118 Senior Data Engineer Red Ventures 134.0
250 Data Scientist Genworth 102.0
318 Senior Data Scientist Brighthouse Financial 124.5
356 Scientist - Analytical Services Reynolds American 99.5
362 Senior Scientist - Regulatory Submissions Reynolds American 117.5
377 Senior Data Engineer Genesys 112.5
411 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
413 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5
419 Senior Scientist - Biostatistician Reynolds American 80.5
448 Senior Data Scientist Brighthouse Financial 124.5
507 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
531 Scientist - Analytical Services Reynolds American 99.5
540 Senior Scientist - Regulatory Submissions Reynolds American 117.5
567 Senior Data Engineer Genesys 112.5
619 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
621 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5
633 Senior Scientist - Biostatistician Reynolds American 80.5
684 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
712 Scientist - Analytical Services Reynolds American 99.5
722 Senior Scientist - Regulatory Submissions Reynolds American 117.5

Hottest Industry in NC

What’s the ‘hottest’ industry in North Carolina based on our data?

filter_by_state_nc %$% industry %>% table %>% which.max %>% names
## [1] "Consumer Products Manufacturing"

Hottest Sector in NC

What’s the ‘hottest’ sector in North Carolina based on our data?

filter_by_state_nc %$% sector %>% table %>% which.max %>% names
## [1] "Manufacturing"

Hottest Role in NC

What’s the ‘hottest’ data role in North Carolina?

filter_by_state_nc %$% role_type %>% table %>% which.max %>% names
## [1] "other scientist"

Salary Distribution by Sector

What do the avg_salary distributions look like by sector in North Carolina?

nc_salary_by_sector_box <- ggplot(data=filter_by_state_nc, aes(fill=sector)) +
                          geom_boxplot(mapping = aes(x= reorder(sector, avg_salary, 
                                                                FUN= median),
                          y=avg_salary)) + coord_flip() + 
                          labs(x = "Sector", 
                                y = "Average Salary", 
                                title = "Boxplots of Average Salary by Sector in NC")+
                              theme(legend.position="none")
nc_salary_by_sector_box

Salary Distribution by Role

What do the avg_salary distributions look like by role_type in North Carolina?

nc_salary_by_role_box <- ggplot(data=filter_by_state_nc, aes(fill=role_type)) +
                          geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
                          y=avg_salary)) + coord_flip() + 
                          labs(x = "Role Type", 
                                y = "Average Salary", 
                                title = "Boxplots of Average Salary by Role in NC")+
                              theme(legend.position="none")
nc_salary_by_role_box

Skills Required

# What skills are required most in the roles in North Carolina?
skill_count_nc <- filter_by_state_nc %>% 
                  group_by(role_type) %>% 
                                    summarise(
                                      Excel = mean(excel),
                                      Tableau = mean(tableau),
                                      SQL = mean(sql),
                                      Python = mean(python),
                                      AWS = mean(aws),
                                      Spark = mean(spark),
                                      Hadoop = mean(hadoop),
                                      BI = mean(bi),
                                      Tensor = mean(tensor),
                                      Scikit = mean(scikit),
                                      Mongo = mean(mongo),
                                      Pytorch = mean(pytorch),
                                      SAS = mean(sas),
                                      GoogleAn = mean(google_an),
                                      Keras = mean(keras),
                                      Flink = mean(flink)) %>% 
                                    gather( "category", "mean", -role_type)

ggplot(skill_count_nc, aes(category, mean)) + 
  geom_col(aes(fill = category), position = "dodge") + facet_grid(.~role_type) + coord_flip() + 
                              theme(legend.position="none") + 
                          labs(x = "Skill", 
                                y = "Mean Required by Role", 
                                title = "Skill Requirements by Role in NC")

Between data engineers, data scientists and other scientists in NC:

  • A higher proportion of data scientist positions require knowledge of Python, Spark, and Excel than other roles.
  • A higher proportion of data engineering positions require knowledge of SQL, Hadoop, and AWS than other positions.
  • Of the skills tracked, other scientist listings only require Excel.

What are the other scientist jobs in North Carolina?

other_scientist_nc <- filter(jobs, state == "NC" & role_type=="other scientist") %>% 
                    arrange(desc(std_salary))
kable(other_scientist_nc[,c(1,2,4,21)], "simple")
index job_title company_name avg_salary
362 Senior Scientist - Regulatory Submissions Reynolds American 117.5
540 Senior Scientist - Regulatory Submissions Reynolds American 117.5
722 Senior Scientist - Regulatory Submissions Reynolds American 117.5
356 Scientist - Analytical Services Reynolds American 99.5
531 Scientist - Analytical Services Reynolds American 99.5
712 Scientist - Analytical Services Reynolds American 99.5
419 Senior Scientist - Biostatistician Reynolds American 80.5
633 Senior Scientist - Biostatistician Reynolds American 80.5
411 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
619 Senior Scientist - Toxicologist - Product Integrity (Stewardship) Reynolds American 74.0
413 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5
621 Senior Research Analytical Scientist-Non-Targeted Analysis RTI International 65.5

Oh wow! Most of these jobs are for a company called Reynolds American, which is a tobacco company. Positions for toxicologists do not seem like they belong in this dataset.

These jobs don’t look appealing to me. But there are two other role types that do look interesting: data engineer and data modeler. I want to look at these positions in the Eastern timezone and sort them by std_salary.

Eastern Timezone Data Engineers and Modelers

engineer_modeler_eastern <- filter(jobs, timezone == "Eastern" & 
                                     role_type %in% c("data engineer", "data modeler")) %>% 
  arrange(desc(std_salary))

kable(engineer_modeler_eastern[,c(1,2,4,21)],"simple")
index job_title company_name avg_salary
15 Data Engineer I Audible 146.0
118 Senior Data Engineer Red Ventures 134.0
340 Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence Takeda Pharmaceuticals 154.5
500 Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence Takeda Pharmaceuticals 154.5
673 Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence Takeda Pharmaceuticals 154.5
358 Sr. Data Engineer Moser Consulting 122.5
534 Sr. Data Engineer Moser Consulting 122.5
715 Sr. Data Engineer Moser Consulting 122.5
137 Data Engineer Gridiron IT 132.5
584 Big Data Engineer Peraton 118.5
460 Data Engineer GNY Insurance Companies 110.5
377 Senior Data Engineer Genesys 112.5
567 Senior Data Engineer Genesys 112.5
39 Data Engineer Lancer Insurance 106.0
121 Data Engineer Upside Business Travel 122.0
315 Data Engineer SpringML 109.0
443 Data Engineer SpringML 109.0
595 Data Engineer Maxar Technologies 105.5
104 Data Engineer CapTech 95.5
299 Big Data Engineer SpringML 90.5
566 Senior Data Engineer Novetta 97.0
351 Data Engineer MetroStar Systems 109.0
523 Data Engineer MetroStar Systems 109.0
703 Data Engineer MetroStar Systems 109.0
93 Data Modeler IT Concepts 106.5
295 Data Engineer ManTech 92.0
692 Data Architect / Data Modeler Medidata Solutions 86.5
436 Data Engineer Numeric, LLC 91.5
288 Data Engineer, Data Engineering and Artifical Intelligence Takeda Pharmaceuticals 102.5
507 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
684 Data Engineer - Consultant (Charlotte Based) Clarity Insights 85.5
84 Data Engineer Excella Consulting 87.0
41 Data Engineer Persivia 98.5
383 Data Engineer BRMi 85.0
580 Data Engineer BRMi 85.0
110 Data Engineer Sapphire Digital 78.0
625 Data Engineer Carilion Clinic 81.0
307 Data Engineer MassMutual 92.0
423 Data Engineer MassMutual 92.0
568 Data Modeler (Analytical Systems) MassMutual 91.5
367 IT - Data Engineer II Arbella Insurance 90.0
547 IT - Data Engineer II Arbella Insurance 90.0
728 IT - Data Engineer II Arbella Insurance 90.0
144 Data Engineer AVANADE 87.0
151 Data Engineer Cogo Labs 84.5
213 Data Engineer Cogo Labs 84.5
338 Data Engineer Guidepoint 64.5
497 Data Engineer Guidepoint 64.5
169 Data Engineer IZEA 71.5
256 Data Engineer IZEA 71.5
317 Associate Data Engineer EAB 65.5
442 Associate Data Engineer EAB 65.5
379 Data Modeler - Data Solutions Engineer Liberty Mutual Insurance 51.5
569 Data Modeler - Data Solutions Engineer Liberty Mutual Insurance 51.5

Now these jobs look interesting.

What kind of skills do they require?

Skills Required

# What skills are required most by data engineers and data modelers in the Eastern timezone?
filterby_eastern_tz <- filter(jobs, timezone == "Eastern")
skill_count_eastern_mod_eng <- filterby_eastern_tz %>% 
                                group_by(role_type) %>% 
                                filter(role_type %in% c("data modeler", "data engineer")) %>% 
                                    summarise(
                                      Excel = mean(excel),
                                      Tableau = mean(tableau),
                                      SQL = mean(sql),
                                      Python = mean(python),
                                      AWS = mean(aws),
                                      Spark = mean(spark),
                                      Hadoop = mean(hadoop),
                                      BI = mean(bi),
                                      Tensor = mean(tensor),
                                      Scikit = mean(scikit),
                                      Mongo = mean(mongo),
                                      Pytorch = mean(pytorch),
                                      SAS = mean(sas),
                                      GoogleAn = mean(google_an),
                                      Keras = mean(keras),
                                      Flink = mean(flink)) %>% 
                                    gather( "category", "mean", -role_type)

ggplot(skill_count_eastern_mod_eng, aes(category, mean)) + 
  geom_col(aes(fill = category), position = "dodge") + facet_grid(.~role_type) + coord_flip() + 
                              theme(legend.position="none") + 
                          labs(x = "Skill", 
                                y = "Mean Required by Role", 
                                title = "Skill Requirements of Eastern Data Engineers and Modelers")

Between data engineers and data modelers in the Eastern timezone:

  • A higher proportion of data engineering positions require knowledge of Python than data modeling positions.
  • A higher proportion of data modeling positions require knowledge of SQL and Excel than data engineering positions.

Closing Thoughts

It’s amazing how powerful a tool EDA is for aiding decision-making. We were able to ask questions of the dataset and get concrete, data-backed answers. Overall, we were able to:

  1. Determine the overall distribution of salaries within the dataset.
  2. Standardize salaries for comparison of value.
  3. Determine distributions of salaries by role_type, timezone, state, and sector.
  4. Determine skill requirements by role_type overall and by timezone and state.