library(dplyr)
library(ggplot2)
library(skimr)
library(dplyr)
library(tidyverse)
library(readxl)
library(rpart)
library(rpart.plot)
library(formattable)
library(rattle)
library(knitr)
jobs <- read_excel("jobs_2021.xlsx", col_names = TRUE)
This project serves as a guide to exploratory data analysis of a dataset of data-related careers. The result is an exploration of my own career interests with the intent of giving the reader ideas of how to perform their own analysis based on their personal career preferences.
The dataset used for this project was collected by Kaggle user Nikhil Bhathi, and was scraped from Glassdoor.com.
timezone
to the dataset using Excel, which transforms the state names into timezones via a method given here.value_of_dollar
that includes the value of a dollar based on the state of the listingmedian_rent
that includes the median monthly rent based on the state of the listingVLOOKUP
and a table provided hereindex
: An index for ease of reference
job_title
: Job title provided with the listing
role_type
: Simplified job title
company_name
: Name of the company
rating
: Rating of the company at time of scraping
timezone
: Timezone of state of listing (majority ruling for states with multiple timezones)
city
: City of job listing
state
: State of job listing
median_rent
: Median rent in the state of the job listing
value_of_dollar
: Value of a dollar based on state of listing
location
: City, State of listing
hq_location
: City, State of headquarters of company of listing
size
: Size of the company of the listing
founded
: Year the company of the listing was founded
ownership
: Ownership type of the company of the listing
industry
: Industry of the company of the listing
sector
: Sector of the industry of the listing
revenue
: Total revenue of the company per year
lower_salary
: Lower end of the Glassdoor estimate of salary for position
upper_salary
: Upper end of the Glassdoor estimate of salary for position
avg_salary
: Average of upper and lower salary for position
seniority_by_title
: Level of seniority determined by job title
degree_req
: Degree requirement by listing
python
: 1 If Python skill is required, 0 Otherwise
spark
: 1 If Spark skill is required, 0 Otherwise
aws
: 1 If AWS skill is required, 0 Otherwise
excel
: 1 If Excel skill is required, 0 Otherwise
sql
: 1 If SQL skill is required, 0 Otherwise
sas
: 1 If SAS skill is required, 0 Otherwise
keras
: 1 If Keras skill is required, 0 Otherwise
pytorch
: 1 If Pytortch skill is required, 0 Otherwise
scikit
: 1 If Scikit skill is required, 0 Otherwise
tensor
: 1 If Tensor skill is required, 0 Otherwise
tableau
: 1 If Tableau skill is required, 0 Otherwise
bi
: 1 If PowerBI skill is required, 0 Otherwise
flink
: 1 If Flink skill is required, 0 Otherwise
mongo
: 1 If MongoDB skill is required, 0 Otherwise
google_an
: 1 If Google Analytics skill is required, 0 Otherwise
Let’s skim through our data to see what datatypes the columns were imported as.
skim(jobs)
Name | jobs |
Number of rows | 738 |
Number of columns | 39 |
_______________________ | |
Column type frequency: | |
character | 15 |
numeric | 24 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
job_title | 0 | 1 | 9 | 98 | 0 | 263 | 0 |
role_type | 0 | 1 | 7 | 16 | 0 | 9 | 0 |
company_name | 0 | 1 | 2 | 51 | 0 | 340 | 0 |
timezone | 0 | 1 | 7 | 8 | 0 | 4 | 0 |
city | 0 | 1 | 4 | 29 | 0 | 195 | 0 |
state | 0 | 1 | 2 | 2 | 0 | 37 | 0 |
location | 0 | 1 | 8 | 33 | 0 | 200 | 0 |
hq_location | 0 | 1 | 8 | 26 | 0 | 196 | 0 |
size | 0 | 1 | 6 | 12 | 0 | 8 | 0 |
ownership | 0 | 1 | 8 | 30 | 0 | 9 | 0 |
industry | 0 | 1 | 6 | 40 | 0 | 59 | 0 |
sector | 0 | 1 | 5 | 34 | 0 | 24 | 0 |
revenue | 0 | 1 | 18 | 32 | 0 | 13 | 0 |
seniority_by_title | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
degree_req | 0 | 1 | 1 | 2 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
index | 0 | 1 | 369.50 | 213.19 | 1.00 | 185.25 | 369.50 | 553.75 | 738.00 | ▇▇▇▇▇ |
rating | 0 | 1 | 3.63 | 0.76 | -1.00 | 3.30 | 3.70 | 4.00 | 5.00 | ▁▁▁▇▅ |
median_rent | 0 | 1 | 1768.20 | 548.27 | 1047.00 | 1253.00 | 1590.00 | 2252.00 | 2518.00 | ▇▃▂▁▇ |
value_of_dollar | 0 | 1 | 0.99 | 0.10 | 0.85 | 0.91 | 0.99 | 1.10 | 1.14 | ▆▇▂▆▇ |
founded | 0 | 1 | 1969.71 | 54.28 | 1744.00 | 1958.00 | 1992.00 | 2007.00 | 2019.00 | ▁▁▁▂▇ |
lower_salary | 0 | 1 | 74.61 | 30.94 | 15.00 | 52.00 | 69.00 | 91.00 | 202.00 | ▅▇▃▁▁ |
upper_salary | 0 | 1 | 128.16 | 45.20 | 16.00 | 96.00 | 124.00 | 154.75 | 306.00 | ▂▇▆▂▁ |
avg_salary | 0 | 1 | 101.39 | 37.53 | 15.50 | 73.50 | 97.25 | 122.50 | 254.00 | ▂▇▅▁▁ |
python | 0 | 1 | 0.53 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
spark | 0 | 1 | 0.22 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
aws | 0 | 1 | 0.24 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
excel | 0 | 1 | 0.52 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
sql | 0 | 1 | 0.51 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
sas | 0 | 1 | 0.09 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
keras | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
pytorch | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
scikit | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
tensor | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hadoop | 0 | 1 | 0.17 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
tableau | 0 | 1 | 0.20 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
bi | 0 | 1 | 0.08 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
flink | 0 | 1 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
mongo | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
google_an | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
There are a number of character columns that should be changed to factors.
job_title
has 263 unique values, which is too many to change the column to factor type. However, role_type
only has 9 unique values, which is a reasonable amount to change the column to factor type. Other columns we should change to factors are:
role_type
state
size
ownership
industry
sector
revenue
seniority_by_title
degree_req
# Change the columns listed above from character to factor type
jobs$role_type <- as.factor(jobs$role_type)
jobs$state <- as.factor(jobs$state)
jobs$ownership <- as.factor(jobs$ownership)
jobs$industry <- as.factor(jobs$industry)
jobs$sector <- as.factor(jobs$sector)
jobs$seniority_by_title <- as.factor(jobs$seniority_by_title)
jobs$degree_req <- as.factor(jobs$degree_req)
Since size
and revenue
are ordinal variables, we should also make sure to set them with the correct ordering.
Let’s check what their levels are:
unique(jobs$size)
## [1] "501 - 1000" "10000+" "1001 - 5000" "51 - 200" "201 - 500"
## [6] "5001 - 10000" "unknown" "1 to 50"
unique(jobs$revenue)
## [1] "$50 to $100 million (USD)" "$2 to $5 billion (USD)"
## [3] "$100 to $500 million (USD)" "$500 million to $1 billion (USD)"
## [5] "Unknown / Non-Applicable" "$1 to $2 billion (USD)"
## [7] "$25 to $50 million (USD)" "$10+ billion (USD)"
## [9] "$1 to $5 million (USD)" "$10 to $25 million (USD)"
## [11] "$5 to $10 billion (USD)" "Less than $1 million (USD)"
## [13] "$5 to $10 million (USD)"
We also want to order timezone
from west to east. Its levels are:
unique(jobs$timezone)
## [1] "Mountain" "Eastern" "Pacific" "Central"
jobs$size <- factor(c(jobs$size),
levels = c("unknown", "1 to 50", "51 - 200", "201 - 500", "501 - 1000", "1001 - 5000",
"5001 - 10000", "10000+"), ordered = T)
jobs$revenue <- factor(c(jobs$revenue),
levels = c( "Unknown / Non-Applicable", "Less than $1 million (USD)",
"$1 to $5 million (USD)","$5 to $10 million (USD)",
"$10 to $25 million (USD)", "$25 to $50 million (USD)",
"$50 to $100 million (USD)","$100 to $500 million (USD)",
"$500 million to $1 billion (USD)", "$1 to $2 billion (USD)",
"$2 to $5 billion (USD)", "$5 to $10 billion (USD)",
"$10+ billion (USD)"), ordered = T)
jobs$timezone <- factor(c(jobs$timezone),
levels = c("Pacific","Mountain","Central","Eastern"), ordered = T)
Now that we have made these adjustments, it’s time to start digging into some exploratory data analysis!
We want to answer some questions about the dataset. These questions include:
role_type
, timezone
, state
, and sector
?role_type
overall and by individual timezone
and state
?What is the overall distribution of the salaries?
(avg_sal_hist <- ggplot(jobs, aes(x = avg_salary))
+ geom_histogram(bins = 50,color=1, fill="olivedrab3")
+ scale_x_continuous(labels = scales::comma)) + labs(x = "Average Salary", y = "Number of Listings")
The distribution of salaries is right skewed, with more outliers toward the high end of the salary spectrum. The values appear to range from $15k to $210k and the median appears to fall around $90k.
Here are some summary statistics for avg_salary
.
summary(jobs$avg_salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.50 73.50 97.25 101.39 122.50 254.00
Faceted histograms and boxplots can be used to show the distribution of avg_salary
by role_type
.
# Histograms of salary faceted by job title
histos_salary_title <- ggplot(jobs, aes(x = lower_salary, color = 2, fill = role_type)) +
geom_histogram(binwidth = 10) +
labs(x = "Avg Salary (k)", y = "Listings",
title = "Listings by Role") +
facet_wrap(~role_type, nrow = 3) +
theme(legend.position="none")
histos_salary_title
Some role types have a lot fewer listings than others. From the histograms it appears that data scientist roles have the highest number of listings as well as the highest upper end salary. Director and data modeler roles have the fewest listings.
How many listings there are for each position type?
groupby_role <- group_by(jobs,role_type)
groupby_role_count <- groupby_role %>%
summarize(Count=n(), Proportion = round(n()/nrow(jobs),2)) %>%
arrange(desc(Count))
kable(groupby_role_count,"simple")
role_type | Count | Proportion |
---|---|---|
data scientist | 310 | 0.42 |
other scientist | 143 | 0.19 |
data engineer | 120 | 0.16 |
analyst | 109 | 0.15 |
ml engineer | 22 | 0.03 |
ds proj. manager | 16 | 0.02 |
other data prof. | 8 | 0.01 |
data modeler | 5 | 0.01 |
director | 5 | 0.01 |
At 310 listings, data scientist roles take up nearly half of the dataset, while directors, data modelers, and other data professionals each take up only ~1% of the set.
groupby_role_avg_salary <- groupby_role %>%
summarize(Avg_SalaryK=round(mean(avg_salary),2)) %>%
arrange(desc(Avg_SalaryK))
kable(groupby_role_avg_salary, "simple")
role_type | Avg_SalaryK |
---|---|
director | 130.80 |
ml engineer | 126.43 |
data scientist | 118.18 |
data engineer | 105.35 |
other data prof. | 91.06 |
other scientist | 88.06 |
data modeler | 77.50 |
ds proj. manager | 73.22 |
analyst | 66.34 |
Using boxplots we can visualize the 25th, 50th, and 75th percentiles of our data for each role type, along with their outliers.
ggplot(data=jobs, aes(fill=role_type)) +
geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
y=avg_salary)) +
coord_flip() +
labs(x = "Role Type",
y = "Average Salary",
title = "Boxplots of Average Salary by Role")+
theme(legend.position="none")
Now we are starting to see some points of interest in our data.
Director roles have the highest mean salary, but also have the least amount of data points (along with data modelers).
Roles with more data points tend to have more outliers, with the highest avg_salary
outliers falling in the data scientist category. Data scientists also have an outlier with the lowest salary.
I’m curious about what that outlier is, so let’s find it.
The outlier is the only observation falling under avg_salary
of 25k, so we can use a filter to locate this oberservation.
outlier <-groupby_role %>% filter(avg_salary<25)
kable(outlier[,c(1,2,4,8,19,20,21)],"simple")
index | job_title | company_name | state | lower_salary | upper_salary | avg_salary |
---|---|---|---|---|---|---|
239 | Data Scientist | NPD | NY | 15 | 16 | 15.5 |
Ah, so this is a data scientist position for NPD. Looking to Glassdoor.com, I found that there must be an error with this outlier, as a data scientist position for NPD is listed as being between $128-204k per year. The scraper tool may have mistakenly scraped the cash bonus values instead of the salary bonus values.
Let’s fix this value:
jobs[239, "lower_salary"] = 128
jobs[239, "upper_salary"] = 204
# Take the average of both these values for the "avg_salary" column
jobs[239, "avg_salary"] = (jobs[239, "lower_salary"] + jobs[239, "upper_salary"])/2
kable(jobs[239, c(c(1,2,4,8,19,20,21))],"simple")
index | job_title | company_name | state | lower_salary | upper_salary | avg_salary |
---|---|---|---|---|---|---|
239 | Data Scientist | NPD | NY | 128 | 204 | 166 |
From the skim output above I immediately noticed that there are certain skills than are more required than others. Let’s visualize this with more clarity.
For each job title, we would like to look at the percentage of job listings that require certain technical skills. In our dataset we have information on the requirement of 16 different technical tools for the job. In the ungrouped data, requirement is denoted by an indicator variable, where 1 means the skill is required and 0 means it is not.
By finding the mean of each skill column we get the percentage of listings that require that skill. Let’s create a table that helps us visualize which skills are most required for each simplified job title.
First we create a grouped tbl where we group our jobs
dataframe by role_type
.
Now we summarize groupby_role
by the means of each tech skill.
skill_importance <- groupby_role %>%
summarize(Python = round(mean(python), 2),
Spark = round(mean(spark),2),
AWS = round(mean(aws),2),
Excel = round(mean(excel),2),
SQL = round(mean(sql),2),
SAS = round(mean(sas),2),
Keras = round(mean(keras),2),
Pytorch = round(mean(pytorch),2),
Scikit = round(mean(scikit),2),
Tensor = round(mean(tensor),2),
Hadoop = round(mean(hadoop),2),
Tableau = round(mean(sql),2),
BI = round(mean(bi),2),
Flink = round(mean(flink),2),
Mongo = round(mean(mongo),2),
GoogleAn = round(mean(google_an),2))
kable(skill_importance, "simple")
role_type | Python | Spark | AWS | Excel | SQL | SAS | Keras | Pytorch | Scikit | Tensor | Hadoop | Tableau | BI | Flink | Mongo | GoogleAn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
analyst | 0.33 | 0.06 | 0.09 | 0.74 | 0.72 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.72 | 0.17 | 0.00 | 0.03 | 0.10 |
data engineer | 0.65 | 0.57 | 0.50 | 0.45 | 0.73 | 0.01 | 0.00 | 0.00 | 0.00 | 0.02 | 0.42 | 0.73 | 0.03 | 0.05 | 0.11 | 0.00 |
data modeler | 0.40 | 0.20 | 0.40 | 0.80 | 0.80 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | 0.80 | 0.40 | 0.00 | 0.00 | 0.00 |
data scientist | 0.76 | 0.26 | 0.25 | 0.50 | 0.56 | 0.16 | 0.09 | 0.11 | 0.15 | 0.19 | 0.19 | 0.56 | 0.09 | 0.01 | 0.06 | 0.01 |
director | 0.00 | 0.00 | 0.20 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
ds proj. manager | 0.25 | 0.00 | 0.12 | 0.69 | 0.69 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.69 | 0.12 | 0.00 | 0.00 | 0.00 |
ml engineer | 0.82 | 0.18 | 0.27 | 0.32 | 0.59 | 0.00 | 0.00 | 0.23 | 0.32 | 0.41 | 0.18 | 0.59 | 0.00 | 0.00 | 0.00 | 0.00 |
other data prof. | 0.25 | 0.50 | 0.25 | 1.00 | 0.62 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.62 | 0.62 | 0.25 | 0.00 | 0.25 | 0.00 |
other scientist | 0.08 | 0.00 | 0.10 | 0.45 | 0.02 | 0.01 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
Great, but we want to make it easier to see at a glance which skills are most important for the most roles. We can rearrange the table so that the skill columns are organized in descending order by their column means.
First we need to get the means of the second through 17th columns.
skill_means <- colMeans(skill_importance[2:17])
skill_means
## Python Spark AWS Excel SQL SAS
## 0.393333333 0.196666667 0.242222222 0.594444444 0.525555556 0.033333333
## Keras Pytorch Scikit Tensor Hadoop Tableau
## 0.010000000 0.038888889 0.052222222 0.070000000 0.182222222 0.525555556
## BI Flink Mongo GoogleAn
## 0.117777778 0.006666667 0.050000000 0.012222222
And now we want to find the order of these means from highest to lowest and keep track of that order so that we can rearrange our tibble. Because the ordering excludes the first column we must add 1 to the value of the column orders and then append 1 to the beginning to keep the role_type column in its original position.
skill_order <- rev(order(skill_means)+1)
skill_order <- append(1, skill_order)
skill_order
## [1] 1 5 13 6 2 4 3 12 14 11 10 16 9 7 17 8 15
Now use these order values to rearrange the order of our columns by their column means.
skill_importance2 <- skill_importance[skill_order]
kable(skill_importance2, "simple")
role_type | Excel | Tableau | SQL | Python | AWS | Spark | Hadoop | BI | Tensor | Scikit | Mongo | Pytorch | SAS | GoogleAn | Keras | Flink |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
analyst | 0.74 | 0.72 | 0.72 | 0.33 | 0.09 | 0.06 | 0.03 | 0.17 | 0.00 | 0.00 | 0.03 | 0.00 | 0.12 | 0.10 | 0.00 | 0.00 |
data engineer | 0.45 | 0.73 | 0.73 | 0.65 | 0.50 | 0.57 | 0.42 | 0.03 | 0.02 | 0.00 | 0.11 | 0.00 | 0.01 | 0.00 | 0.00 | 0.05 |
data modeler | 0.80 | 0.80 | 0.80 | 0.40 | 0.40 | 0.20 | 0.20 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
data scientist | 0.50 | 0.56 | 0.56 | 0.76 | 0.25 | 0.26 | 0.19 | 0.09 | 0.19 | 0.15 | 0.06 | 0.11 | 0.16 | 0.01 | 0.09 | 0.01 |
director | 0.40 | 0.00 | 0.00 | 0.00 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
ds proj. manager | 0.69 | 0.69 | 0.69 | 0.25 | 0.12 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
ml engineer | 0.32 | 0.59 | 0.59 | 0.82 | 0.27 | 0.18 | 0.18 | 0.00 | 0.41 | 0.32 | 0.00 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 |
other data prof. | 1.00 | 0.62 | 0.62 | 0.25 | 0.25 | 0.50 | 0.62 | 0.25 | 0.00 | 0.00 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
other scientist | 0.45 | 0.02 | 0.02 | 0.08 | 0.10 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 |
Great! Now we can clearly see that for the most role types, Excel is required most often, followed by Tableau, SQL and Python.
It’s important to remember that these may be different from the overall importance of each skill. This is because each position type has a different count, so taking the mean of means is going to be different than the overall mean. Still, it makes information easier to read in this tibble when it is ordered in such a way.
But let’s also look at the overall ‘importance’ of each skill to compare.
skill_count <- groupby_role %>%
summarise(
Excel = mean(excel),
Tableau = mean(tableau),
SQL = mean(sql),
Python = mean(python),
AWS = mean(aws),
Spark = mean(spark),
Hadoop = mean(hadoop),
BI = mean(bi),
Tensor = mean(tensor),
Scikit = mean(scikit),
Mongo = mean(mongo),
Pytorch = mean(pytorch),
SAS = mean(sas),
GoogleAn = mean(google_an),
Keras = mean(keras),
Flink = mean(flink)) %>%
gather( "category", "mean", -role_type)
ggplot(skill_count,aes(category, mean)) +
geom_col(aes(fill = category), position = "dodge",)+ facet_wrap(.~role_type, ncol=3) +
theme(legend.position="none")+
labs(x = "Skill",
y = "Mean Required by Role",
title = "Skill Requirements by Role") +
theme(axis.text.x = element_text(angle = 45,
hjust=1,size = 8,))
Examining our dataset by timezine can help us uncover locational differences in roles and salaries.
by_timezone <- group_by(jobs, timezone)
listings_by_timezone_bar <- ggplot(jobs, aes(x = timezone, fill=timezone)) +
geom_bar() +
labs(x = "Timezone",
y = "Listing Count",
title = "Listing Count by Timezone")
listings_by_timezone_bar
salary_by_timezone_box <- ggplot(jobs, aes(x = timezone, y = avg_salary, fill= timezone)) + geom_boxplot() +
labs(x = "Timezone",
y = "Average Salary",
title = "Boxplots of Average Salary by Timezone")+
theme(legend.position="none")
salary_by_timezone_box
# Histograms of salary faceted by timezone
histos_salary_timezone <- ggplot(jobs, aes(x = avg_salary, color = 1, fill = timezone)) +
geom_histogram(binwidth = 10) +
labs(x = "Average Salary") + facet_wrap(~timezone, nrow = 3) +
theme(legend.position="none")
histos_salary_timezone
I’m interested in looking at jobs in the Eastern timezone
, so I’m going to filter the grouped dataframe by_timezome
by Eastern.
# Filter by_timezone by timezone="Eastern"
eastern_timezone_filter <- filter(jobs, timezone=="Eastern")
Which states in the Eastern timezone have the highest listing counts?
count_by_state_eastern <- filter(group_by(jobs, state), timezone=="Eastern") %>%
summarize(Count=n()) %>%
arrange(desc(Count))
kable(count_by_state_eastern, "simple")
state | Count |
---|---|
MA | 103 |
NY | 69 |
VA | 41 |
MD | 35 |
PA | 33 |
NC | 21 |
NJ | 17 |
FL | 16 |
OH | 14 |
DC | 11 |
IN | 10 |
DE | 6 |
GA | 6 |
KY | 6 |
MI | 6 |
CT | 5 |
RI | 1 |
SC | 1 |
And let’s again look at the boxplots of salaries by roles within the Eastern timezone
.
eastern_salary_by_role_box <- ggplot(data=eastern_timezone_filter, aes(fill=role_type)) +
geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
y=avg_salary)) + coord_flip() +
labs(x = "Role Type",
y = "Average Salary",
title = "Boxplots of Average Salary by Role in Eastern Timezone") +
theme(legend.position="none")
eastern_salary_by_role_box
Overall, the trend of these boxplots are the same as those in “Boxplots of Average Salary by Role.”
We’ve got the variable value_of_dollar
, so we can use this to standardize the salaries in the dataset to get a better idea of how valuable a salary is based on the state of the job. We will make a new column in the dataframe called std_salary
that multiplies avg_salary
by value_of_dollar
.
jobs$std_salary <- jobs$avg_salary*jobs$value_of_dollar
Looking at boxplots of std_salary
by timezone:
std_salary_by_timezone_box <- ggplot(jobs, aes(x = timezone, y = std_salary, fill=timezone)) + geom_boxplot() +
labs(x = "Timezone",
y = "Standardized Salary",
title = "Boxplots of Standardized Salary by Timezone")+
theme(legend.position="none")
std_salary_by_timezone_box
Here we see that the median salaries between the timezones are much closer together, which makes sense since they have all been standardized.
boxplots_std_sal_job_title <- ggplot(data=jobs, aes(fill=role_type)) +
geom_boxplot(mapping = aes(x= reorder(role_type, std_salary, FUN= median),
y=avg_salary)) +
coord_flip() +
labs(x = "Role Type", y = "Average Salary",
title = "Boxplots of Standardized Salary by Role")+
theme(legend.position="none")
boxplots_std_sal_job_title
Here’s a table with the top 10 un-standardized avg_salary
by state
:
# Chart same as std_sal_by_state but with unstandardized salaries.
by_state <- group_by(jobs, state)
avg_sal_by_state <- summarize(by_state,
avg_salary=round(mean(avg_salary),2))
avg_sal_by_state <- avg_sal_by_state %>%
arrange(desc(avg_salary))
kable(head(avg_sal_by_state, 10),"simple")
state | avg_salary |
---|---|
CA | 123.75 |
IL | 116.66 |
DC | 110.18 |
MA | 107.50 |
NJ | 104.56 |
MI | 100.25 |
RI | 100.00 |
NY | 99.46 |
NC | 98.45 |
MD | 97.36 |
Compare that with the table of the top standardized salaries by state.
std_sal_by_state <- summarize(by_state,
std_salary=round(mean(std_salary),2))
std_sal_by_state <- std_sal_by_state %>%
arrange(desc(std_salary))
kable(head(std_sal_by_state, 10))
state | std_salary |
---|---|
IL | 130.66 |
NJ | 117.11 |
KY | 110.58 |
NY | 109.41 |
MI | 108.27 |
CA | 107.66 |
NC | 105.34 |
DC | 103.57 |
IN | 101.14 |
RI | 101.00 |
Interesting! California slides down from 1st place to 5th. Kentucky moves up from 11th to 3rd. These, crudely put, are the top 10 states for getting the best value out of your salary.
count_by_state <- by_state %>%
summarize(Count=n()) %>%
arrange(desc(Count))
kable(count_by_state, "simple")
state | Count |
---|---|
CA | 151 |
MA | 103 |
NY | 69 |
VA | 41 |
IL | 40 |
MD | 35 |
PA | 33 |
TX | 28 |
NC | 21 |
WA | 21 |
NJ | 17 |
FL | 16 |
OH | 14 |
TN | 13 |
CO | 11 |
DC | 11 |
IN | 10 |
UT | 10 |
WI | 10 |
AZ | 9 |
MO | 9 |
AL | 8 |
DE | 6 |
GA | 6 |
KY | 6 |
MI | 6 |
CT | 5 |
IA | 5 |
LA | 4 |
NE | 4 |
OR | 4 |
KS | 3 |
NM | 3 |
ID | 2 |
MN | 2 |
RI | 1 |
SC | 1 |
I have heard that the job market is hot in North Carolina, so let’s check out what kind of jobs are listed there.
# Filter jobs by state="NC"
filter_by_state_nc <- filter(jobs, state=="NC")
kable(filter_by_state_nc[,c(1,2,4,21)])
index | job_title | company_name | avg_salary |
---|---|---|---|
104 | Data Engineer | CapTech | 95.5 |
118 | Senior Data Engineer | Red Ventures | 134.0 |
250 | Data Scientist | Genworth | 102.0 |
318 | Senior Data Scientist | Brighthouse Financial | 124.5 |
356 | Scientist - Analytical Services | Reynolds American | 99.5 |
362 | Senior Scientist - Regulatory Submissions | Reynolds American | 117.5 |
377 | Senior Data Engineer | Genesys | 112.5 |
411 | Senior Scientist - Toxicologist - Product Integrity (Stewardship) | Reynolds American | 74.0 |
413 | Senior Research Analytical Scientist-Non-Targeted Analysis | RTI International | 65.5 |
419 | Senior Scientist - Biostatistician | Reynolds American | 80.5 |
448 | Senior Data Scientist | Brighthouse Financial | 124.5 |
507 | Data Engineer - Consultant (Charlotte Based) | Clarity Insights | 85.5 |
531 | Scientist - Analytical Services | Reynolds American | 99.5 |
540 | Senior Scientist - Regulatory Submissions | Reynolds American | 117.5 |
567 | Senior Data Engineer | Genesys | 112.5 |
619 | Senior Scientist - Toxicologist - Product Integrity (Stewardship) | Reynolds American | 74.0 |
621 | Senior Research Analytical Scientist-Non-Targeted Analysis | RTI International | 65.5 |
633 | Senior Scientist - Biostatistician | Reynolds American | 80.5 |
684 | Data Engineer - Consultant (Charlotte Based) | Clarity Insights | 85.5 |
712 | Scientist - Analytical Services | Reynolds American | 99.5 |
722 | Senior Scientist - Regulatory Submissions | Reynolds American | 117.5 |
What’s the ‘hottest’ industry in North Carolina based on our data?
filter_by_state_nc %$% industry %>% table %>% which.max %>% names
## [1] "Consumer Products Manufacturing"
What’s the ‘hottest’ sector in North Carolina based on our data?
filter_by_state_nc %$% sector %>% table %>% which.max %>% names
## [1] "Manufacturing"
What’s the ‘hottest’ data role in North Carolina?
filter_by_state_nc %$% role_type %>% table %>% which.max %>% names
## [1] "other scientist"
What do the avg_salary
distributions look like by sector
in North Carolina?
nc_salary_by_sector_box <- ggplot(data=filter_by_state_nc, aes(fill=sector)) +
geom_boxplot(mapping = aes(x= reorder(sector, avg_salary,
FUN= median),
y=avg_salary)) + coord_flip() +
labs(x = "Sector",
y = "Average Salary",
title = "Boxplots of Average Salary by Sector in NC")+
theme(legend.position="none")
nc_salary_by_sector_box
What do the avg_salary
distributions look like by role_type
in North Carolina?
nc_salary_by_role_box <- ggplot(data=filter_by_state_nc, aes(fill=role_type)) +
geom_boxplot(mapping = aes(x= reorder(role_type, avg_salary, FUN= median),
y=avg_salary)) + coord_flip() +
labs(x = "Role Type",
y = "Average Salary",
title = "Boxplots of Average Salary by Role in NC")+
theme(legend.position="none")
nc_salary_by_role_box
# What skills are required most in the roles in North Carolina?
skill_count_nc <- filter_by_state_nc %>%
group_by(role_type) %>%
summarise(
Excel = mean(excel),
Tableau = mean(tableau),
SQL = mean(sql),
Python = mean(python),
AWS = mean(aws),
Spark = mean(spark),
Hadoop = mean(hadoop),
BI = mean(bi),
Tensor = mean(tensor),
Scikit = mean(scikit),
Mongo = mean(mongo),
Pytorch = mean(pytorch),
SAS = mean(sas),
GoogleAn = mean(google_an),
Keras = mean(keras),
Flink = mean(flink)) %>%
gather( "category", "mean", -role_type)
ggplot(skill_count_nc,aes(category, mean)) +
geom_col(aes(fill = category), position = "dodge",)+ facet_grid(.~role_type) +coord_flip()+
theme(legend.position="none")+
labs(x = "Skill",
y = "Mean Required by Role",
title = "Skill Requirements by Role in NC")
Between data engineers, data scientists and other scientists in NC:
What are the other scientist
jobs in North Carolina?
other_scientist_nc <- filter(jobs, state == "NC" & role_type=="other scientist") %>%
arrange(desc(std_salary))
kable(other_scientist_nc[,c(1,2,4,21)], "simple")
index | job_title | company_name | avg_salary |
---|---|---|---|
362 | Senior Scientist - Regulatory Submissions | Reynolds American | 117.5 |
540 | Senior Scientist - Regulatory Submissions | Reynolds American | 117.5 |
722 | Senior Scientist - Regulatory Submissions | Reynolds American | 117.5 |
356 | Scientist - Analytical Services | Reynolds American | 99.5 |
531 | Scientist - Analytical Services | Reynolds American | 99.5 |
712 | Scientist - Analytical Services | Reynolds American | 99.5 |
419 | Senior Scientist - Biostatistician | Reynolds American | 80.5 |
633 | Senior Scientist - Biostatistician | Reynolds American | 80.5 |
411 | Senior Scientist - Toxicologist - Product Integrity (Stewardship) | Reynolds American | 74.0 |
619 | Senior Scientist - Toxicologist - Product Integrity (Stewardship) | Reynolds American | 74.0 |
413 | Senior Research Analytical Scientist-Non-Targeted Analysis | RTI International | 65.5 |
621 | Senior Research Analytical Scientist-Non-Targeted Analysis | RTI International | 65.5 |
Oh wow! Most of these jobs are for a company called Reynolds American, which is a tobacco company. Positions for toxicologists does not seem like it should be in this dataset.
These jobs don’t look appealing to me. But there were two other role types that do look interesting to me: data engineer
and data modeler
. I want to look at these positions in the Eastern timezone and sort them by std_salary
.
engineer_modeler_eastern <- filter(jobs, timezone == "Eastern" & xor(role_type=="data engineer", role_type=="data modeler")) %>%
arrange(desc(std_salary))
kable(engineer_modeler_eastern[,c(1,2,4,21)],"simple")
index | job_title | company_name | avg_salary |
---|---|---|---|
15 | Data Engineer I | Audible | 146.0 |
118 | Senior Data Engineer | Red Ventures | 134.0 |
340 | Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence | Takeda Pharmaceuticals | 154.5 |
500 | Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence | Takeda Pharmaceuticals | 154.5 |
673 | Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence | Takeda Pharmaceuticals | 154.5 |
358 | Sr. Data Engineer | Moser Consulting | 122.5 |
534 | Sr. Data Engineer | Moser Consulting | 122.5 |
715 | Sr. Data Engineer | Moser Consulting | 122.5 |
137 | Data Engineer | Gridiron IT | 132.5 |
584 | Big Data Engineer | Peraton | 118.5 |
460 | Data Engineer | GNY Insurance Companies | 110.5 |
377 | Senior Data Engineer | Genesys | 112.5 |
567 | Senior Data Engineer | Genesys | 112.5 |
39 | Data Engineer | Lancer Insurance | 106.0 |
121 | Data Engineer | Upside Business Travel | 122.0 |
315 | Data Engineer | SpringML | 109.0 |
443 | Data Engineer | SpringML | 109.0 |
595 | Data Engineer | Maxar Technologies | 105.5 |
104 | Data Engineer | CapTech | 95.5 |
299 | Big Data Engineer | SpringML | 90.5 |
566 | Senior Data Engineer | Novetta | 97.0 |
351 | Data Engineer | MetroStar Systems | 109.0 |
523 | Data Engineer | MetroStar Systems | 109.0 |
703 | Data Engineer | MetroStar Systems | 109.0 |
93 | Data Modeler | IT Concepts | 106.5 |
295 | Data Engineer | ManTech | 92.0 |
692 | Data Architect / Data Modeler | Medidata Solutions | 86.5 |
436 | Data Engineer | Numeric, LLC | 91.5 |
288 | Data Engineer, Data Engineering and Artifical Intelligence | Takeda Pharmaceuticals | 102.5 |
507 | Data Engineer - Consultant (Charlotte Based) | Clarity Insights | 85.5 |
684 | Data Engineer - Consultant (Charlotte Based) | Clarity Insights | 85.5 |
84 | Data Engineer | Excella Consulting | 87.0 |
41 | Data Engineer | Persivia | 98.5 |
383 | Data Engineer | BRMi | 85.0 |
580 | Data Engineer | BRMi | 85.0 |
110 | Data Engineer | Sapphire Digital | 78.0 |
625 | Data Engineer | Carilion Clinic | 81.0 |
307 | Data Engineer | MassMutual | 92.0 |
423 | Data Engineer | MassMutual | 92.0 |
568 | Data Modeler (Analytical Systems) | MassMutual | 91.5 |
367 | IT - Data Engineer II | Arbella Insurance | 90.0 |
547 | IT - Data Engineer II | Arbella Insurance | 90.0 |
728 | IT - Data Engineer II | Arbella Insurance | 90.0 |
144 | Data Engineer | AVANADE | 87.0 |
151 | Data Engineer | Cogo Labs | 84.5 |
213 | Data Engineer | Cogo Labs | 84.5 |
338 | Data Engineer | Guidepoint | 64.5 |
497 | Data Engineer | Guidepoint | 64.5 |
169 | Data Engineer | IZEA | 71.5 |
256 | Data Engineer | IZEA | 71.5 |
317 | Associate Data Engineer | EAB | 65.5 |
442 | Associate Data Engineer | EAB | 65.5 |
379 | Data Modeler - Data Solutions Engineer | Liberty Mutual Insurance | 51.5 |
569 | Data Modeler - Data Solutions Engineer | Liberty Mutual Insurance | 51.5 |
Now these jobs look interesting.
What kind of skills do they require?
# What skills are required most in the roles in North Carolina?
filterby_eastern_tz <- filter(jobs, timezone == "Eastern")
skill_count_eastern_mod_eng <- filterby_eastern_tz %>%
group_by(role_type) %>%
filter(xor(role_type=="data modeler",role_type=="data engineer")) %>%
summarise(
Excel = mean(excel),
Tableau = mean(tableau),
SQL = mean(sql),
Python = mean(python),
AWS = mean(aws),
Spark = mean(spark),
Hadoop = mean(hadoop),
BI = mean(bi),
Tensor = mean(tensor),
Scikit = mean(scikit),
Mongo = mean(mongo),
Pytorch = mean(pytorch),
SAS = mean(sas),
GoogleAn = mean(google_an),
Keras = mean(keras),
Flink = mean(flink)) %>%
gather( "category", "mean", -role_type)
ggplot(skill_count_eastern_mod_eng,aes(category, mean)) +
geom_col(aes(fill = category), position = "dodge",)+ facet_grid(.~role_type) +coord_flip()+
theme(legend.position="none")+
labs(x = "Skill",
y = "Mean Required by Role",
title = "Skill Requirements of Eastern Data Engineers and Modelers")
Between data engineers and data modelers in the Eastern timezone:
It’s amazing how powerful of a tool EDA is to aid in decision-making. We were able to ask questions of the data and receive answers with data to back them up. Overall, we were able to
role_type
, timezone
, state
, and sector
role_type
overall and by timezone
and state
.