A Collection of Data Science Take-home Challenges Review
Tackling the Take-Home Challenge
An example EDA challenge with Python and Jupyter Notebooks
A popular take-home assignment for data positions involves exploratory data analysis, or EDA. You are given a dataset or three and told to analyze the data.
A company may give this type of assignment to gain insight into your thought process. They want to see how you tackle a new dataset, and of course to make sure you have the technical skills they require.
While open-ended challenges can be great because they allow you to showcase your strengths and be creative, it can be hard to know where to even start.
Often they say you can use any technology you like. It makes sense to use a language that you're comfortable with and that the company you're interviewing with uses.
Any time I'm given a choice here, I use Python and Jupyter notebooks.
Jupyter notebooks make it easy to show your thought process and document your work in an easy-to-present format.
An important point to remember is that successfully completing the take-home challenge is usually followed by a discussion of your work if the company decides to proceed. It is important to be able to explain your thought process and be comfortable talking about your code in follow-up interviews.
Here we will work through a sample take-home challenge.
The Challenge
In this challenge, the company gives us a very open-ended task: explore some data.
While getting a flexible assignment can be an awesome way for us to highlight our strengths (and maybe avoid our weaknesses), it can also be challenging to get started with no clear goal to achieve.
The datasets we will use here are the widget factory datasets we created in this article, where we worked through generating fake data with Python.
For our sample challenge, we have a markdown file with instructions:
The markdown file is very helpful to get a feel for what kind of data we'll be working with. It includes data definitions and a very open-ended instruction.
Because there are essentially no constraints, we will use Python and Jupyter notebooks.
Step Zero — Set Up for Success
I have a directory on my computer, coding_interviews
that contains every take-home challenge I've completed in my job searches. Within this directory, I have subdirectories with the names of each company for which I've completed an assignment.
I like keeping the old code. Many times I've gone back to one notebook or another, knowing that I've done something similar in the past, and have been able to modify it for the current task.
Before getting started, let's create a widget_factory
directory for our challenge and move all the files into it for ease of access and organization.
Getting Started — Read in the Data and Ask Basic Questions
The first step I like to take is to read in the data and ask easy questions about each dataset individually:
- How much data do I have?
- Are there missing values?
- What are the data types?
Let's explore our datasets:
import pandas as pd

# read in worker data
worker_df = pd.read_csv('data/workers.csv')
print(worker_df.shape)
worker_df.head()
Because we moved all the files to our widget_factory
directory, we can use relative paths to read in the data.
I like to use relative paths in my coding assignments for a few reasons:
- It makes the code look clean and neat.
- The reviewer won't be able to tell if you're on a Mac or PC.
- I like to think you could get bonus points for making it simple for the reviewer to run your code without needing to change the path (see the example after this list).
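As a quick illustration of the difference, here is a minimal sketch; the absolute path shown is a made-up example, not a real location.
# relative path: works for anyone who runs the notebook from the project directory
worker_df = pd.read_csv('data/workers.csv')
# absolute path: only works on my machine (hypothetical path, for illustration only)
# worker_df = pd.read_csv('/Users/me/coding_interviews/widget_factory/data/workers.csv')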
We have the files read in, checked the shape, and printed a sample. Other steps I like to take right away are to check the data types, count the unique values, and check for null values.
# check the number of unique values in each column
for i in list(worker_df.columns):
    print(f'Unique {i}: {worker_df[i].nunique()}')

# checking for null values
worker_df.isnull().sum()
We have a relatively clean dataset with no missing values.
# statistics about the numerical data
# 'Worker ID' is the only numerical column - it is an identity column according to the readme
worker_df.describe()
Our only numeric column is Worker ID
which is an identity column.
# checking column types
# the 'Hire Date' column isn't a date - we'll need to fix that
worker_df.info()

# convert 'Hire Date' to datetime
worker_df['Hire Date'] = pd.to_datetime(worker_df['Hire Date'])

# check that it worked
print(worker_df.info())

# check the date range of the dataset
print(f"Min Date: {worker_df['Hire Date'].min()}")
print(f"Max Date: {worker_df['Hire Date'].max()}")
For the widget dataset we follow the same steps as above. The code for these steps can be found in the full notebook on GitHub.
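For reference, a minimal sketch of those same checks on the widget data might look like the following; the file name data/widgets.csv is an assumption based on how we organized the challenge files.
# read in widget data (file name assumed from our widget_factory directory layout)
widget_df = pd.read_csv('data/widgets.csv')
print(widget_df.shape)

# repeat the quick checks we ran on the worker data
for col in widget_df.columns:
    print(f'Unique {col}: {widget_df[col].nunique()}')
print(widget_df.isnull().sum())
widget_df.head()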
Plotting — Visualizing the Data
After answering the easy questions about the data, the next step is visualization.
I like to start with more basic visualizations. For me this means working with one variable at a time and then exploring relationships between the features.
Our worker dataset contains five features. Let's work through each feature individually.
We know Worker ID
is an identification column from the readme
file. We also confirmed this when we checked unique values. With a unique value for each row, we can safely skip visualizing this column.
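A one-line sanity check of that claim could look like this (a sketch, not from the original notebook):
# sanity check: every row should have its own Worker ID
assert worker_df['Worker ID'].nunique() == len(worker_df)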
While Worker Name
only has 4820 unique values, I think it's safe to assume this is also an identity column. We could argue that it may be interesting to see which workers have the same name, or to check for possible duplicate records, but we'll skip further exploration of this feature for now.
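If we did want to peek at repeated names, a quick sketch might be:
# list any worker names that appear more than once (shared names or possible duplicates)
name_counts = worker_df['Worker Name'].value_counts()
print(name_counts[name_counts > 1])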
The next feature we have is Hire Date
. Here we can plot the most common hire dates to explore whether workers have hire dates in common.
To accomplish this, first we need to count the number of unique dates. While there are many ways to do this, I like using Counter
.
From the Python documentation, Counter
creates a dictionary "where elements are stored as dictionary keys and their counts are stored as dictionary values". This will make it easy to create a bar chart to display our counts.
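For a quick sense of what Counter returns, here is a toy sketch with made-up dates (not from the challenge data):
# toy example of Counter (made-up values)
from collections import Counter

dates = ['2020-01-01', '2020-01-01', '2020-03-15']
counts = Counter(dates)
print(counts)                 # Counter({'2020-01-01': 2, '2020-03-15': 1})
print(counts.most_common(1))  # [('2020-01-01', 2)]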
After getting our counts, we can then use Seaborn to create a barplot.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# visualize hire date
# first count all unique dates
hire_dates = Counter(worker_df['Hire Date'].dt.date)

# get dates and date counts
common_dates = [d[0] for d in hire_dates.most_common(15)]
common_counts = [d[1] for d in hire_dates.most_common(15)]

# https://stackoverflow.com/questions/43214978/seaborn-barplot-displaying-values
# function to show values on bars
def show_values_on_bars(axs):
    def _show_on_single_plot(ax):
        for p in ax.patches:
            _x = p.get_x() + p.get_width() / 2
            _y = p.get_y() + p.get_height()
            value = '{:.0f}'.format(p.get_height())
            ax.text(_x, _y, value, ha="center")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)

# plot the most common hire dates
fig, ax = plt.subplots()
g = sns.barplot(x=common_dates, y=common_counts, palette='colorblind')
g.set_yticklabels([])

# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.title('Most Common Hire Dates', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
fig.autofmt_xdate()
plt.show()
When creating visualizations for any presentation, I like to keep my charts clean by removing unneeded labels and lines.
I also make sure to use a consistent color palette; here we're using Seaborn's colorblind palette.
In a take-home challenge, it can be the small details that make your assignment stand out! Make sure your charts have appropriate titles and labels!
After each plot I like to use a markdown cell to remark on any key observations that can be drawn. Even if these observations are simple, a short, meaningful explanation can help your EDA look complete and well thought out.
We can see from the above bar chart that workers are frequently hired alone and not in cohorts.
We could also explore common hiring months and years to determine whether there is any pattern in hire dates.
# visualize hire year
# first count all unique hire years
hire_dates = Counter(worker_df['Hire Date'].dt.year)

# get years and year counts
common_dates = [d[0] for d in hire_dates.most_common()]
common_counts = [d[1] for d in hire_dates.most_common()]

# plot workers hired by year
fig, ax = plt.subplots()
g = sns.barplot(x=common_dates, y=common_counts, palette='colorblind')
g.set_yticklabels([])

# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.title('Workers by Year Hired', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
fig.autofmt_xdate()
plt.show()
Here we can see workers hired by year. The years with the fewest hires are the first and last years of the dataset. Above, when we checked the date range of the data, we found the minimum date to be 1991-07-11 and the maximum date 2021-07-08, which explains the lower number of hires in those years.
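The same idea applies to hire months; a minimal sketch (not part of the original notebook) reuses the show_values_on_bars helper from above:
# sketch: count and plot workers hired by calendar month
hire_months = Counter(worker_df['Hire Date'].dt.month)
common_months = [d[0] for d in hire_months.most_common()]
month_counts = [d[1] for d in hire_months.most_common()]

fig, ax = plt.subplots()
g = sns.barplot(x=common_months, y=month_counts, palette='colorblind')
g.set_yticklabels([])
show_values_on_bars(ax)
sns.despine(left=True, bottom=True)
plt.title('Workers by Month Hired', fontsize=30)
plt.show()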
Moving on to Worker Status
.
# visualize status feature
fig, ax = plt.subplots()
g = sns.countplot(x=worker_df['Worker Status'],
                  order=worker_df['Worker Status'].value_counts().index,
                  palette='colorblind')
g.set_yticklabels([])

# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
plt.title('Workers by Worker Status')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()
Here we can see that a majority of workers are full time. We have fewer workers categorized as part time and per diem.
And finally, let's explore the team feature.
# visualize team feature
sns.countplot(y=worker_df['Team'],
              order=worker_df['Team'].value_counts().index,
              palette='colorblind')
sns.despine(left=True, bottom=True)
plt.title('Workers by Team')
plt.show()
It can be helpful to begin asking questions about the data. Does one feature relate to another that we have?
Here we see that the teams are of similar sizes. MidnightBlue has the most members and Crimson the fewest. It could be interesting to explore how the teams are assigned. Could it be based on job title, location, or worker status?
Now we can start working with groups of features. Let's visualize team by worker status.
# visualize team by worker status
fig, ax = plt.subplots()
g = sns.countplot(x=worker_df['Team'],
                  hue=worker_df['Worker Status'],
                  palette='colorblind')
g.set_yticklabels([])
show_values_on_bars(ax)

# position the legend so that it doesn't cover any bars
leg = plt.legend(loc='upper right')
plt.draw()

# get the bounding box of the original legend
# note: inverse_transformed is removed in newer matplotlib;
# bb.transformed(ax.transAxes.inverted()) is the modern equivalent
sns.despine(left=True, bottom=True)
bb = leg.get_bbox_to_anchor().inverse_transformed(ax.transAxes)

# modify the location of the legend
xOffset = 0.1
bb.x0 += xOffset
bb.x1 += xOffset
leg.set_bbox_to_anchor(bb, transform=ax.transAxes)

plt.title('Workers by Team and Worker Status')
plt.ylabel('')
plt.xlabel('')
plt.show()
Here we can see a relatively equal distribution of workers to teams by worker status. This suggests that workers are not assigned to teams based on their status.
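One way to back up that observation with numbers is a normalized crosstab; a quick sketch:
# sketch: share of each worker status within each team
print(pd.crosstab(worker_df['Team'],
                  worker_df['Worker Status'],
                  normalize='index').round(2))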
This completes our worker dataset! Let's move on to the widget dataset.
We can create histograms of our numerical data:
# create histograms of all numerical data
# we know worker id is an identity column
# so removing it from this visualization
widget_df_hist = widget_df[['Step 1', 'Step 2', 'Step 3']]
widget_df_hist.hist()
sns.despine(left=True, bottom=True)
plt.show()
Here we see histograms of each step in the widget making process.
For steps 1 and 3, it looks like a majority of workers complete the steps quickly, and there are long tails where the task takes much longer to complete. The long tails could be due to errors in recording the data, or due to workers having trouble completing the steps. This would be interesting to explore further.
Step 2 appears to have a normal distribution. Could step 2 be an easier, or more automated, step in the widget making process?
We've successfully explored all our features in both datasets! While we could stop here, this is not the most useful or interesting analysis.
The next step we will take is to merge our datasets to explore the relationships between features.
Going Further — Combining Datasets
Our next step is to combine our worker and widget datasets. This demonstrates our ability to merge datasets, a crucial skill for any data job.
We will merge the datasets on Worker ID
as that is the common feature between the two tables:
# merge dataframes together
merged_df = pd.merge(worker_df,
                     widget_df,
                     how='inner',
                     on='Worker ID')
print(merged_df.shape)
merged_df.head()
With the data successfully merged we can continue plotting. Let's plot item count by team:
# visualize item count by team
fig, ax = plt.subplots()
g = sns.countplot(x=merged_df['Team'],
                  order=merged_df['Team'].value_counts().index,
                  palette='colorblind')
g.set_yticklabels([])

# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
plt.title('Item Count by Team')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()
The MidnightBlue team created the most items, while Crimson created the fewest. We can infer this is related to the number of workers assigned to each team. Above, we found MidnightBlue is the largest team and Crimson the smallest.
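A small sketch can put team sizes and item counts side by side to support that inference:
# sketch: compare the number of workers per team with the number of items per team
team_sizes = worker_df['Team'].value_counts().rename('workers')
team_items = merged_df['Team'].value_counts().rename('items')
print(pd.concat([team_sizes, team_items], axis=1))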
We can also explore item count by worker status:
# visualize item count by worker status
fig, ax = plt.subplots()
g = sns.countplot(x=merged_df['Worker Status'],
                  order=merged_df['Worker Status'].value_counts().index,
                  palette='colorblind')
g.set_yticklabels([])

# show values on the bars to make the chart more readable and cleaner
show_values_on_bars(ax)
plt.title('Item Count by Worker Status')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()
Here we see item count by worker status. As expected, full time workers created the most items.
We can also explore item counts by individual workers. This can show the most and least productive workers. Let's look at the workers with the lowest item counts. The plot below uses a grouped_df of item counts per worker, which is built in the full notebook.
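A minimal sketch of how grouped_df could be constructed, assuming the widget table has an Item Number column identifying each widget:
# sketch of grouped_df: item counts per worker
# (assumes the widget table has an 'Item Number' column)
grouped_df = (merged_df
              .groupby(['Worker Name', 'Worker Status'])['Item Number']
              .count()
              .reset_index(name='Item Number Count'))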
# visualize workers with lowest item counts
# first create a temporary df
tmp = grouped_df.sort_values(by='Item Number Count',
                             ascending=True).head(20)

fig, ax = plt.subplots()
g = sns.barplot(y=tmp['Worker Name'],
                x=tmp['Item Number Count'],
                palette='colorblind')
plt.title('Workers with Lowest Item Count')
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()
Here we see the 20 workers with the lowest item count. It would be interesting to explore the worker status of these workers. Are they part time or per diem? If we had data relating to time off or hours worked, it would be interesting to explore whether there is any correlation.
For completeness, we can check the worker status of the plotted workers by printing out the full tmp
dataframe we created for plotting and visually checking.
Another, more succinct, option is to use value_counts()
to get the count of unique values for the Worker Status
column.
# check the worker status values for the workers plotted above
tmp['Worker Status'].value_counts()

>>> Per Diem    20
>>> Name: Worker Status, dtype: int64
We can see the workers creating the fewest items are all per diem. A logical assumption here is that these workers may have worked the fewest shifts or hours, as per diem workers are often used on an as-needed basis.
Let's now exclude per diem workers and explore the distributions of each step in the widget making process for full and part time workers:
# create a temp df with only part and full time workers
tmp = merged_df.loc[merged_df['Worker Status'].isin(
    ['Full Time', 'Part Time'])]

# list of steps to loop over
steps = ['Step 1', 'Step 2', 'Step 3']

# create a plot for each step
for step in steps:
    fig, ax = plt.subplots()
    g = sns.violinplot(x='Team',
                       y=step,
                       hue='Worker Status',
                       split=True,
                       data=tmp)
    sns.despine(left=True, bottom=True)
    plt.xlabel('')
    plt.ylabel('Time')
    plt.title(f'{step} by Worker Status', fontsize=20)
    plt.show()
For step 1, we see a similar distribution for both full and part time workers. All distributions have long tails. This could be due to genuinely slow step completion times or possible data collection errors. This could be interesting to explore further.
The distributions for step 2 appear normal. For every team, both full and part time workers' times for step 2 resemble a bell curve.
In step 3, we can see very long tails across all groups. It appears that this step is generally completed quickly, with a few outliers taking longer.
Across all steps, we don't see any major differences between groups in the violin plots. This suggests that full time and part time workers across all teams are mostly consistent in the time it takes to make widgets.
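If we wanted to start quantifying those long tails, a quick sketch (using the tmp dataframe from above) could count how many observations fall far out in each step's distribution:
# sketch: count observations above the 99th percentile for each step
for step in ['Step 1', 'Step 2', 'Step 3']:
    cutoff = tmp[step].quantile(0.99)
    n_tail = (tmp[step] > cutoff).sum()
    print(f'{step}: {n_tail} observations above the 99th percentile ({cutoff:.1f})')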
A limitation we can note here is that we have no time units for the step times. Our instruction file noted that the values in these columns are times, but did not give any units.
While we by no means explored all possible data visualizations, we have completed a thorough exploratory analysis and will begin wrapping up this challenge here.
Wrapping Up — Conclusions, Limitations, and Further Exploration
To finish any take-home assignment, adding a short conclusion section is a great idea. I like to touch on three areas here:
- Conclusions
- Limitations
- Further Exploration
Are there any conclusions we can draw from the data? What limitations do we observe about the data? What would be interesting to explore further if we had more time or data?
This section doesn't need to be long; a few sentences on each topic should be plenty.
Conclusions
We found that the MidnightBlue team has the most workers and also created the greatest number of widgets. Crimson, with the fewest members, created the lowest number of widgets.
For the timeframe of this dataset, it appears that the number of widgets created is correlated with the number of team members.
We also found that the workers who created the fewest widgets are all per diem. This suggests per diem workers may work fewer hours than part time and full time workers.
Full time and part time workers across all teams appear to be similar in their widget creation times.
Limitations
The data given here does not include the time frame of the data collection period. We are only able to analyze the data as a single snapshot in time, and are not able to explore how widget creation may vary over time.
As we noted above, we do not know the time unit for the data in the widget table.
Further Exploration
It would be interesting to explore widget creation over time. Do teams with the most workers always create the most widgets?
We could also further explore the timings of the widget creation steps. It would be interesting to see whether they change over time, or to explore any potential outliers.
One final thing I like to do here is restart my notebook and run all cells from top to bottom. This shows attention to detail, and perhaps more importantly, ensures that the notebook will run in order, error free.
If a reviewer tries to run our notebook, we want to be confident that it will run in order with no errors!
Summary
Here we worked through a sample interview take-home challenge from start to finish.
We began by reading in the data and asking simple questions.
- Do we have outliers?
- Do we have missing values?
- What type of data do we have?
- What does the data look like?
We then moved on to plotting, first visualizing single features, then exploring the relationships among the features.
By manipulating the data, merging it, and creating visualizations, we were able to showcase our Python and data exploration skills.
The full notebook can be found on GitHub.
Best of luck on your next take home challenge!
Source: https://towardsdatascience.com/tackling-the-take-home-challenge-7c2148fb999e