This is probably the most common question I get asked, besides "How did you land your job in Data Science/Data Analytics?" I will write another blog about my job-hunting journey, so this post will focus on how to get industry exposure without that gig yet.
I gave a talk on this topic before at DIPD @ UCLA, the student organization I co-founded, which is dedicated to increasing diversity and inclusion in the fields of Product and Data. Here, I aim to expand on that talk and make the topic accessible to a broader audience.
With that said, I hope this post inspires more and more data enthusiasts to start their own blogs.
This may be a tough time for many of us, but it's also a prime time to turbocharge and level up your skill set in data science and analytics. If your employment has been impacted, treat the misfortune as a great opportunity to take a break, reflect, and kickstart your personal project: the kind of luxury that's hard to afford when time does not allow.
“When one door closes, another opens” — Alexander Graham Bell
Hardship does not determine who you are; it's your attitude and perseverance that define you. Let's get right into it!
Where to start?
Start small and scale up
Before starting any project, first narrow down your interests. This is your personal project, so you will have full autonomy over it. Find something that makes you tick and motivates you to devote your time!
There will be many challenges along the way that may discourage or sidetrack you from finishing the project; what keeps you going should be an analysis topic that strongly aligns with your interests. It does not have to be something out of this world. Ask yourself what is important to you and why others should care about it.
When I first started, I knew that I wholeheartedly cared about mental health and ways to gain more mindfulness. So I dug deeper into analyzing the top six guided meditation apps to understand which one would best suit my preferences.
Read, read, and read!
One of the most important lessons I learned through my research assistant position at CRESST UCLA is to balance the workload between analysis and literature review. In other words, find out what has been done in the past and figure out which additions or unique angles you can contribute on top of existing findings. My reading sources range from Medium, Analytics Vidhya, and statistics books to any relevant sources I can find on the internet.
Take my Subtle Couple Traits analysis, for example. Some work has been done on music-taste analysis via the Spotify API, but no one had really delved into movies yet. So I took the chance to explore the intersection of our couple's cult favorites in music and movies.
Finding the right toolbox
Now you get to the step where you need to figure out which data to collect and find the right tools for the job. This part resonates strongly with my industry experience as a data analyst: it is indeed the most challenging and time-consuming part.
My best tip for this stage of analysis is to ask a lot of practical questions and come up with hypotheses to answer or justify through the data. Also be mindful of the project's feasibility; if an approach proves infeasible, stay flexible and tweak it toward a more doable one.
Note that you can use the programming language you are most comfortable with 🙂 Both Python and R have their own advantages and great supporting data packages.
An example from a past project can crystallize this strategy. I was curious about the non-pharmaceutical factors that correlate with the suppression of COVID-19, so I listed all of the variables I could think of, such as weather, PPE, ICU beds, and quarantines, then began extensive research on open-source data sets.
“All models are wrong, but some are useful” — George Box
Since I did not have a background in public health, building predictive models for this type of pandemic data was a huge challenge. I first started with models I was familiar with, such as random forest or Bayesian ridge regression. However, I discovered that a pandemic typically follows the trend of a logistic curve, in which cases grow exponentially over a period of time until they hit the inflection point and level off. This relates to compartmental models in epidemiology. It took me almost two weeks to learn and apply this model to my analysis, but the result was mesmerizing. I eventually wrote a blog about it.
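To make the logistic-curve idea concrete, here is a minimal sketch of fitting one to cumulative case counts with `scipy.optimize.curve_fit`. The data below is synthetic, generated purely for illustration; it is not from my actual analysis.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    # K: final epidemic size, r: growth rate, t0: day of the inflection point
    return K / (1 + np.exp(-r * (t - t0)))

# Synthetic cumulative case counts (illustrative only)
t = np.arange(60)
true_curve = logistic(t, K=10_000, r=0.25, t0=30)
rng = np.random.default_rng(42)
observed = true_curve + rng.normal(0, 100, size=t.size)

# Fit the three parameters; p0 gives rough starting guesses
params, _ = curve_fit(logistic, t, observed,
                      p0=[observed.max(), 0.1, t.mean()])
K_hat, r_hat, t0_hat = params
print(f"Estimated final size: {K_hat:.0f}, inflection around day {t0_hat:.1f}")
```

The fitted inflection point `t0` is exactly the moment when exponential growth starts to level off, which is what makes this simple curve so much more informative for pandemic data than a generic regression model.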
If you are working in the Data Science/Analytics field, this is not new to you — “80% of a data scientist’s time consists of preparing (simply finding, cleansing, and organizing data), leaving only 20% to build models and perform analysis.”
The process of cleaning data may be cumbersome, but when you get it right, your analysis will be more valuable and significant. Here’s the typical process I take for my analysis workflow:
1) Collecting Data
2) Cleaning Data
- Detect outliers and anomalies
- Troubleshoot missing values
- Detect and handle imbalanced classes
- Re-format variable names and types
3) Project-based techniques
- (NLP) Sentiment analysis, POS tagging, topic modeling, BERT, etc.
- (Predictions) Classification/Regression model
- (Recommendation System) Collaborative Filtering, etc.
4) Write up insights and recommendations
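The cleaning steps above can be sketched in a few lines of pandas. The tiny DataFrame and its column names are made up for illustration; real projects will need rules tailored to their own data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data (column names invented for this example)
df = pd.DataFrame({
    "User Name": ["a", "b", "c", "d", "e"],
    "rating": [4.5, 4.0, np.nan, 5.0, 120.0],  # 120.0 is an obvious outlier
    "label": ["pos", "pos", "pos", "neg", "pos"],
})

# Detect outliers and anomalies: keep only ratings in the valid 0-5 range
df = df[df["rating"].isna() | df["rating"].between(0, 5)].copy()

# Troubleshoot missing values: impute with the median rating
df["rating"] = df["rating"].fillna(df["rating"].median())

# Check class balance before modeling
print(df["label"].value_counts())

# Re-format variable names
df = df.rename(columns={"User Name": "user_name"})
```

Each step is deliberately simple here; the point is that a written-down, repeatable cleaning pipeline is what makes the later analysis trustworthy.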
Connecting the dots
This is the most important part of the analysis. How do we connect analysis insights to a real-life context and make actionable recommendations? Regardless of your project's focus, whether it's machine learning, deep learning, or analytics, ask yourself: what problem is your analysis or model trying to solve?
Imagine that we built a highly complex model to predict how many Medium readers will clap for your blog. Okay, so why is this important?
Link it to potential impacts! If your post receives more endorsement through claps, it may get curated and featured more often on the Medium platform. And if more paying Medium readers find your blog, you can probably earn more money through the Medium Partner Program. Now that's an impact!
However, it's not always about profit-driven impact; it could also be a social, health, or even environmental impact. This is just one example of how you can connect technical concepts with real-world implementation.
You may hit a wall at some points during the journey. My best piece of advice is to proactively seek help!
Besides reaching out to friends, colleagues, or mentors for advice, I often find it helpful to search or post questions on online Q&A platforms like Stack Overflow, Stack Exchange, GitHub, Quora, Medium, you name it! While seeking solutions, be patient and creative. If the online solutions have not yet solved your problem, try to adapt them to the characteristics of your data or the version of your code.
The art of writing is rewriting.
When I published my first data blog on Medium, I found myself revisiting the post and fixing sentences or wording here and there. Don't be discouraged if you notice typos or grammar mistakes after releasing it; you can always go back and edit!
Since it is your personal project, there's no obligation to finish it. Hence, prioritization and discipline play a crucial role throughout the journey. Set a clear goal for your project and lay out a timeline to achieve it. At the same time, don't spread yourself too thin, since that may cause you to lose interest.
Understand your timeline and capacity! I often push my personal projects in sprints of two to four weeks, finishing during breaks or on weekends. To organize your sprints and track your progress, you can borrow from Agile frameworks supported by collaboration software like Trello or Asana. As long as you make progress, even the smallest bit, your success will flourish someday. So keep going and don't give up!
The first step is always the hardest. If you don't think the project is ready yet, give yourself some time to fine-tune it, then share it!
Nothing will be perfect at first. But by shipping it to your audience, you will learn what to improve for later projects, a principle I adopted wholeheartedly from product management.
I used to struggle with communicating my thoughts in a structured and clear way (which I'm still working to improve), but by pushing myself out of my comfort zone, I have come a long way from where I started. I hope this, to some degree, inspires you to start your first data blog. Believe in yourself, be brave, and reach out to me or anyone in your network if you need help along the way!
“Faith is taking the first step even when you don’t see the whole staircase.” — Martin Luther King Jr.
Photo by Glen McCallum via Unsplash
By Giang Nguyen
Sourced from Towards Data Science