So you want to become a data scientist? Here’s where to start!
Welcome to my blog! I am a PhD student at Duke University studying computational cognitive neuroscience. About 1.5 years ago I became interested in data science, and I have been obsessed with it ever since (especially deep learning!). Studying data science has been rewarding and fun, and has even taken my PhD research in directions I never expected. On this blog I’m going to publish posts about my general experiences with data science, probably with more of an emphasis on deep learning (since deep learning is going to change the world!). I may even have a few posts on how I apply machine learning methods in my PhD research, such as using deep convolutional neural networks to examine whether computers and humans see similarly (spoiler: they do to some extent).
For this first post, I would like to share resources I studied (and continue to study) to learn data science. This is by no means the best way to learn data science methods, and I would love for readers to make other recommendations. I should also note that data scientists come in many different flavors (data science is a very general term), so the tools used by one data scientist may be completely different from those used by another. Again, my bias is toward deep learning. So here is a list of a few resources I have used over the past 1.5 years that I have found extremely helpful:
1. Python for Everybody
Before I started studying data science, I predominantly coded in MATLAB. It turns out that MATLAB is not popular in industry and not great for machine learning. If you want to become a data scientist (or at least use data science tools), I would recommend learning Python or R. I decided to go the Python route (as can be seen from my recommendation). I think Python is especially great for researchers interested in deep learning, as the best deep learning libraries (e.g., PyTorch, TensorFlow) are primarily used through Python. So if you want to learn Python, check out Python for Everybody. The video lectures and other materials are completely free and assume you have no prior experience with Python. The class is also offered on Coursera if you would like to receive a certificate.
2. Python Data Science Handbook by Jake VanderPlas
After learning the fundamentals of Python, it will be time to learn the Python toolboxes commonly used by data scientists (e.g., NumPy, Pandas, Matplotlib, Scikit-Learn). The Python Data Science Handbook is a great book for learning many of these toolboxes. The book even has a chapter on machine learning. You can purchase this book, but it is also freely available on the author’s website. The one drawback of this book is that it does not have any practice exercises. In my experience of learning to code (and really anything), the real learning does not happen until you actually try to code on your own. So you will need to find your own data to practice on (I discuss this more below). Related to this book, I have also heard great things about Python for Data Analysis, so it is probably worth checking out as well.
3. Andrew Ng’s Machine Learning Course
The Python Data Science Handbook does cover machine learning, but to learn more, check out Andrew Ng’s famous Coursera machine learning course. This course does a great job of explaining the fundamentals of machine learning and covers many popular machine learning algorithms. As a heads up, the practice exercises are unfortunately coded in MATLAB, but they are still worth doing. You could even code them in Python, like this person did: http://www.johnwittenauer.net/machine-learning-exercises-in-python-part-1/
4. fast.ai
As I mentioned earlier in this post, I am obsessed with deep learning. I find deep learning to be one of the most exciting branches of data science and I am convinced deep learning is going to change the world (it already has in many ways!). If you want to learn deep learning, I highly recommend taking the fast.ai MOOC. This course was first taught in Keras and is now being taught with PyTorch. The instructors take a top-down teaching approach, in which the course starts off by showing you how to train a state-of-the-art deep learning model, and as the course progresses you learn more of the details. I am currently taking the first part of this course, and I believe they will be posting the second part (the updated version taught in PyTorch) sometime in the summer.
5. Practice
As I mentioned before, the real learning occurs when you practice the skills that you have learned. So practice and code as much as you can. This can be at your job or in your personal life. For example, I recently created a classifier to determine whether an image contained a clownfish or a damselfish (yes, I do have a saltwater aquarium with those two types of fish). Another great place to practice machine learning is Kaggle. Corporations host competitions, which typically involve building a model to make some type of prediction (e.g., the State Farm Distracted Driver Detection competition). Many users post their ideas, so this is great for learning what other machine learning practitioners are doing and for receiving feedback on your own work.
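To give a flavor of how small a first practice project can be, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is only an illustration under stated assumptions: the two features (average redness and blueness of an image), the toy numbers, and the function names `fit_centroids` and `predict` are all made up here, and this is not the image classifier mentioned above, which would need real images and most likely a deep learning library.

```python
from math import dist
from statistics import mean


def fit_centroids(samples, labels):
    """Average the feature vectors of each class into one centroid per label."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical toy data: (average redness, average blueness) per image.
train_x = [(0.9, 0.2), (0.8, 0.1), (0.2, 0.9), (0.1, 0.8)]
train_y = ["clownfish", "clownfish", "damselfish", "damselfish"]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, (0.7, 0.3)))  # prints: clownfish
```

Once something like this works on toy numbers, a natural next step is to swap in real data (for example, from a Kaggle competition) and compare against a library implementation.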
6. What am I planning to learn next?
As I already mentioned, I am currently taking the fast.ai MOOC. When the second part of the course is released, I plan on taking that. Fast.ai also has a machine learning course, which I will eventually take. I am also planning to learn SQL, which is often used by data scientists.
Final Thoughts
My last piece of advice to aspiring data scientists is this: don’t be scared to learn. There may be concepts or statistical toolboxes that seem complicated (and maybe they are), but I’ve come to find that these things are often not as hard as you think. If you stick with it, you will probably be able to figure it out.
I hope people find this useful and will post additional resources that could help aspiring (or even current) data scientists.