How to get started with Data Science?
“Data science”, “Artificial intelligence”, “Machine learning” – I’m sure you’ve heard all these buzzwords many times. Interest in these fields has increased significantly over the past five years. In medicine and healthcare they are starting to have a practical impact on both patients and healthcare professionals (read more in this Nature Medicine article). The foundation for all three fields is data – the ability to collect, analyse, and learn from it at scale. Data science is a multidisciplinary field that allows you to do just that. But how do you get started?
First, data science relies on computational methods, so learning to program is key (and even if you don’t have any data on hand, programming teaches you problem solving and creative thinking, which you can apply throughout your professional career). I’d recommend starting with Python – it’s easy to learn, there are plenty of resources around, and it’s very commonly used for data science problems. If you’re brand new to Python, Codecademy is a good place to start. Alternatively, have a look at this article for free online resources on learning Python. Some top tips: start with Python 3 (Python 2 reached end of life in January 2020 and is no longer supported, although you may still come across older code written in it), and for an easy and comprehensive installation of Python on your computer go for Anaconda.
Once you’ve got the basics down, you are ready for your first data science projects! You learn programming by doing, so it’s a good idea to start on some small practical tasks early in your programming journey. You can come up with your own projects (e.g. load a spreadsheet you have lying around and calculate/visualise some basic statistics, or download a public-domain book and find the most frequent phrases), or you can head to Kaggle Learn for structured learning paths.
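To give a flavour of what such a first project might look like, here is a minimal sketch of the “most frequent phrases” idea using only Python’s standard library. The function name `most_frequent_phrases`, its parameters, and the sample sentence are my own choices for illustration; for a real book you would read the text from a file instead.

```python
import re
from collections import Counter


def most_frequent_phrases(text, n_words=2, top=3):
    """Return the `top` most common phrases of `n_words` consecutive words."""
    # Lowercase the text and treat runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n_words over the word list to build the phrases.
    phrases = (
        " ".join(words[i:i + n_words])
        for i in range(len(words) - n_words + 1)
    )
    # Counter tallies each phrase; most_common returns (phrase, count) pairs.
    return Counter(phrases).most_common(top)


if __name__ == "__main__":
    # Illustrative sample text; swap in the contents of a book you downloaded.
    sample = "It was the best of times, it was the worst of times."
    for phrase, count in most_frequent_phrases(sample):
        print(f"{phrase}: {count}")
```

This deliberately ignores punctuation and sentence boundaries, which is a simplification; deciding how to handle those is a good next exercise.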
Two Python libraries used very commonly for data science projects are pandas (data wrangling) and scikit-learn (machine learning). Both come with excellent documentation and example tutorials. Another resource you will need to become familiar with is Stack Overflow – it’s a Q&A forum for all your programming needs. If your code fails with an error message, or you’re not sure how best to implement something, chances are that somebody has already asked a question about it on Stack Overflow. It might seem daunting at first, but the ability to formulate a search query, then find and implement a solution, is a core skill for any data scientist.
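As a rough illustration of how pandas and scikit-learn fit together, the sketch below builds a tiny table, summarises it with pandas, and fits a simple linear model with scikit-learn. The column names and values are invented purely for illustration (they are not real data), and it assumes both libraries are installed, which is the case with a standard Anaconda installation.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy values invented purely for illustration; not real measurements.
df = pd.DataFrame({
    "hours_studied": [1.0, 2.0, 3.0, 4.0, 5.0],
    "exam_score": [52.0, 54.0, 56.0, 58.0, 60.0],
})

# pandas: summary statistics (count, mean, std, ...) for each numeric column.
summary = df.describe()

# scikit-learn: fit a straight line predicting exam_score from hours_studied.
# The features are passed as a DataFrame (2D), the target as a Series (1D).
model = LinearRegression()
model.fit(df[["hours_studied"]], df["exam_score"])

# Predict for a new value, using a DataFrame with the same column name.
new_data = pd.DataFrame({"hours_studied": [6.0]})
predicted = model.predict(new_data)[0]

print(summary)
print(f"Predicted score after 6 hours: {predicted:.1f}")
```

In a real project you would typically load the table from a file (for example with `pd.read_csv`) and hold back part of the data to check how well the model performs.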
Data science is an exciting field and a gateway to innovation in healthcare and beyond. The only thing left to say is – enjoy, have fun, and good luck!
Other useful resources:
- Python exercises: https://www.w3resource.com/python-exercises/
- Python Tutor (a tool that visualises how code is executed line by line): http://pythontutor.com/
- “Python Cookbook” by Brian Jones and David Beazley (a programming cookbook is a book of quick solutions to common programming problems; O’Reilly publishes a lot of thematic programming books on Python, data science, machine learning etc.)
- “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy” by Cathy O’Neil
- XRDS (student magazine of the Association for Computing Machinery, an international professional body for anything computational): https://xrds.acm.org/
- Advanced Research Computing at the University of Leeds offers some training for programming and data analysis (staff and research students only): https://arc.leeds.ac.uk/training/
- The University of Essex runs an annual Analytics and Big Data summer school: https://www.essex.ac.uk/summer-schools/analytics-and-data-science