Johns Hopkins University is a world-renowned institution known for its excellent medical research facilities. While the study of biology and chemistry is at the center of medicine, mathematics and computer science are becoming just as important to the field. Quantitative research methods are key to the understanding of epidemiology and public health. The Department of Biostatistics at Johns Hopkins University offers an online Data Science program through Coursera. The field of Data Science is relatively new, and it involves computerized statistical analysis of massive data sets. The purpose of doing this is to extract information from the patterns found in the data, such as medical data. This is extremely useful in research, especially medical research, for studying the causes of diseases, the effects of drugs, the likelihood of developing certain illnesses, and much more.
I was curious about the field of Data Science and wanted to be involved in it. I started taking some of the online classes offered by Johns Hopkins University in my free time back in 2014. I was still doing my Bachelor of Science degree then, and I am starting my Master of Technology in September, so I have been very busy and could not fit more than a couple of classes every few months into my schedule. So far I have finished 7 courses, all with distinction, and I have 2 more to do, plus a capstone project, to graduate with a Specialization in Data Science.
This article is a review of the courses I have finished so far and what to expect from them.
Course #1: The Data Scientist’s Toolbox
This was the first course in the specialization, and it serves as an introduction to the tools used by data scientists. It provided examples of data sets, questions, and tools for data analysis. The main things to get from this course were the use of GitHub and RStudio: GitHub is a platform for hosting and version-controlling code, and RStudio is a development environment for statistical programming in R.
Course #2: R Programming
This course was an introduction to the programming language R, which is used by computer scientists and statisticians to write code that processes data sets. This course was fun, and I had some good conversations with the student community. The quizzes and assignments required an understanding of R and statistics concepts, as well as writing R code. This course, in my opinion, is the foundation for the rest of the courses, as students learn the basic functions and how to execute them using R.
Course #3: Getting and Cleaning Data
This course covered the different types and formats of data, how to obtain data, how to feed the data to R code, and how to tidy the data up to make it useful. Data can be acquired from different sources and comes in different formats, so it needs to be cleaned and formatted before it is easy to process and extract information from. The acquisition of web data, API data, spreadsheet data, database data, and first-hand data is covered in this course.
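To give a flavor of what tidying data involves, here is a minimal sketch. The course itself works in R; this version uses only Python's standard library, and the column names, the values, and the "NA" convention for missing entries are all made up for illustration.

```python
import csv
import io

# Hypothetical raw export: inconsistent header spacing and case,
# stray whitespace around values, and missing entries ("" or "NA").
RAW = (
    " Patient ID ,Age , Weight KG\n"
    " 1, 34 ,70.5\n"
    "2,,82.0\n"
    " 3, 51 ,NA\n"
)

def tidy_name(name):
    """Normalize a column header to lowercase snake_case."""
    return name.strip().lower().replace(" ", "_")

def to_number(text):
    """Convert a cell to float, treating blanks and 'NA' as missing (None)."""
    text = text.strip()
    if text in ("", "NA"):
        return None
    return float(text)

def clean(raw_text):
    """Parse the raw CSV text into a list of dicts with tidy names and values."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [tidy_name(h) for h in next(reader)]
    rows = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        rows.append({key: to_number(value) for key, value in zip(header, record)})
    return rows

rows = clean(RAW)

# Keep only the observations with no missing values.
complete_rows = [row for row in rows if None not in row.values()]
```

Whether to drop incomplete rows or fill them in depends on the analysis; dropping them here is just the simplest choice for a sketch.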
Course #4: Exploratory Data Analysis
This course covered the process of forming a hypothesis from data sets. Techniques for summarizing data, plotting systems, and data graphics are covered. This course was interesting, and I learned a lot about how to visualize statistics and how to form hypotheses about data sets. This course was less technical and more expressive, in my opinion. Communication is key in Data Science.
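As a rough idea of what summarizing a data set looks like in practice, the sketch below computes a five-number summary and a crude text histogram. It is written in Python with the standard library rather than the R plotting systems used in the course, and the heart-rate values are invented for illustration.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return min, first quartile, median, third quartile, and max."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points for quartiles
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
    }

def text_histogram(values, bin_width):
    """Count integer values per bin and draw one '#' per observation."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return [
        f"{start}-{start + bin_width - 1}: {'#' * counts[start]}"
        for start in sorted(counts)
    ]

# Hypothetical resting heart rates (beats per minute).
heart_rates = [62, 65, 68, 70, 71, 73, 74, 78, 85, 96]

summary = five_number_summary(heart_rates)
histogram = text_histogram(heart_rates, bin_width=10)
for line in histogram:
    print(line)
```

Even a plot this crude shows the long tail toward higher values, which is the kind of observation that leads to a hypothesis worth testing.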
Course #5: Reproducible Research
This course covered research techniques and research publishing standards. The course is applicable not only to data science, but to any research activity or scientific claim. Any scientific research has to be reproducible so that its claims and findings can be confirmed. Statistical tools, R code, and research document standards were covered to ensure that Data Science research is published in accordance with these standards for reproducibility. The final assignment of this course was the production of a research document using the standards taught in the course. The assignment was peer-reviewed, and each student was required to peer-review the work of at least 4 other students.
Course #6: Statistical Inference
This course covered statistical inference, which is the practice of drawing conclusions and scientific truths from data. Modes and models for doing this are covered, including Bayesian, likelihood, design-based, and frequentist theory. This course was one of the more technical ones, and it required a good understanding of statistics and mathematics. I enjoyed doing this course, as it provided me with a practical approach to things that data scientists deal with every day. The final assignment was peer-reviewed as well.
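One typical frequentist task of this kind is putting a confidence interval around a sample mean. The following is a minimal Python sketch, not the course's own R code. It assumes a normal approximation (for small samples a t-based interval would be more appropriate), and the measurements are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    centre = mean(sample)
    standard_error = stdev(sample) / sqrt(n)  # sample std dev / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error

# Hypothetical measurements, e.g. systolic blood pressure readings.
readings = [118, 122, 125, 130, 121, 127, 119, 133, 124, 128]

low_95, high_95 = mean_confidence_interval(readings)
low_99, high_99 = mean_confidence_interval(readings, confidence=0.99)
```

Asking for more confidence widens the interval, which is the trade-off between certainty and precision.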
Course #7: Regression Models
This course covered linear modelling and prediction. The course professor insisted that regression models are the single most important topic in data science. Needless to say, this class was very technical and there was a lot to cover. The final assignment was an investigation of data using residual and variability models. It was peer-reviewed as well.
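To illustrate the basic idea of a linear model and its residuals, here is a small ordinary least squares fit with a single predictor. This is a Python sketch under simple assumptions rather than the R workflow from the course, and the dose and response numbers are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """Observed value minus fitted value for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data: dose (x) against measured response (y).
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(dose, response)
errors = residuals(dose, response, intercept, slope)

# Use the fitted line to predict the response at a new dose.
predicted_at_6 = intercept + slope * 6
```

With an intercept in the model, the residuals of a least squares fit sum to zero, so any pattern left in them is a hint that a straight line is not the whole story.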
The courses I have done so far form the foundation of data science applications. I have 2 more courses remaining in this specialization before I do the capstone project. The remaining courses, Practical Machine Learning and Developing Data Products, are based on applications of data science and the automation of statistical processes. I am planning on doing these two courses soon, and I will write an update when I am done.