Data science is a discipline that uses scientific methods, processes and algorithms to extract meaningful information, knowledge and insights from structured and unstructured data.
The aim of this course is to provide insight into intermediate and advanced data science topics, using the Python programming language. The course will explore concepts such as machine learning, deep learning and natural language processing from a practical, hands-on point of view. The focus will be on tools and methods rather than on the underlying theory, so that the course remains accessible to an audience with a minimal mathematical background.
Experience in using a programming or scripting language is a must. Students should have mastered all the concepts explored in the course Python Programming for Data Science - Introduction.
In order to complete the assignment (and in order to get the full benefit from the course) students will need access to a computer capable of running the open-source software used in the course and access to the Internet. A limited amount of class time will be allocated to working on the class assignment, so students should ensure that they have access to a computer outside of class.
The course will rely on Jupyter Notebooks for interactive Python programming, as they are widely used in data science.
Before attending this course, prospective students should know:
- All the requirements and topics covered in the "Python Programming for Data Science - Introduction" course, i.e.:
- The fundamentals of linear algebra: what a matrix is and how matrix addition and multiplication are performed.
- The following fundamental concepts of statistics: mean, median, variance and standard deviation, interquartile range.
- The fundamentals of algebra: real and complex numbers, exponential and logarithm, and trigonometric functions.
- How to perform fundamental Python operations such as variable creation, numerical operations on scalars, vectors and matrices, iteration through a collection, and manipulation of elements in a collection.
- How to use NumPy and pandas to import a dataset and extract important statistics from it using techniques such as split-apply-combine (for example, finding the mean, median or max of a quantitative variable for each category in a categorical variable).
- Given a dataset, how to select the appropriate visualisation depending on the information to be conveyed, and how to use the matplotlib and seaborn libraries to draw it and add a title, captions and figure legends.
- How to create and add state and behaviour to a class in Python.
- How to use nltk or spaCy to preprocess a text and convert it to a numerical representation that can be manipulated by information retrieval algorithms (e.g. for sentiment analysis, semantic search or machine learning algorithms).
- What a derivative and a gradient are, at least conceptually or visually.
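
As a rough self-check for the split-apply-combine prerequisite listed above, the following minimal sketch computes the mean, median and max of a quantitative variable for each category of a categorical variable. The dataset and the column names `species` and `weight_kg` are made up for illustration and are not taken from the course material.

```python
import pandas as pd

# Hypothetical toy dataset: one categorical and one quantitative variable.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "cat"],
    "weight_kg": [4.0, 20.0, 5.0, 30.0, 6.0],
})

# Split the rows by category, apply three aggregations to the
# quantitative column, and combine the results into one table.
summary = df.groupby("species")["weight_kg"].agg(["mean", "median", "max"])

print(summary)
```

Students who can read this snippet and predict the shape of `summary` (one row per category, one column per statistic) are likely comfortable with this prerequisite.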