Nowadays, most scientific research supported by High Performance Computing (HPC) systems begins with large simulations and data collection campaigns from scientific instruments, followed by large data analytics workflows. This has motivated the convergence of HPC and Big Data analytics from both the hardware and the software points of view. On the one hand, supercomputer design has been pushed towards architectures that meet the needs of both numerical computation and Big Data analysis. On the other hand, many data analytics tools in the Big Data ecosystem, such as Dask, have been adapted to HPC systems.
Dask is an open-source library for parallel and distributed computing in Python. Dask extends familiar scientific Python tools such as NumPy arrays, Xarray, pandas DataFrames, and scikit-learn, among others, so that they can run in parallel on anything from a local machine to distributed systems such as the cloud and high-performance computing systems.
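As a rough illustration of what "extending" a NumPy collection can look like, the minimal sketch below wraps a NumPy array in a Dask array and computes a mean over it chunk by chunk. The array size, the chunk size, and the variable names are assumptions chosen for illustration; they are not taken from the tutorial labs.

```python
import numpy as np
import dask.array as da

# Illustrative input: one million floats held in a regular NumPy array.
data = np.arange(1_000_000, dtype=np.float64)

# Wrap the NumPy array as a Dask array split into 4 chunks of 250,000 elements.
chunked = da.from_array(data, chunks=250_000)

# Dask operations are lazy: this line only builds a task graph.
lazy_mean = (chunked ** 2).mean()

# compute() executes the graph; chunks can be processed in parallel.
result = lazy_mean.compute()
print(result)
```

With the default settings this runs on a local thread pool. The same code can, in principle, be pointed at a distributed cluster by configuring a different Dask scheduler, which is what makes the approach relevant to HPC systems.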
- Experience with Python and with installing libraries via pip or Anaconda is desirable but not required.
Access. Students must use their personal equipment.
- Users are expected to have administrator privileges on their machine.
- It is recommended to use a Linux-based OS.
- Users will need to install the dask[complete] Python library and, for Lab 2 only, nodejs via APT.
- Dask Fundamentals Tutorial for High Performance Computing https://github.com/DonAurelio/dask-tutorial-2023