Trends from FOSDEM20

Open Data Science Platforms

The research community around the world increasingly uses the “cloud” and “containers” to support reproducible data science research activities.

  • Researchers need “data science workflows” when working with data and publishing results
  • Peer review of research findings -> results need to be reproducible
  • Workflows increasingly involve not only open data but also reproducible analyses (code)
  • Convergence of HPC and interactive web applications - running together on containers.

Challenges for a researcher

Creating

  • How can I as a researcher create reproducible open research results? Tools and support?
  • What do I need to know - skills in “data carpentry” etc?

Sharing

  • What does it mean in practice if research funders require me to follow the FAIR principles?
  • How do I deal with “personal data”?
  • How do I share my analysis? Is it reproducible?

What are best practices and recommended workflows?

Trends at FOSDEM20

Canadian open science cloud

Compute Canada provides HPC infrastructures and support to every academic research institution in Canada.

Researchers are provided with a complete HPC cluster software environment including a Slurm scheduler (jobs), a Globus Endpoint (file sharing), JupyterHub, LDAP, DNS, and over 3000 research software packages.

https://fosdem.org/2020/schedule/event/magic_castle/

Compute Canada staff have been using this software (Magic Castle) to deploy ephemeral clusters for training purposes every other week for the past two years.

Takeaways

  • Félix-Antoine Fortin at Université Laval, Canada Digital Research Institute
  • 5 datacenters, usage free for researchers
  • 150 workshops per year, ephemeral accounts approx 3 days
  • Access for researchers through https / ssh / globus (GridFTP)
  • Audience could log in at “superman.calculquebec.cloud”

  • DNS names automated incl SSL/TLS termination
  • Demo: Google Talk -> Dialogflow -> Flask -> MagicCastle -> OpenStack
  • On GitHub: https://github.com/magic_castle
  • Frontend: JupyterHub, GPU support on Compute Nodes

Vienna Biocenter Open Science Cloud

Interactive applications on HPC systems

Exploratory data analysis has increased the demand for interactive tools. In the same way, workshops and other teaching events often benefit from immediate and on-demand access to preconfigured, interactive environments.

On-premise container orchestration is often preferable because it enables deploying interactive tools on existing compute infrastructure that provides access to both software packages and the data to be analysed.

Deploying on HPC batch systems specifically raises challenges in handling authentication, user identities, and job submissions.
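The identity and job-submission challenge can be illustrated with a minimal sketch. It assumes a web front-end that has already authenticated the user and then starts each interactive session as a Slurm batch job under that user's own account. The function name, the `sudo -u` approach, the default resources and the `jupyterhub-singleuser` launch command are assumptions for illustration, not the setup described in the talk.

```python
import re
import shlex

# Conservative login-name pattern: prevents a crafted username from being
# interpreted as extra command-line options.
_USERNAME = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def build_sbatch_command(username, app_command, *,
                         time_limit="02:00:00", cpus=2, memory="4G"):
    """Return an argv list that submits `app_command` as a Slurm job
    running under `username` (hypothetical helper)."""
    if not _USERNAME.fullmatch(username):
        raise ValueError(f"refusing suspicious username: {username!r}")
    return [
        "sudo", "-u", username,            # run sbatch as the end user
        "sbatch", "--parsable",            # print only the job id
        f"--job-name=interactive-{username}",
        f"--time={time_limit}",            # wall-clock limit for the session
        f"--cpus-per-task={cpus}",
        f"--mem={memory}",
        f"--wrap={app_command}",           # wrap the command in a batch script
    ]


if __name__ == "__main__":
    # Assumed launch command; only prints the command, does not submit it.
    cmd = build_sbatch_command("alice", "jupyterhub-singleuser")
    print(shlex.join(cmd))
```

Building the command as a list (rather than a shell string) keeps the user-supplied parts out of shell parsing; the result could be passed to `subprocess.run` on a host where Slurm and a matching sudo rule are available.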

Open source “science cloud” on Raspberry Pi

“Cheap” open source “carpentry cluster”

Data Science Toolbox

Commercial/corporate analytics tools

Within BI, existing commercial solutions include QlikTech, TIBCO Spotfire, Tableau, Cognos, SAS, etc. Downsides for the academic domain:

  • Vendor lock-in and license management requirements and costs.
  • Support for modern agile workflows, such as using GitHub Flow and reproducible containers (Docker Hub), is not built in
  • Support for reproducibility is increasingly required at the front lines of academic research, which uses data science approaches with R and Python for ML / AI.
  • Weak support for Big Data / Fast Data such as data analysis powered by Apache Spark
  • Sharing results and code openly is not well supported

Open source analytics environments

Beyond front-end “notebooks”

Technical solution in ÅBU

Open source-based platform for reproducible research including web-friendly data analysis:

  • Front-end for open data science KONTARION

  • Supports ML and AI workflows using R, Python (Jupyter), etc.; container-based and scales locally or in the cloud using Docker Swarm / Kubernetes (GPU scaling and Slurm jobs)

  • Add-on: Backend for big data using Apache Spark + MinIO + S3 Select
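As a rough sketch of how a Spark backend is typically pointed at an S3-compatible object store such as MinIO, the snippet below assembles the Hadoop S3A settings as Spark configuration keys. The helper names, the endpoint `http://minio:9000` and the placeholder credentials are assumptions for illustration; the actual configuration of the platform described here is not given in the slides.

```python
def s3a_spark_conf(endpoint, access_key, secret_key):
    """Spark settings that route s3a:// paths to an S3-compatible endpoint
    (hypothetical helper; keys are standard Hadoop S3A properties)."""
    use_tls = endpoint.startswith("https://")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Self-hosted stores are usually addressed as host/bucket,
        # not bucket.host, so path-style access is switched on.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "true" if use_tls else "false",
    }


def as_spark_submit_args(conf):
    """Flatten the settings into `--conf key=value` pairs for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Assumed endpoint and placeholder credentials, for illustration only.
    conf = s3a_spark_conf("http://minio:9000", "EXAMPLE_KEY", "EXAMPLE_SECRET")
    print(" ".join(as_spark_submit_args(conf)))
```

Keeping the settings in a plain dictionary means the same values can be handed to whichever Spark client is in use; in a real deployment the credentials would come from a secret store rather than source code.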

About KONTARION

A “data science software stack” extending https://www.rocker-project.org/ with domain-specific functionality.

Web-friendly data analytics environment providing RStudio Open Source Edition (AGPL v3) and Jupyter.

Containerized Open Data Science Analytics front-end with sparklyr connection to Big Data backend.

  • interactive web-based analytics (runtimes for Shiny and Dash)
  • running “jobs”/tasks
  • deploying and using APIs for open data
  • markdown-based authoring of content

Hybrid cloud big/fast data backend

Rationale

Handbook

Questions?

“Use cases”