News

Workflow Paper at arXiv

2020-09-07

Christopher Schiefer has published a paper on porting a variant detection workflow to Saasfee, a Cuneiform-enabled workflow system, and on running it in a massively parallel environment.

Portability of Scientific Workflows in NGS Data Analysis: A Case Study

The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtime. Porting a workflow developed for a particular system on a particular hardware infrastructure to another system or to another infrastructure is non-trivial, which poses a major impediment to the scientific necessities of workflow reproducibility and workflow reusability. In this work, we describe our efforts to port a state-of-the-art workflow for the detection of specific variants in whole-exome sequencing of mice. The workflow was originally developed in the scientific workflow system snakemake for execution on a high-performance cluster controlled by Sun Grid Engine. In the project, we ported it to the scientific workflow system SaasFee, which can execute workflows on (multi-core) stand-alone servers or on clusters of arbitrary size using Hadoop. The purpose of this port was to enable owners of the low-cost hardware infrastructures that Hadoop was made for to use the workflow as well. Although both the source and the target system are called scientific workflow systems, they differ in numerous aspects, ranging from the workflow languages to the scheduling mechanisms and the file access interfaces. These differences resulted in various problems, some expected and others unexpected, that had to be resolved before the workflow could be run with equal semantics. As a side effect, we also report cost/runtime ratios for a state-of-the-art NGS workflow on very different hardware platforms: a comparatively cheap stand-alone server (80 threads), a mid-cost, mid-sized cluster (552 threads), and a high-end HPC system (3784 threads).

Tutorial Racketfest 2020

2020-01-01

On 2020-02-27 I will give a tutorial on Typed Racket at Racketfest 2020.

White-Belt Typed Racket

Website Reboot

2019-12-31

I have rebooted the Cuneiform website to improve my publishing workflow and to accommodate planned changes.

One reason for the reboot is that I grew uncomfortable with the Jekyll-based website because I found it hard to work around Ruby’s iffiness. Somehow my distribution’s package manager apt, Ruby’s package manager gem, and Bundler disagreed about which software was installed on my system and in which version. I do not want to micro-manage Ruby dependencies anymore. There are other static site generators with a more convenient workflow.

The second reason for the reboot is that I want Cuneiform’s website and its documentation to be generated from a single source. Until now, I had a Jekyll-based home page and Sphinx-based documentation. Migrating everything to Sphinx allows me to build the whole website from a single source.

As a consequence, the site should have smoother navigation. However, I also have to redo a lot of the content. If something you have relied on is missing from the website, I apologize. I am working to get the full content back up as soon as possible.

Tutorial Racketfest 2019

2019-03-23

On 2019-03-23 I held a tutorial on parallel computation with Racket places at `Racketfest <https://racketfest.com>`_ in Berlin.

Parallelism in Racket

We use parallel computation, independent state, and message passing every day. Our world is wired around webservers and load-balancers. Racket allows us to create our parallel computation network inside a laptop.

In this tutorial we give a short introduction to Racket places. We introduce the basic concepts for independent computation and approach step-by-step how to use Racket places. Lastly, we play with examples including a worker pool manager. Essentially, we fork-bomb Racket.

Talk Afterwork Racket 2

2019-01-15

On 2019-01-15 I gave a talk at After Work Racket Episode 2 in Mainz, Germany.

Redex Language Models for Exploring a Distributed Programming Language

Redex is a Racket library for modeling and exploring programming languages. Redex allows us to define an abstract syntax, a reduction relation, and judgment forms for type rules. I have used Redex to model Cuneiform, a distributed functional language. We discuss a simplified Redex model of Cuneiform and address how the language interacts with a distributed execution environment.

Talk at Eighth RacketCon

2018-09-29

On 2018-09-29 I gave a talk at the Eighth RacketCon held from 2018-09-29 to 2018-09-30 in St. Louis, Missouri.

Petri Net Flavored Places: An Advanced Transition System for Distributed Computing in Racket

Places provide a way to specify parallel and distributed computation in Racket. Using places, we can set up independent services that communicate via channels. Often state machines are used in this setting to model session state, resource allocation, or service start order. However, plain state machines often suffer from the explosion of the state space as soon as multiple state variables appear in combination. Moreover, some applications have only an infinite representation if modeled as a state machine.

In this talk, we address these challenges by introducing pnet, a Racket library that lets us define Racket places as Petri nets. Petri nets are a class of transition systems representing state as tokens produced and consumed by transitions. We look at several examples of Petri nets as Racket places, such as a worker pool, and consider the possibility of driving the pnet library from a tailor-made DSL or using it to construct distributed programming languages.

Petri Net Paper At Erlang Workshop 2018

2018-09-29

Our short paper has been accepted at the Seventeenth ACM SIGPLAN Erlang Workshop 2018 and I presented it on 2018-09-29 in St. Louis, Missouri.

Modeling Erlang Processes as Petri Nets

Distributed systems are more important in systems design than ever. Partitioning systems into independent, distributed components has many advantages but also brings about design challenges. The OTP framework addresses such challenges by providing process templates that separate application-independent from application-specific logic. This way, the OTP framework hosts a variety of modeling techniques, e.g., finite state machines.

Petri nets are a modeling technique especially suited for distributed systems. We introduce gen_pnet, a behavior for designing Erlang processes as Petri nets. We give a short introduction to Petri net semantics and demonstrate how Erlang applications can be modeled as Petri nets. Furthermore, we discuss two Erlang applications modeled and implemented as Petri nets. For both applications we introduce a Petri net model and discuss design challenges.
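
To give a flavor of the idea without reproducing the paper, the following plain-Erlang sketch implements the bare Petri net firing rule and applies it to a tiny worker-pool marking. It deliberately does not use the gen_pnet API; the module and function names (petri_toy, fire/2, demo/0) are made up for this illustration::

  %% petri_toy.erl -- state is a multiset of tokens on places; a transition
  %% consumes tokens from its preset and produces tokens on its postset.
  -module(petri_toy).
  -export([demo/0, fire/2]).

  %% A marking maps place names to lists of tokens.
  %% A transition is {Preset, Postset}: tokens to consume and tokens to produce.
  fire({Preset, Postset}, Marking) ->
    case consume(Preset, Marking) of
      {ok, Marking1} -> {ok, produce(Postset, Marking1)};
      disabled       -> disabled
    end.

  consume([], Marking) ->
    {ok, Marking};
  consume([{Place, Token} | Rest], Marking) ->
    Tokens = maps:get(Place, Marking, []),
    case lists:member(Token, Tokens) of
      true  -> consume(Rest, Marking#{Place => lists:delete(Token, Tokens)});
      false -> disabled
    end.

  produce([], Marking) ->
    Marking;
  produce([{Place, Token} | Rest], Marking) ->
    Tokens = maps:get(Place, Marking, []),
    produce(Rest, Marking#{Place => [Token | Tokens]}).

  %% demo/0 models a tiny worker pool: the transition is enabled only while
  %% an idle worker token and a queued task token are both present.
  demo() ->
    Marking = #{idle => [w1, w2], queue => [t1], busy => []},
    Request = {[{idle, w1}, {queue, t1}], [{busy, {w1, t1}}]},
    fire(Request, Marking).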

Talk Code Beam STO 2018

2018-06-01

On 2018-06-01 I gave a talk on Petri nets in Erlang at Code BEAM STO 2018, held from 2018-05-31 to 2018-06-01 in Stockholm, Sweden. The video is available online.

Beyond State Machines: Services as Petri Nets

An important design principle in Erlang is the integration of domain models as OTP behaviors. Services managing session state, resource allocation, or service start order are often modeled as state machines, and several OTP behaviors around state machines have been conceived for Erlang. However, plain state machines often suffer from the explosion of the state space as soon as multiple state variables appear in combination. Moreover, some applications have only an infinite representation if modeled as a state machine.

We address these challenges by introducing gen_pnet, a new OTP behavior based on Petri nets. Petri nets are a class of transition systems generalizing service state by representing state as tokens produced and consumed by transitions. We consider several examples of Petri nets in gen_pnet, such as a worker pool, a scheduler, and a distributed programming language. We also discuss how the Erlang error handling facilities interact with gen_pnet and how unit testing works.

Cuneiform Semantics Paper Accepted at Journal of Functional Programming

2017-07-21

Our paper has been accepted at the Journal of Functional Programming and will be part of a Big Data special issue.

Our Cuneiform semantics paper is available online in the Journal of Functional Programming volume 27, 2017.

Computation Semantics of the Functional Scientific Workflow Language Cuneiform

Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view on external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher-order operators for processing large partitioned data sets, allows conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines.

This paper introduces the computation semantics for Cuneiform. It presents Cuneiform’s abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of the behavior of Cuneiform eases the implementation of interpreters, which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform’s syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language’s safety up to black-box operators. Lastly, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.
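
To give a flavor of what a small-step evaluator written in Erlang can look like, here is a toy call-by-value evaluator for a tiny lambda-calculus fragment. It is only a sketch in the spirit of the paper, not the Cuneiform reference interpreter; the module name and the term constructors are invented for this example::

  %% step_toy.erl -- a toy small-step, call-by-value evaluator over closed
  %% terms built from {str, S}, {var, X}, {lam, X, Body}, and {app, F, A}.
  -module(step_toy).
  -export([eval/1]).

  %% Values are string literals and abstractions.
  is_val({str, _})    -> true;
  is_val({lam, _, _}) -> true;
  is_val(_)           -> false.

  %% eval/1 drives step/1 until a value is reached.
  eval(E) ->
    case is_val(E) of
      true  -> E;
      false -> eval(step(E))
    end.

  %% step/1 performs one reduction: left-to-right, call-by-value.
  step({app, {lam, X, Body} = F, A}) ->
    case is_val(A) of
      true  -> subst(X, A, Body);   % beta reduction
      false -> {app, F, step(A)}    % evaluate the argument
    end;
  step({app, F, A}) ->
    {app, step(F), A}.              % evaluate the function position

  %% Substitution; capture avoidance is unnecessary here because only
  %% closed values are ever substituted.
  subst(X, V, {var, X})        -> V;
  subst(_, _, {var, _} = E)    -> E;
  subst(_, _, {str, _} = E)    -> E;
  subst(X, _, {lam, X, _} = E) -> E;
  subst(X, V, {lam, Y, B})     -> {lam, Y, subst(X, V, B)};
  subst(X, V, {app, F, A})     -> {app, subst(X, V, F), subst(X, V, A)}.

  %% Example: step_toy:eval({app, {lam, x, {var, x}}, {str, "hello"}})
  %% returns {str, "hello"}.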

Bux Hi-WAY Paper Accepted at EDBT 2017

2016-12-22

Marc Bux’s paper has been accepted at the EDBT 2017 conference. It presents Hi-WAY, a workflow execution environment based on Hadoop YARN that runs Cuneiform, among other workflow languages.

The paper is available online.

Hi-WAY: Execution of Scientific Workflows on Hadoop YARN

Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today’s data-driven science. However, existing scientific workflow management systems (SWfMSs) are often limited to a single workflow language and lack adequate support for large-scale data analysis. On the other hand, current distributed dataflow systems are based on a semi-structured data model, which makes integration of arbitrary tools cumbersome or forces re-implementation. We present the scientific workflow execution engine Hi-WAY, which implements a strict black-box view on tools to be integrated and data to be processed. Its generic yet powerful execution model allows Hi-WAY to execute workflows specified in a multitude of different languages. Hi-WAY compiles workflows into schedules for Hadoop YARN, harnessing its proven scalability. It allows for iterative and recursive workflow structures and optimizes performance through adaptive and data-aware scheduling. Reproducibility of workflow executions is achieved through automated setup of infrastructures and re-executable provenance traces. In this application paper we discuss limitations of current SWfMSs regarding scalable data analysis, describe the architecture of Hi-WAY, highlight its most important features, and report on several large-scale experiments from different scientific domains.

Talk Erlang Factory Lite 2016

2016-11-24

On 2016-11-24 I gave a talk at Erlang Factory Lite 2016 in Berlin.

Scalable Multi-Language Data Analysis on Beam: The Cuneiform Experience

The need to analyze large scientific data sets on the one hand and the availability of distributed compute resources with an increasing number of CPU cores on the other hand have promoted the development of a variety of systems for distributed data analysis. Erlang, a language focused on concurrency and asynchronous communication, is a perfect match for orchestrating concurrent, distributed computation. In this talk we discuss the building blocks constituting the distributed execution environment underlying the Cuneiform workflow language. We show how Erlang actors and behaviours can be composed to build a system for concurrently running workloads comprising a large number of independent tasks and accessing large amounts of data through distributed file systems.

Cuneiform is a minimal workflow language focused on parallelism and integration. Users create workflows by defining and calling deterministic, side effect-free tasks. The Cuneiform interpreter automatically derives task- and data parallelism in a workflow. Tasks can be defined in any given programming language. Thus, external tools and libraries can be integrated with minimal effort. We demonstrate how workflows integrating external libraries from the bioinformatics domain are specified in Cuneiform and how parallel execution takes place on the Erlang VM orchestrating distributed compute resources and file systems.
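
The black-box integration principle can be sketched in a few lines of plain Erlang: delegate the actual work to an external interpreter through a port and collect its output. This is an illustration of the idea only, not Cuneiform’s actual worker implementation; the module name, function names, and the /bin/bash path are assumptions for this sketch::

  %% foreign_task.erl -- run a Bash snippet as a black box via an Erlang port.
  -module(foreign_task).
  -export([run_bash/1]).

  %% run_bash/1 executes a Bash snippet and returns {ok, Output} or
  %% {error, ExitStatus, Output}.
  run_bash(Script) ->
    Port = open_port({spawn_executable, "/bin/bash"},
                     [{args, ["-c", Script]}, exit_status, binary,
                      stderr_to_stdout]),
    collect(Port, <<>>).

  %% collect/2 gathers output chunks until the external process exits.
  collect(Port, Acc) ->
    receive
      {Port, {data, Chunk}} ->
        collect(Port, <<Acc/binary, Chunk/binary>>);
      {Port, {exit_status, 0}} ->
        {ok, Acc};
      {Port, {exit_status, Status}} ->
        {error, Status, Acc}
    end.

  %% Example: foreign_task:run_bash("echo Hello $USER").
  %% Spawning several such calls as independent Erlang processes is where
  %% task parallelism comes from.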

Poster at BSR Winterschool 2016

2016-10-28

I presented a Cuneiform poster at the BSR Winterschool in Ede.

Cuneiform: Distributed Functional Programming and Foreign Function Interfacing

Cuneiform is a minimal workflow specification language with immutable state, lazy evaluation, lists, and second-order functions operating on lists. In this, it borrows from functional programming languages. Cuneiform deliberately constrains users to specify workflows in a parallelizable way. Its execution environment Hi-WAY runs on top of Hadoop. In addition, functions (tasks) can be defined in any given scripting language, e.g., Bash, R, or Python. This way, users can not only supplement features absent in native Cuneiform but also reuse any tool or library, no matter what API it requires.

Talk at Erlang User Conference 2016

2016-09-08

On 2016-09-08 I gave a talk at Erlang User Conference 2016 in Stockholm. The slides and the video are available online.

Scalable Multi-Language Data Analysis on Beam: The Cuneiform Experience

The need to analyze large scientific data sets on the one hand and the availability of distributed compute resources with an increasing number of CPU cores on the other hand have promoted the development of a variety of systems for distributed data analysis. Erlang, a language focused on concurrency and asynchronous communication, is a perfect match for orchestrating concurrent, distributed computation. In this talk we discuss the building blocks constituting the distributed execution environment underlying the Cuneiform workflow language. We show how Erlang actors and behaviours can be composed to build a system for concurrently running workloads comprising a large number of independent tasks and accessing large amounts of data through distributed file systems.

Cuneiform is a minimal workflow language focused on parallelism and integration. Users create workflows by defining and calling deterministic, side effect-free tasks. The Cuneiform interpreter automatically derives task- and data parallelism in a workflow. Tasks can be defined in any given programming language. Thus, external tools and libraries can be integrated with minimal effort. We demonstrate how workflows integrating external libraries from the bioinformatics domain are specified in Cuneiform and how parallel execution takes place on the Erlang VM orchestrating distributed compute resources and file systems.

Talk at Curry On 2016

2016-07-19

I presented at Curry On 2016, held in Rome from 2016-07-18 to 2016-07-19.

The video is available online.

Functional Programming and Foreign Language Interfaces: Essentials in Distributed Computing

The need to analyze massive scientific data sets on the one hand and the availability of distributed compute resources with an increasing number of CPU cores on the other hand have promoted the development of a variety of languages and systems for parallel, distributed data analysis.

In this talk we argue that both integrating existing tools and libraries and expressing complex workflows in a functional programming model are necessities in contemporary languages for distributed computing.

We demonstrate the usefulness of these features by the example of Cuneiform, a minimal functional language for large-scale scientific data analysis running on the Erlang VM. We discuss applications in bioinformatics and machine learning.

Irina Guberman Talks About Cuneiform at Erlang Factory 2016

2016-03-11

On 2016-03-11 Irina Guberman gave a talk about Cuneiform at Erlang Factory in San Francisco.

Cuneiform: DSLs for Scientific Workflows

The complexity of scientific workflow logic and the complexity of the underlying architectures will quickly become a nightmare if those two concerns aren’t properly separated. Scientific workflow DSLs to the rescue! One of the newest and hottest ones out there is Cuneiform, the first functional scientific workflow DSL. Currently, Cuneiform is being rewritten from Java to Erlang, a language much better suited for the job.

Cuneiform supports HTCondor as one of its possible backends. We’ll also talk about HTCondor, an oldie-but-goodie job scheduling and orchestration architecture created at UW-Madison, and how we use it for LiDAR image processing at HERE.

Andrey Salomatin Interviews Me on Concurrency

2016-02-29

Andrey Salomatin has interviewed me on concurrency.

Podcast: Concurrency: CSP and Actors.

Host: Andrey Salomatin

Multithreading is not the only approach we use to deal with concurrency. Single-purpose processes are our next frontier: processes that don’t have shared state and that coordinate by passing messages to each other.

We can build complex concurrent systems using the simple principles of CSP or the actor model. We break programs down into independent processes, each performing a specific job and talking to the others. How they talk to each other is the point of contention here. That’s where the differences between CSP and actors arise.
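
As a concrete taste of the actor side of this discussion, here is a minimal Erlang sketch of two processes that share no state and coordinate purely by message passing; the module and function names are made up for this example::

  %% ping_pong.erl -- two processes with no shared state, coordinating
  %% only by sending messages.
  -module(ping_pong).
  -export([start/0]).

  start() ->
    Pong = spawn(fun pong/0),
    Pong ! {ping, self()},          % send a message to the other process
    receive
      pong -> ok
    after 1000 ->
      timeout
    end.

  pong() ->
    receive
      {ping, From} -> From ! pong   % reply to whoever asked
    end.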

Cuneiform Website Online

2015-12-24

This site, cuneiform-lang.org, has gone online. It complements information material hosted at saasfee.io concerning the functional workflow language Cuneiform.

We will extend this site as we go, gradually filling the gaps.

Cuneiform at Erlang Factory Lite 2015

2015-12-01

On 2015-12-01 I gave a talk at Erlang Factory Lite 2015. The video and slides are available online.

Cuneiform: A Functional Workflow Language Implementation in Erlang

The need to analyze massive scientific data sets on the one hand and the availability of distributed compute resources with an increasing number of CPU cores on the other hand have promoted the development of a variety of languages and systems for parallel, distributed data analysis. Among them are data-parallel query languages such as Pig Latin or Spark as well as scientific workflow languages such as Swift or Pegasus DAX. While data-parallel query languages focus on the exploitation of data parallelism, scientific workflow languages focus on the integration of external tools and libraries. Cuneiform is a novel language for large-scale scientific data analysis that combines easy integration of arbitrary tools, treated as black boxes, with the ability to fully exploit data parallelism. We introduce a use case from next generation sequencing to discuss the way Cuneiform facilitates the reuse of existing software tools and the exploitation of data parallelism. Additionally, we discuss the way this language was specified in Erlang and compare this specification to previous approaches. Finally, we discuss Cuneiform’s architecture and the way it is implemented in terms of Erlang services.