The paper gives a formal semantics of Cuneiform and discusses the design decisions enabling this functional programming language to make use of large-scale distributed compute infrastructures.
Computation Semantics of the Functional Scientific Programming Language Cuneiform
Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view of external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher-order operators for processing large partitioned data sets, supports conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines.
This paper introduces the computation semantics for Cuneiform. It presents Cuneiform’s abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of the behavior of Cuneiform eases the implementation of interpreters, which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform’s syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language’s safety up to black-box operators. Lastly, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.