Applications running on clusters of shared-memory computers are often implemented using OpenMP+MPI. Productivity can be greatly improved using task-based programming, a paradigm where the user partitions the computation into tasks and expresses the data and control-flow relations between them. While programmer productivity is increased, high-performance execution remains challenging: implementing parallel algorithms typically requires careful task placement and communication strategies to reduce inter-node communication and exploit data locality. Furthermore, efficient implementations of the individual tasks remain essential, with pattern-specific and target-specific code optimizations typically needed to make task bodies run fast. In this talk we present a new macro-dataflow programming environment for distributed-memory clusters, based on the Intel Concurrent Collections (CnC) runtime. We introduce a compiler based on the polyhedral compilation model that automatically generates high-performance code for programs with polyhedral dataflow, implementing key optimizations such as task coarsening and coalescing. We also outline several high-performance code generation approaches, demonstrating how optimizing compilation can be used to generate efficient code for CPUs, GPUs and FPGAs.
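
To make the macro-dataflow model concrete, the sketch below shows a minimal CnC-style specification in C++: a tag collection prescribes a step collection (control flow), and item collections carry the data flowing between task instances. The collection names and the trivial "scale" step are hypothetical illustrations, not the talk's actual benchmarks, and the usage follows the Intel CnC C++ interface as commonly documented; treat it as a sketch under those assumptions.

    // Minimal CnC-style sketch (hypothetical names; assumes the Intel CnC C++ API:
    // CnC::context, step/item/tag collections, prescribes/put/get).
    #include <cnc/cnc.h>
    #include <cstdio>

    struct scale_context;

    // A step: reads one input item, writes one output item.
    struct scale_step {
        int execute(const int& tag, scale_context& ctx) const;
    };

    struct scale_context : public CnC::context<scale_context> {
        CnC::step_collection<scale_step> steps;   // task bodies
        CnC::item_collection<int, double> in;     // dataflow: inputs
        CnC::item_collection<int, double> out;    // dataflow: outputs
        CnC::tag_collection<int> tags;            // control: which task instances exist

        scale_context()
            : steps(*this), in(*this), out(*this), tags(*this) {
            tags.prescribes(steps, *this);        // each tag spawns one step instance
        }
    };

    int scale_step::execute(const int& tag, scale_context& ctx) const {
        double x;
        ctx.in.get(tag, x);          // waits until the input item is available
        ctx.out.put(tag, 2.0 * x);   // publish the result for downstream consumers
        return CnC::CNC_Success;
    }

    int main() {
        scale_context ctx;
        for (int i = 0; i < 4; ++i) {
            ctx.in.put(i, i + 0.5);  // provide input data
            ctx.tags.put(i);         // prescribe task instances
        }
        ctx.wait();                  // run to quiescence
        for (int i = 0; i < 4; ++i) {
            double y;
            ctx.out.get(i, y);
            std::printf("out[%d] = %f\n", i, y);
        }
        return 0;
    }

In the full environment described in the talk, such a graph would be partitioned across nodes by the runtime, while the polyhedral compiler would handle task coarsening, coalescing and the generation of optimized step bodies for the chosen target.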