Data Engineering
Apache Spark is a major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. But how achievable are these speeds, and what can you do to avoid memory errors?
In this post, I will use a real example to introduce two mechanisms of data movement within Spark and demonstrate how they form the cornerstone of its performance.