Apache Spark has quickly become the largest open source project in big data, with over 750 contributors from 200 companies. It's easy to see why: it is a data processing platform whose client code can be written in Java, Scala or Python. It can do MapReduce-style processing like Hadoop, but because it keeps data in memory wherever it can, it is commonly much faster - typically 10 to 100 times faster.
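
To give a flavour, here is the classic MapReduce word count in Spark's Scala API - a minimal sketch, with a made-up input path. Note the cache() call, which keeps the result in memory for reuse:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc   = new SparkContext(conf)

    val lines  = sc.textFile("hdfs:///data/input.txt")  // hypothetical input path
    val counts = lines
      .flatMap(_.split("\\s+"))      // map: split each line into words
      .map(word => (word, 1))        // map: emit (word, 1) pairs
      .reduceByKey(_ + _)            // reduce: sum the counts per word

    counts.cache()                   // keep the result in memory for reuse
    counts.saveAsTextFile("hdfs:///data/output")  // hypothetical output path

    sc.stop()
  }
}
```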

Spark can run on YARN (Hadoop's cluster resource manager) or standalone, and comes with out-of-the-box libraries such as MLlib for machine learning and GraphX for graph processing. IBM has also announced that 15 of its core analytics libraries, including its SPSS predictive analytics portfolio, are now integrated with Spark.
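
As a taste of what "out of the box" means, here is a minimal sketch of MLlib's k-means clustering, runnable from the Spark shell (which provides sc for you). The points and parameters below are made up for illustration:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),  // one cluster near the origin
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)   // another near (9, 9)
))

val model = KMeans.train(points, 2, 20)  // k = 2 clusters, at most 20 iterations
model.clusterCenters.foreach(println)    // expect centres near (0.05, 0.05) and (9.05, 9.05)
```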

But three things really seal the deal. Firstly, it can scale out to over 8,000 nodes and process petabytes of data, and it provides good tools for easy deployment and management. Secondly, it comes with an interactive shell where the user can instantly run functions and try things out in Scala or Python. Last but by no means least - because of all the built-in functions - it is so easy to write loosely coupled code that runs in parallel that it can speed up the development cycle for rolling out new functionality, which lets developers concentrate on building new features for users.
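
To illustrate the second point, a session in the Scala shell (spark-shell) might look like this - the output shown is illustrative, but the calls are real:

```scala
// spark-shell provides a ready-made SparkContext as `sc`
scala> val nums = sc.parallelize(1 to 1000000)

scala> nums.filter(_ % 2 == 0).count()   // runs in parallel across the cluster
res0: Long = 500000

scala> nums.map(n => n * n).take(3)      // try things out instantly
res1: Array[Int] = Array(1, 4, 9)
```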

Apache Spark won't solve every problem, and there are plenty of use cases where Apache Storm is more relevant, such as where low latency is key (even Spark Streaming doesn't quite match the latency Storm can achieve, as Spark Streaming is effectively micro-batching rather than true event-triggered processing). But if you want speed of development, deployment, and throughput - you really should consider Apache Spark.
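
To make the micro-batching point concrete, here is a minimal Spark Streaming sketch (the socket source and the one-second interval are illustrative): incoming events are buffered and processed in one-second batches, so end-to-end latency can never drop below the batch interval - unlike Storm, which processes each event as it arrives.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchDemo")
val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999) // hypothetical text source
lines.flatMap(_.split(" "))   // each batch is processed as a small RDD
     .map((_, 1))
     .reduceByKey(_ + _)
     .print()                 // print word counts for each 1-second batch

ssc.start()
ssc.awaitTermination()
```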

Andrew