Data Engineering

An exploration of a lightweight, open-source alternative to traditional SaaS data engineering platforms, highlighting the benefits and trade-offs of each approach.

Benjamin Logan · 12^th May · 9 min read

Data Engineering

Solving Data Consistency in Distributed Systems with the Transactional Outbox

Distributed systems often struggle with data consistency. In this post, I explore how the Transactional Outbox pattern helped us solve this challenge in a client project, and how it compares to CDC and Event Sourcing.

Matthew Dunsdon · 8^th Sep 2025 · 8 min read

Tech

Why you should rethink legacy and consider Event-Driven Architecture

In this post, I describe how your business can assess whether a system is ready for modernisation and, if so, how to set your project up for success. I then explain why, in most cases, you’ll probably want to take an incremental approach rather than replacing the old system in one fell swoop. I end by providing an example of one of the ways your business can do this – by using Event-Driven Architecture.

James Moore · 6^th Aug 2025 · 6 min read

Data Engineering

From Diligence to Exit: The Critical Role of Data in PE Investments

Data enables faster and more accurate due diligence, informs operational transformation post-acquisition, and supports more effective positioning when it comes time to exit. This post outlines the role of data across each of these key stages.

Colin Eberhardt · 7^th Jul 2025 · 3 min read

Data Engineering

What is a Data Lakehouse?

In this post, I explore what a Data Lakehouse is, how it works, and whether it delivers on its promises—covering core features, formats, real-world patterns, and platform realities.

Matt Richards · 20^th Jun 2025 · 7 min read

Podcast

Beyond the Hype: Event-Driven Architecture – The only data integration approach you need?

In this episode, I dive into the world of Event-Driven Architecture (EDA) with Tom Fairbairn from Solace and Scott Logic’s Gordon Campbell. The discussion explores whether EDA has matured beyond the hype into a practical strategy for modern systems integration, or if it’s just another architectural buzzword.

Oliver Cronk · 10^th Jun 2025 · 1 min read

Data Engineering

Are we ready to put AI in the hands of business users?

Lots of businesses want to use AI, if they can find the right business case for it. We look at some new and enhanced AWS products which take a low-or-no-code approach to using AI to enhance Business Intelligence tools.

Caitlin Salt, Sam Gladstone · 23^rd Apr 2024 · 6 min read

Data Engineering

Async APIs - don't confuse your events, commands and state

This blog is about the different types of message you can put on systems like RabbitMQ and Kafka. It discusses the differences between commands, events and state, and gives a few tips on how to structure your messages.

David Hope · 22^nd Apr 2024 · 10 min read

Data Engineering

Apache Spark - What does going from 2.4 to 3.5 get you?

We look at what has changed between Apache Spark 2.4.x and 3.5.1, describing some of the new functionality and the significant boost in performance.

Steve Conway, Mike Morgan · 22^nd Apr 2024 · 7 min read

Podcast

Beyond the Hype: Are Data Mesh and Data Fabric just Marchitecture?

In this episode, Oliver Cronk, Andrew Carr and David Hope talk about the ever-changing world of data, with conversations moving from data warehouse to data lake, and data mesh to data fabric. They discuss the importance of data ownership and common tooling, and their view that data mesh is an approach rather than an architecture.

Colin Eberhardt, Oliver Cronk, Andrew Carr, David Hope · 18^th Apr 2024 · 1 min read

Data Engineering

Cloud Business Intelligence: A Comparative Analysis of Power BI, QuickSight, and Tableau

A comparative analysis of three leading Business Intelligence tools: Microsoft Power BI, Amazon QuickSight and Tableau. We focus on cloud platform usage, and are interested in functionality and ease of use for novice BI users.

Mike Morgan, Steve Conway · 26^th Mar 2024 · 11 min read

Data Engineering

A quick tour of data distribution technologies

This blog discusses the different ways we might choose to distribute data between services, including queues and distributed log technologies, and their relative strengths and weaknesses.

David Hope · 14^th Nov 2023 · 12 min read

Data Engineering

Understand your data requirements

This blog discusses the different data requirements that exist in a typical organisation and provides some suggestions on how to classify them and match them to technologies.

David Hope · 7^th Nov 2023 · 11 min read

Tech

How to Make Your Own Search Engine: Semantic Search With LLM Embeddings

Understand how Google and other search engines use LLMs to gain insights into the semantic meaning of the language in search queries using embedding and cosine similarity.

William Booth-Clibborn · 11^th Aug 2023 · 11 min read

Data Engineering

Why rapid collaboration needs careful preparation

The pandemic response required a remarkable level of collaboration between and beyond government departments. In this blog post, I’m going to look at the Clinically Extremely Vulnerable People Service, outlining the different areas of collaboration upon which the service depended, and reflecting on the lessons that government can take forward to achieve its vision of a responsible, efficient and effective data ecosystem.

Jessica McEvoy · 24^th Jan 2023 · 5 min read

Data Engineering

How data literacy gives leaders the edge

Data-literate leadership underpinned the most successful pandemic-response programmes. In this post, I explore what data-literate leadership looks like, drawing on examples from the roundtables on data sharing in government that the Institute for Government ran in partnership with Scott Logic.

Jessica McEvoy · 12^th Jan 2023 · 6 min read

Data Engineering

Rules help you go faster

A few years ago while working on a digital product in a government department, my team learnt a valuable lesson: rules can help you go faster. In this post, I explain the positive difference that regulatory and legislative frameworks can make to the design and delivery of digital services, with some examples from the government's response to the pandemic.

Jessica McEvoy · 30^th Nov 2022 · 3 min read

Data Engineering

Why you should get the right people in the room from the start

Over the summer, in partnership with Scott Logic, the Institute for Government (IfG) ran a series of roundtable discussions with senior civil servants and government experts on the topic of Data Sharing in Government. This is the first in a series of blog posts in which I'll share some reflections on key themes that arose.

Jessica McEvoy · 18^th Nov 2022 · 4 min read

Data Engineering

How data has improved the amateur runner

As a keen amateur runner who somehow found myself with a qualifying time to stand on the start line of the Great North Run with the elite women, I take a look at the main ways data has aided my journey to that start line.

Molly Pace · 12^th Sep 2022 · 3 min read

Data Engineering

Battling the 5S's at the Data+AI summit

Highlights from a one-day training course on performance tuning with Apache Spark, delving into the five most common reasons for poor performance.

Zinat Wali · 4^th Jul 2022 · 7 min read

Data Engineering

Enabling the Government Data Strategy

The UK Government has an ambitious data strategy that aims to encourage and facilitate data sharing between departments and businesses. Elements of the strategy appear relatively straightforward, but how will the government fully realise the potential, and align citizens with this bold new approach?

Andrew Carr · 28^th Oct 2021 · 4 min read

Data Engineering

What actually is a Data Mesh? And is it really a thing?

Organisations across the globe have been on a journey to find the optimal approach for managing and leveraging analytics data. In this post, I’ll set out each of the key milestones on the journey, to arrive at the latest milestone – the Data Mesh paradigm – and ask whether it is really a thing.

Andrew Carr · 28^th May 2021 · 7 min read

Data Engineering

Big Data and the Testing Challenge

This blog is about tools that help address the challenge of testing systems which handle large data volumes. We’ll see why creating a large, realistic and valid test data set is hard, how test data generators can help, and compare some of those available.

Andy Hickman · 2^nd Jul 2020 · 7 min read

Data Engineering

Elasticsearch - clustering on AWS with optional auto-scaling

Create your own Elasticsearch cluster in the cloud in next to no time. Leverage ElasticHQ and CloudWatch logging to gain transparency. Excerpts from a client project.

Zinat Wali · 19^th Jul 2019 · 9 min read

Data Engineering

The £11k gas bill, customer satisfaction and improved interactions

What started as one faulty gas reading in the summer of 2017 ended up as a series of wasted calls where my bill kept getting higher and higher until it reached £11k. How could this have been handled faster, without leaving me considering a move to another energy provider?

Andrew Carr · 17^th Jul 2018 · 3 min read

Data Engineering

Comparing Apache Spark, Storm, Flink and Samza stream processing engines - Part 1

Distributed stream processing engines have been on the rise in the last few years. First Hadoop became popular as a batch processing engine, then the focus shifted towards stream processing engines. Stream processing engines can make the job of processing data that comes in via a stream easier than ever before.

Andrew Carr, Andy Aspell-Clark · 6^th Jul 2018 · 18 min read

Data Engineering

Comparing Apache Kafka, Amazon Kinesis, Microsoft Event Hubs and Google Pub/Sub

Distributed log technologies have matured in the last few years. In this article, I review the attributes of distributed log technologies, compare four of the most popular and suggest how to choose the one that's right for you.

Andrew Carr · 17^th Apr 2018 · 9 min read

Data Engineering

Apache Spark - Performance

Apache Spark is the major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. But how achievable are these speeds and what can you do to avoid memory errors? In this blog I will use a real example to introduce two mechanisms of data movement within Spark and demonstrate how they form the cornerstone of performance.

Mathew de Beneducci · 22^nd Mar 2018 · 5 min read

Data Engineering

Apache Spark - question everything

Spark is well known in Big Data for its incredible performance and expressive API. However, it just takes one small misstep to transform a massively parallel powerhouse into a pathetically poor performer. This post presents an example and the steps that can be taken to identify the problem.

Matt Sinton-Hewitt · 14^th Mar 2018 · 5 min read

Data Engineering

Looking At R

In this quick look at the R language and tools I'll look briefly at the syntax of the language and have a go at creating a few charts with a data set.

Dave Ogle · 31^st Jan 2018 · 13 min read

Data Engineering

Using Kafka and Grafana to monitor meteorological conditions

Apache Kafka provides a distributed log store used by increasing numbers of companies, often forming the heart of systems processing huge amounts of data. This post shows how to use it for storing meteorological data and displaying this in a graphical dashboard with Graphite and Grafana.

Oliver Kenyon · 13^th Oct 2017 · 14 min read

Data Engineering

Cassandra - Achieving high availability while maintaining consistency

A discussion about Cassandra consistency levels and replication factor, which are frequently misunderstood. This post explains the Cassandra infrastructure and how its configuration can be tuned.

Zinat Wali · 6^th Oct 2017 · 8 min read

Data Engineering

Chess data mining with Apache Spark and Lichess

Lichess makes over 100GB of chess games from 2017 available on their website. This post shows how this data can be transformed with Apache Spark and analysed. Something for Data Engineers and Chess Enthusiasts alike!

Bartosz Jedrzejewski · 1^st Sep 2017 · 9 min read

Data Engineering

The Big Data technologies that saved BP $7bn

Yesterday the Financial Times boldly declared that BP saved $7bn since 2014 by investing in Big Data technologies. I spent a couple of hours researching Big Data technologies associated with BP members of staff to try and build up a picture of exactly which technologies they are using.

Andrew Carr · 18^th Jul 2017 · 3 min read

Data Engineering

Successful microservices architecture with the Twelve-Factor App

Using microservices in your architecture is a very popular choice. Unfortunately, it is also challenging to get right. With the help of the Twelve-Factor methodology, I will tell you how to set yourself up for success rather than disappointment.

Bartosz Jedrzejewski · 17^th Jul 2017 · 7 min read

Data Engineering

Load testing Alteryx API with Gatling

A successful attempt at load testing the Alteryx API with Gatling, and a not-so-successful attempt with Apache JMeter.

Zinat Wali · 22^nd Jun 2017 · 6 min read

Data Engineering

Cassandra vs. MariaDB, Scaling

In this post we compare how Cassandra and MariaDB can be configured to operate in clusters and how this affects response time for queries. We found Cassandra to scale well and to be highly configurable. MariaDB can be used with Galera Cluster but it does not provide horizontal scaling. Also NDB can be used to scale MySQL but it was not as configurable as Cassandra.

James White, Dominic Ketley · 20^th Mar 2017 · 9 min read

Data Engineering

Cassandra vs. MariaDB

We've been comparing Cassandra and MariaDB in single-node setups, looking at performance and ease of use from a development perspective. In this article we explore the issues at play in such a setup, such as the differences in queries, speed of response and the features that separate these two technologies.

Dave Ogle, Laurie Collingwood · 1^st Mar 2017 · 12 min read

Data Engineering

Keeping Secrets in Docker

Docker 1.13 introduces a simple way of providing secrets to containers

Ross Hendry · 1^st Mar 2017 · 3 min read

Data Engineering

StreamSets with Docker - an example HDFS integration

StreamSets Data Collector (SDC) is an open source tool for stream-based extracting, transforming and loading of large quantities of data. It provides an easy-to-use UI on top of the underlying processing power of YARN and Spark Streaming, with a large number of installable integrations with source and destination systems.

Dominic Ketley, James White · 27^th Feb 2017 · 3 min read

Data Engineering

The Rise of Big Data Streaming

With the advent of the Internet of Things, the world of Big Data couldn't be more relevant. This post gives an overview of technologies that achieve processing at scale and in real time.

Daniel Cook · 7^th Feb 2017 · 5 min read

Data Engineering

How are businesses using Big Data?

Big Data can help businesses run more efficiently. Their main challenge is getting the best value from the data they have to turn it into actionable information

Tamara Chehayeb Makarem · 1^st Dec 2016 · 4 min read

Data Engineering

Spring Boot and MongoDB - a perfect match!

The popularity of Spring Boot in the Java world is undeniable. In this post I will show you yet another reason for this. Using Spring Boot makes working with MongoDB an absolute pleasure.

Bartosz Jedrzejewski · 22^nd Nov 2016 · 5 min read

Data Engineering

Using Rally to benchmark Elasticsearch queries

In this post I describe how to use Elastic's Rally to generate benchmarks for your private Elasticsearch queries and clusters. I'll be creating a benchmark which allows comparison of an unscored query with one where scoring is enabled.

Darren Smith · 22^nd Nov 2016 · 9 min read

Data Engineering

Docker 1.12 swarm mode - round robin inside and out

This post demonstrates how Docker 1.12 swarm mode round robins the containers in a service both for incoming connections (ingress) and DNS within the swarm.

Chris Smith · 30^th Aug 2016 · 6 min read

Data Engineering

Service discovery with Docker Swarm

For the last few months we've been working on a very DevOps-focused project. As such we've used AWS, infrastructure as code, Docker and microservices. The different microservices were initially all running on one box, each on a different port. This solution wasn't scalable or very practical. We couldn't have all our services on one machine, and it was getting tiresome and error-prone having to remember or look up which port each service was on. We needed our services to run on separate machines, and we needed a way to communicate with them without having to hard-code IP addresses or port numbers. What we needed was service discovery. As we had already been using Docker for each service, Docker Swarm was a natural candidate.

David Wybourn · 17^th Jun 2016 · 5 min read

Data Engineering

Bitcoin payments and the Lightning Network

This is the second blog post orientated around Bitcoin and its inner workings. The first post took the blockchain and broke down the algorithms which create the fundamental structure of any cryptocurrency. The post was separated into two sections; the first focusing on the block header and the second focusing on the construction of a transaction. If you are not comfortable with how the blockchain works, I suggest you read the first blog post before continuing.

James Hill · 16^th Jun 2016 · 13 min read

Data Engineering

Code reuse in microservices architecture - with Spring Boot

In most microservice architectures, there are many opportunities and temptations for sharing code. In this post I will give advice based on my experience on when it should be avoided and when code reuse is acceptable. The points will be illustrated with the help of an example Spring Boot project.

Bartosz Jedrzejewski · 13^th Jun 2016 · 10 min read

Data Engineering

Writing a Docker Volume Plugin for S3

An experiment in writing a volume plugin for Docker

Ross Hendry · 30^th May 2016 · 3 min read

Data Engineering

Log-driven big data: The ELK stack

An insight into the ELK stack and how we used it on a big data project

Alexander Cheshire · 26^th May 2016 · 4 min read

Data Engineering

Playing with Docker Compose and Erlang

This post uses Docker Compose to spin up a three-container HTTP server. One container services the HTTP requests and delegates work to two other containers in a load-balanced way. Erlang is used for development to add a bit of extra challenge!

Chris Smith · 25^th Jan 2016 · 10 min read

Data Engineering

Why Apache Spark is getting so much momentum behind it

Apache Spark has quickly become the largest open source project in big data, but why has it suddenly got so much momentum behind it?

Andrew Carr · 24^th Jan 2016 · 1 min read

Data Engineering

Introduction to Hadoop and MapReduce

Big data is one of those buzz phrases that gets thrown around a lot. Companies love saying they work with ‘Big’ data, but what is ‘Big’ data?

David Wybourn · 13^th Jan 2016 · 9 min read

Data Engineering

About

Welcome to my blog, Andrew's thoughts on Big Data. This page gives a little background on myself and the blog.

Andrew Carr · 1^st Jan 2016 · 1 min read

Data Engineering

Creating a High Performance Stock Ticker Using Haskell

This post demonstrates how to create an efficient stock ticker app using HTML5 WebSockets and a Haskell server.

Ian Sullivan · 15^th Nov 2015 · 7 min read

Data Engineering

Sharded Clusters in MongoDB - The Key Considerations

Sharded clusters enable the data persistence layer in MongoDB to be shared across several machines. In this post, we will look at the key considerations you should make before you use sharded clusters.

Matthew Dunsdon · 8^th Aug 2014 · 4 min read

Data Engineering

Real-time data analysis using Spark

Big Data is a hot topic these days, and one aspect of that problem space is processing streams of high velocity data in near-real time. Here we're going to look at using Big Data-style techniques in Scala on a stream of data from a WebSocket.

James Phillpotts · 29^th Jul 2013 · 12 min read

Data Engineering

Synchronise Heterogeneous Data Sources

With non-relational database implementations (key-store, graph, etc.) entering the mainstream, the necessity has arisen to synchronise relational databases to their non-relational cousins.

Nicholas Hemley · 20^th Mar 2012 · 9 min read

Scott Logic / Altogether Smarter
