Blog

Data Engineering

Distributed log technologies have matured in the last few years. In this article, I review the attributes of distributed log technologies, compare four of the most popular and suggest how to choose the one that's right for you.
Apache Spark is the major talking point in Big Data pipelines, boasting performance 10-100x faster than comparable tools. But how achievable are these speeds and what can you do to avoid memory errors? In this blog I will use a real example to introduce two mechanisms of data movement within Spark and demonstrate how they form the cornerstone of performance.
Spark is well known in Big Data for its incredible performance and expressive API. However, it takes just one small misstep to transform a massively parallel powerhouse into a pathetically poor performer. This post presents an example and the steps that can be taken to identify the problem.
In this quick look at the R language and tools I'll look briefly at the syntax of the language and have a go at creating a few charts with a data set.
Apache Kafka provides a distributed log store that is used by increasing numbers of companies and often forms the heart of systems processing huge amounts of data. This post shows how to use it for storing meteorological data and displaying it in a graphical dashboard with Graphite and Grafana.
A discussion about Cassandra consistency levels and replication factor, which are frequently misunderstood. This post explains the Cassandra infrastructure and how its configuration can be tuned.
Lichess makes over 100GB of chess games from 2017 available on their website. This post shows how this data can be transformed with Apache Spark and analysed. Something for Data Engineers and Chess Enthusiasts alike!
Yesterday the Financial Times boldly declared that BP saved $7bn since 2014 by investing in Big Data technologies. I spent a couple of hours researching Big Data technologies associated with BP members of staff to try and build up a picture of exactly which technologies they are using.
Using microservices in your architecture is a very popular choice. Unfortunately, it is also challenging to get right. With the help of the Twelve-Factor methodology, I will tell you how to set yourself up for success rather than disappointment.
A successful attempt at load testing the Alteryx API with Gatling, and a not-so-successful attempt with Apache JMeter.
In this post we compare how Cassandra and MariaDB can be configured to operate in clusters and how this affects response time for queries. We found Cassandra to scale well and to be highly configurable. MariaDB can be used with Galera Cluster, but it does not provide horizontal scaling. NDB can also be used to scale MySQL, but it was not as configurable as Cassandra.
We've been comparing Cassandra and MariaDB in single node setups, looking at each in terms of performance and ease of use from a development perspective. In this article we explore the issues at play in such a setup, such as the differences in queries, speed of response and the features that separate these two technologies.
Docker 1.13 introduces a simple way of providing secrets to containers
StreamSets Data Collector (SDC) is an open source tool for stream-based extraction, transformation and loading of large quantities of data. It provides an easy-to-use UI on top of the underlying processing power of YARN and Spark Streaming with a large number of installable integrations with source and destination systems....
With the advent of the Internet of Things, the world of Big Data couldn't be more relevant. This post gives an overview of technologies that achieve processing at scale and in real time.
Big Data can help businesses run more efficiently. Their main challenge is getting the best value from the data they have by turning it into actionable information.
The popularity of Spring Boot in the Java world is undeniable. In this post I will show you yet another reason for this. Using Spring Boot makes working with MongoDB an absolute pleasure.
In this post I describe how to use Elastic's Rally to generate benchmarks for your private Elasticsearch queries and clusters. I'll be creating a benchmark which allows comparison of an unscored query with one where scoring is enabled.
This post demonstrates how Docker 1.12 swarm mode round robins the containers in a service both for incoming connections (ingress) and DNS within the swarm.
This post describes the Concourse build system and explains why declarative CI / CD is so compelling. No more pet build servers!
For the last few months we’ve been working on a very DevOps focused project. As such we’ve used AWS, infrastructure as code, Docker and microservices. The different microservices were initially running all on one box, each with a different port. This solution wasn’t scalable or very practical. We couldn’t have...
This is the second blog post orientated around Bitcoin and its inner workings. The first post took the blockchain and broke down the algorithms which create the fundamental structure of any cryptocurrency. The post was separated into two sections; the first focusing on the block header and the second focusing...
In most microservice architectures, there are many opportunities and temptations for sharing code. In this post I will give advice based on my experience on when it should be avoided and when code reuse is acceptable. The points will be illustrated with the help of an example Spring Boot project.
An experiment in writing a volume plugin for Docker
An insight into the ELK stack and how we used it on a big data project
This post uses Docker Compose to spin up a three container HTTP server. One container services the HTTP requests and delegates work to two other containers in a load-balanced way. Erlang is used for development to add a bit of extra challenge!
Apache Spark has quickly become the largest open source project in big data, but why has it suddenly got so much momentum behind it?
What is ‘Big’ Data? Big data is one of those buzz phrases that gets thrown around a lot. Companies love saying they work with ‘Big’ data, but what is ‘Big’ data? When does data get so big that it can be called Big data? One Gigabyte? How about a Terabyte,...
Welcome to my blog, Andrew's thoughts on Big Data. This page gives a little background on myself and the blog. My name is Andrew Carr and I have been coding, and generally interested in technology, since the age of 12. I have been involved in professional Software Engineering for...
This post demonstrates how to create an efficient stock ticker app using HTML5 WebSockets and a Haskell server.
Sharded clusters enable the data persistence layer in MongoDB to be shared across several machines. In this post, we will look at the key considerations you should make before you use sharded clusters.
Big Data is a hot topic these days, and one aspect of that problem space is processing streams of high velocity data in near-real time. Here we're going to look at using Big Data-style techniques in Scala on a stream of data from a WebSocket.
With non-relational database implementations (key-store, graph, etc.) entering the mainstream, the necessity has arisen to synchronise relational databases to their non-relational cousins. Furthermore, a non-relational data source may be fitted retrospectively to an existing RDBMS deployment to leverage the benefits of a non-relational schema with only minor 'integration disturbance' to...