by: Kosta Zivic
According to multiple studies, 2.5 exabytes of data are produced on a daily basis, and that number is growing exponentially. Countless systems produce an avalanche of data that is of use to nobody unless it’s processed and analyzed effectively. With data, it’s all about three things:
In order to achieve all three things, people resort to scripting, hoping to find a quick but accurate and effective solution. And most of the time, it works! Scripting languages today are agile enough to handle a variety of problems, and yet robust enough to ensure long-term stability.
Let’s assume I’ve made you believe scripting is the way to go, and that you have tons of data lying around. It’s time to start digging with a powerful tool, but the question is – which one?
Content and Context matter!
The first question you should ask yourself is “What am I going to use it for?”
A common pitfall for young data engineers is that they hear some nice things about a language and reach for it right away. A scripting language is like a toolbox: it comes with a set of powerful tools to get the job done, but it’s only useful if you choose the right tool for the right job.
The answer depends heavily on the data you have. Different types of data require different approaches, and in most cases putting your data in context drives you toward an elegant solution. The data has to mean something to begin with. Additionally, what’s actually between those commas in a much-loathed CSV file should give you an idea of how to make the most of it. Remember: content and context matter!
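As a rough illustration of looking at what is actually between the commas, here is a minimal Python sketch that guesses a type for every cell and counts the guesses per column. The function names, the type labels, and the sample data are all made up for this example; real data will need rules of its own.

```python
import csv
import io
from collections import Counter


def guess_type(value: str) -> str:
    """Classify one raw CSV cell as 'empty', 'int', 'float', or 'text'."""
    value = value.strip()
    if not value:
        return "empty"
    for label, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return label
        except ValueError:
            continue
    return "text"


def profile_columns(handle) -> dict:
    """Count the guessed cell types per column of a CSV with a header row."""
    reader = csv.DictReader(handle)
    profile = {name: Counter() for name in reader.fieldnames or []}
    for row in reader:
        for name in profile:
            # Short rows yield None for missing cells; treat them as empty.
            profile[name][guess_type(row.get(name) or "")] += 1
    return profile


# Hypothetical sample data, invented for illustration only.
SAMPLE = "sensor_id,reading,unit\n17,20.5,C\n18,,C\n19,n/a,C\n"

if __name__ == "__main__":
    for column, counts in profile_columns(io.StringIO(SAMPLE)).items():
        print(column, dict(counts))
```

A column that is mostly numeric with a few stray text cells, like the hypothetical reading column here, is usually the first hint about how much cleaning the data will need.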
The Ugly Duckling paradox
No data processing system is based on one language alone. It’s always multiple languages that get the job done, and more often than not a lot of different systems too. You must never allow your code to become an ugly duckling, unable to communicate with anyone else.
Choose a language that is easily combined with what you already have or use. This will save you a lot of time and significantly reduce the number of errors that may occur. Any language that requires a special environment just for itself, needs a lot of maintenance in terms of library support, or is needy in any other way shouldn’t be your first choice. Of course, a needy language is sometimes justified by the content and context of the data, but you should always aim for a happy medium. Constantly switching environments, data formats, and libraries will inevitably leave you with an ugly duckling of a codebase.
Notes on the fridge
Your language of choice should say more with less. Keep it short and simple. At the same time, do more! And most important of all, feel comfortable writing it. You wouldn’t leave yourself a 40-page essay about grocery shopping on the fridge door. The agility that scripting languages provide often requires you, or worse, someone who didn’t write that particular code, to read it over and over again. Make sure that only required info is available whenever you look at the code. Do not, and I repeat, DO NOT let yourself drown in unnecessary lines of gibberish!
With great power comes great reusability!
Dynamic = Beautiful.
Go for a language that allows you to use a single script on as much of the data as you can, as many times as you can. Nothing should be hardcoded. Time is of the essence, and writing the same stuff over and over again leads nowhere. Simple modifications turn into bigger ones, the ways in which your code communicates with the rest of your system change, and every change needs to be tested, approved, and so on. Don’t do that to yourself.
Less scripting is almost always the better solution. The less code you have to maintain, the easier it is to stay in control of your data.
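One way to keep things out of the code is to pass them in as parameters. The following Python sketch, built on the standard argparse and csv modules, filters any CSV file by any column and value. The function names and option names are assumptions chosen for this example, not a prescribed interface.

```python
import argparse
import csv
import sys


def filter_rows(rows, column, wanted):
    """Yield only the rows whose `column` equals `wanted`."""
    for row in rows:
        if row.get(column) == wanted:
            yield row


def build_parser():
    """Describe every input as a parameter instead of hardcoding it."""
    parser = argparse.ArgumentParser(description="Keep CSV rows matching a value.")
    parser.add_argument("path", help="input CSV file with a header row")
    parser.add_argument("--column", required=True, help="column to test")
    parser.add_argument("--equals", required=True, help="value to keep")
    parser.add_argument("--delimiter", default=",", help="field separator")
    return parser


def main(argv=None):
    """Read the file named in the arguments and print matching rows as CSV."""
    args = build_parser().parse_args(argv)
    with open(args.path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=args.delimiter)
        writer = csv.DictWriter(
            sys.stdout, fieldnames=reader.fieldnames or [], delimiter=args.delimiter
        )
        writer.writeheader()
        writer.writerows(filter_rows(reader, args.column, args.equals))


# Only act when arguments were supplied, so importing this file is harmless.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Saved under a hypothetical name such as filter_csv.py, the same script could serve a file of orders today (python filter_csv.py orders.csv --column country --equals RS) and a semicolon-separated file of sensors tomorrow, without touching the code.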
I’m ready, let’s script!
After learning and accepting some principles of dirty data work, let’s dig into some languages!
Perl does an amazing job when working with massive text files, especially in the domain of regular expressions. Its regex engine is known for wasting little time backtracking on lines that don’t match a pattern, and the language has built-in tools for basically any text task that crosses your mind. That is exactly what it was designed for: text manipulation. However, Perl doesn’t support a lot of things you may be used to when coming from another language (e.g., enumerated types). It’s not the best language to choose if you wish to perform something more demanding on your data at the preparation stage.
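To make this kind of job concrete, here is a small sketch of regex-driven line filtering. It is written in Python rather than Perl purely to keep the examples in this article in one language, and the log format, the pattern, and the names are assumptions invented for illustration.

```python
import re
from collections import Counter

# Hypothetical log format: "<date> <time> <LEVEL> <message>".
# The pattern is anchored and compiled once, so non-matching lines fail fast.
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) \S+ (ERROR|WARN) (.*)$")


def count_problems(lines):
    """Count ERROR/WARN lines per (date, level); skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if match:
            date, level, _message = match.groups()
            counts[(date, level)] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE = [
    "2024-01-05 10:00:01 INFO service started",
    "2024-01-05 10:00:07 ERROR disk full",
    "2024-01-05 10:02:13 WARN retrying upload",
    "2024-01-06 08:15:00 ERROR disk full",
    "not a log line at all",
]

if __name__ == "__main__":
    for (date, level), n in sorted(count_problems(SAMPLE).items()):
        print(date, level, n)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, which keeps memory use flat even on very large files.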
Ruby is Perl’s little sister, most commonly found in web development. It has a similar syntax, similar possibilities and similar restrictions. However, Ruby opens itself to the world through its most popular framework, Ruby on Rails, which provides different data retrieval, publishing and packing options. It’s also popular among developers who use message brokers and streaming platforms such as RabbitMQ or Apache Kafka. Overall, a neat language to use if you have to stay on the web!
Python. Boy, where do I even start with this one? A beautiful language to write, very lightweight, and full of useful data weapons. Python can communicate with basically anything, from toasters and coffee makers all the way to the analysis software around the Large Hadron Collider. It has a library for everything: text manipulation, connections, requests, formatting, columnar storage processing, and so on. NumPy, SciPy, pandas and Matplotlib are just a few of the commonly used libraries for data manipulation and analysis. However, because of its significant indentation and dynamic typing, Python can be a dangerous thing for an inexperienced developer.
Lua may be a good place to start if you need good data description, secure parallelism, map/reduce functionality, or just want to keep things small and simple. It performs well both on a single machine and in a cluster. It also has very nice support for compressed data. Lua, however, can be difficult to maintain if you don’t have experience with it, and it often needs sorted data sets.
Shell scripting is still the most powerful way of piping your processing steps together. Often, you have to take an action that is dictated by the data you receive but doesn’t necessarily have anything to do with the data itself. Having the power to control the whole ecosystem with just a few characters can really make a difference in what you do. Most developers are familiar with it. It features a bunch of great tools such as grep (with all its variations), awk, sed, cut, join, paste, and many others, all of which provide you with a world of opportunities. Still, use it with caution: shell scripts are very hard to maintain, are not friendly to dynamic parameterization, and can be very dangerous for your environment if not used correctly.
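The piping idea itself is not tied to the shell. As a minimal sketch, the Python generators below roughly mirror a hypothetical pipeline such as grep ERROR app.log | cut -d' ' -f1 | sort | uniq -c. The function names are borrowed from the shell tools for readability only, and they imitate just a small part of what the real tools do.

```python
from collections import Counter


def grep(lines, needle):
    """Pass through only the lines containing `needle` (roughly `grep -F`)."""
    return (line for line in lines if needle in line)


def cut(lines, field, delimiter=" "):
    """Yield one 1-based field per line (roughly `cut -d' ' -fN`).

    Unlike the real tool, lines with too few fields are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) >= field:
            yield parts[field - 1]


def sort_uniq_count(values):
    """Return (value, count) pairs sorted by value (roughly `sort | uniq -c`)."""
    return sorted(Counter(values).items())


# Made-up log lines, for illustration only.
LOG = [
    "2024-01-05 ERROR disk full",
    "2024-01-05 INFO started",
    "2024-01-06 ERROR disk full",
    "2024-01-05 ERROR timeout",
]

if __name__ == "__main__":
    # Each stage consumes the previous one lazily, like processes joined by pipes.
    for day, errors in sort_uniq_count(cut(grep(LOG, "ERROR"), field=1)):
        print(errors, day)
```

The shell version is shorter, which is exactly its appeal. The trade-off is that each stage here is an ordinary function that can be parameterized and tested on its own.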
R. Know the math behind your data. R is an extremely powerful language for advanced data crunching. If your needs include anything more than basic math, R is often a natural choice. With its data frames, columnar support, and a large, coherent, integrated collection of intermediate tools for data analysis, R can do more than just move the data around: it can crunch it in the process. It’s widely used as a part of bigger, commercial reporting systems.
R is demanding in a lot of ways. Other programs typically need a bridge such as Rserve (a TCP/IP server) to use R in their work. It can be hard to maintain, it’s not very readable at times, and there’s a big chance that all the matrices will drive you crazy.
Many, many more…
The languages mentioned above are among the most popular in the data world. Of course, they’re not the only choices you have. There’s a bunch of scripting variations of non-scripting languages that may be more suitable for you. Every one of them has its pros and cons, and which one to use is up to you to decide based on your needs, knowledge, time and will.
So Long, and Thanks for All the Scripts!
Consider trying multiple languages to see which one best fits your data. Unleash your inner scripting beast with the language of your choice and take control of the whole system.