
Tips and tricks for fast handling of vast amounts of data

If your line of work includes working with data, chances are you already know how cruel the world can be when it comes to delivering it. If people could send you sticky notes with their phone numbers, email addresses, and favorite colors, they would. And if they could watch you trying to extract anything valuable from their scribbled mess, the pleasure of torturing you would bring them utter happiness.

Ok, that was a bit too much, but the fact is – we data guys don’t have an easy time of it. It’s highly unlikely that the data you get will be structured or organized, and in some not-so-rare cases it won’t even have a meaning to begin with! It’s up to you to give your data just that: meaning. Not only that, but also to give it a higher purpose – to make it a decent starting point for your analysis. Today, everything is about delivering fast insight. It won’t be easy, but you have to be confident you can pull it off: hold off on that fifth coffee today and work through it.

Sounds reasonable, unless... there are gigabytes of it.

HOUSTON, WE HAVE A STRAY CHARACTER!

The majority of data is sent or retrieved as plain text, and the terminal doesn’t care how big that CSV file is. The data often needs special treatment, too, and here the terminal is your friend. One serious disclaimer: I’m going to be totally subjective and use bash. For me, it’s the most comfortable place to start, but feel free to consider an alternative if it makes you feel better. The dataset I’m going to use throughout this post is the Sales Transactions dataset, which can be obtained here.

Let’s take a peek into some data now.

$ head -10 textfileofdoom.csv

Head is the command I always use first on a dataset. It gives me a good idea of what my data should look like. Of course, not every file has to be consistent, but most of them are.

Head defaults to 10 lines, so even without the -10 flag it will output the first 10 lines of your file. The same goes for tail, which outputs the last 10 lines.
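If you want to peek at the end of the file instead, tail works the same way:

$ tail -n 10 textfileofdoom.csv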

Now you may ask yourself, how many lines are there?

$ wc -l textfileofdoom.csv
And we have our answer.

Maybe not all of those lines contain the intel you’re searching for, so let’s pick out just the lines of interest.

Let’s say I’m most interested in transactions made by a Visa credit card. Filtering those shouldn’t be a problem:

$ grep "Visa" textfileofdoom.csv

This leaves me with only those lines that have the string “Visa” somewhere in them. Of course, this can be misleading, and I may want only the lines where a whole column equals “Visa”, so the pattern needs to be a bit more precise:

$ grep ",Visa," textfileofdoom.csv

The search is now limited to those lines which contain the string ,Visa, itself. This should help get rid of descriptions and misleading columns. But what if I want to search only a certain column? AWK is an awesome and very powerful tool for that, among other things. I can specify a delimiter, the column I want to inspect, and the pattern I’m searching for.

$ awk -F, '$4=="Visa"' textfileofdoom.csv

Now that we have some decent row filtering, I may be interested in just a couple of columns. I could extract one to begin with using cut:

$ cut -d"," -f 7 textfileofdoom.csv

...which gives me the desired column. I could even extract multiple columns:

$ cut -d"," -f 2,7 textfileofdoom.csv

...and maybe reorder them a bit using bash piping to propagate the output of the first command to the second one:

$ cut -d"," -f 2,7 textfileofdoom.csv | awk -F, '{print $2","$1}'

Looks like we’re getting somewhere! If we assume that the timestamp can act as a primary key of the dataset, I could now go and sort the file on it. But I’ve rearranged the columns so that the timestamp is no longer the first one, and a plain sort wouldn’t get me where I need to be (I’ll ignore the fact that it would sort using the whole line as the value).

$ cut -d"," -f 2,7 textfileofdoom.csv | awk -F, '{print $2","$1}' | sort -t, -nk2

Now I have my data sorted on the second column. Sort rearranges the lines based on whichever column we point it at, but that’s something we’re already familiar with and can easily change. Let’s say we now want to grab the lines from the original file that match our criteria. This can be done using standard bash looping and some commands we’re already comfortable with:

$ cut -d"," -f 1,7 textfileofdoom.csv | sort -t, -nk2 | awk -F"," '{print $1}' | while read line; do grep -m1 "$line" textfileofdoom.csv ; done

We now have our dataset sorted the way we like, filtered the way we like. But what if we noticed some bad data in the process and want to fix that before we continue? Sed comes to the rescue!

$ cut -d"," -f 1,7 textfileofdoom.csv | sort -t, -nk2 | awk -F"," '{print $1}' | while read line; do grep -m1 "$line" textfileofdoom.csv ; done | sed -E 's/Product([0-9]+)/Product \1/g'

I’ve pretty-printed the second column and instead of “ProductN” got “Product N”. The same principle, but with more complex RegEx, could be used to modify pretty much anything.
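For instance, a small variation could trim stray spaces around the delimiters (purely illustrative; your file may not need it):

$ sed -e 's/ *, */,/g' textfileofdoom.csv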

Maybe I’m now satisfied with what I have and want to store it somewhere. Let’s use output streaming!

$ cut -d"," -f 1,7 textfileofdoom.csv | sort -t, -nk2 | awk -F"," '{print $1}' | while read line; do grep -m1 "$line" textfileofdoom.csv ; done | sed -E 's/Product([0-9]+)/Product \1/g' > prettyoutputfile.csv

A simple head on my new file now gives me something easier to work with.

I have to admit, I’ve done a little bit of cheating here. Turns out that the timestamp is not a primary key, and the grep I used to pull them out gave me mixed results. So I decided to just remove those “Mastercard” lines with some more grep-ing!

$ grep -v "Mastercard" prettyoutputfile.csv
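
Or, if you’d rather keep everything in a single pass, the same filter can be spliced straight into the one-liner before the redirect:

$ cut -d"," -f 1,7 textfileofdoom.csv | sort -t, -nk2 | awk -F"," '{print $1}' | while read line; do grep -m1 "$line" textfileofdoom.csv ; done | grep -v "Mastercard" | sed -E 's/Product([0-9]+)/Product \1/g' > prettyoutputfile.csv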

Now I have a dataset I’m happy with, which is ready to be processed by other parts of my system. Thanks to a one-liner.

BUT WHAT IF MY FILE IS NOT A CSV FILE?

This often (but not necessarily) calls for a higher-level programming language. It could still be done using bash, but you have to keep one thing in mind: the more complex the expressions you use, the more error-prone your work will be.

A lot of programming languages offer excellent libraries for working with data in various forms, so why waste time on mind-bending bash scripts here when we can do something faster?

My preferred language for these actions is Python. It’s easy to write, elegant but powerful, and most importantly – fast. But feel free to choose a language you’re most comfortable with.

Let’s explore some JSON data using Python! In this example, I’ll use some free JSON sample data (hey, I love free stuff, give me a break!), publicly available here. Go give it a look!

[Screenshot: a short Python snippet that loads and explores the JSON sample data]

Yes, it’s an image. Don’t be lazy: type it out!
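If you really can’t be bothered to type from a picture, here’s a minimal sketch of that kind of exploration, assuming you’ve saved the downloaded sample locally as example.json (the filename is just a placeholder):

import json

# example.json is a placeholder name for wherever you saved the sample data.
with open("example.json") as f:
    data = json.load(f)

# See what we're dealing with: a single object or a list of records?
print(type(data))

# If it's a list, peek at the first record and its fields,
# much like head does for a CSV file.
if isinstance(data, list):
    print(len(data), "records")
    print(data[0])
    print(list(data[0].keys()))
else:
    print(list(data.keys()))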

Of course, you can do waaay more than this short example showed you. Using some simple looping, you could process that bulk of data in no time!
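As a sketch of what such a loop might look like, building on the snippet above (the “name” field is purely a stand-in for whatever keys your JSON actually has):

from collections import Counter

# Assuming data is a list of record dicts, as loaded above.
# Tally how often each value of some field appears across the records;
# "name" is a made-up key, so swap in whatever your data really contains.
counts = Counter(record.get("name") for record in data)
print(counts.most_common(5))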

There’s a bunch of libraries designed to help you with your data in various formats: XML libraries, Avro and Parquet readers, even PDF and image parsers. With some research, it should be easy to find the right library for the right job and get that data up and running in no time!
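To take XML as one example, Python ships a parser in its standard library; a minimal sketch, assuming your data sits in a local file called records.xml (a made-up name):

import xml.etree.ElementTree as ET

# records.xml is just a placeholder for whatever XML file you have.
tree = ET.parse("records.xml")
root = tree.getroot()

# Walk the top-level elements, much like a quick head on a CSV file.
for child in root:
    print(child.tag, child.attrib)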

WHY?

With data, you need to be perfect. Mistakes are not allowed, and when they do happen, it takes a lot to fix them. Due to the ever-growing impatience of your superiors, you also need to be quick. Or smart.

You need to buy yourself time to work on your data (or drink that fifth coffee). Don’t be afraid to use simple tools designed to make your life easier, or to cut corners that replace hours of manual work with just a couple of characters.

Move your data back and forth, tumble it around, chew it and spit it out clean, organized and meaningful. Simple operations, like the ones I’ve used above, are lighter on your system than any external tool you might want to use.

Don’t panic when the data gets big: organize it so that you can easily interrogate it, and the world should be a happy place once again!

All the best in your future endeavors, fellow miner!
