Web Scale Analytics Reading List

Big data is taking the world by storm, and with it comes an explosion of new ideas and technologies looking to help us understand what this data is telling us. With VLDB 2012 under way, I decided to take another look at the literature to see what advances are out there as well as refresh on the classics. The result of this deep dive is the web scale analytics reading list below. The list is grouped at a high level into column oriented database solutions and online analytical processing (OLAP) solutions. Column oriented databases are by far the more powerful but also the more complicated to implement. As such, much of the work on column oriented databases is being done at companies that are building one as a product, the obvious exception being Google. OLAP systems, on the other hand, are less powerful but simpler to implement. For this reason we see a variety of companies rolling their own solutions in response to their growing analytics problems.

Column Oriented Databases

Online Analytical Processing

A Sequential I/O Reading List

Over the past year I’ve been collecting links on the growing trend of rooting out all random I/O in large scale distributed systems. It has always been apparent that rotational media suffer a random I/O penalty, but as bandwidth improvements continue and latency improvements languish, the difference between sequential and random I/O is becoming unbearable. SSDs have been hailed as a fix, and in large part they remove the pain of random reads, but at the cost of requiring sequential writes. And now, as it turns out, even RAM benefits from sequential I/O due to the increasingly complex structure of new and faster chips. In an effort to help others who might be stumbling into this area, here is a curated list of the best posts and papers I have found on the topic.

Rotational Media

Solid State Drives

Memory

Benchmarking More Seq Traversal Idioms in Scala

Last week I had the luxury of spending some quality time with YourKit and our production system at Localytics and was pleasantly surprised to see things humming right along. Most of our time was spent building collections and iterating over them, which got me thinking. What is the most efficient way to traverse a collection in Scala? After a quick trip to Google I had two blog posts in hand.

The first post was from way back in 2009. While its main focus was the impact of JVM options on Scala iteration, it pointed out that Vectors seem to be more performant than Lists for iteration. However, with the post being from 2009, I was curious whether this result still stood. The second post was from 2012 and benchmarked various ways of iterating over a List while applying two different transformations. The post didn’t benchmark Vector, but the author was particularly rigorous with his methodology, making use of Caliper and posting his code on GitHub. I highly recommend reviewing the post.

I forked the repo, added two more tests, ensured the server VM was being used and re-ran the tests.

Setup

Java

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-4ubuntu2)
OpenJDK Server VM (build 20.0-b12, mixed mode)

Scala

Scala code runner version 2.9.1 -- Copyright 2002-2011, LAMP/EPFL

OS

Ubuntu 12.04

CPU

Dual Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (EC2 c1.medium)

Code

Functional Vector

val wLength = wordsIndexedSeq.map( _.length )
val wCaps = wordsIndexedSeq.map( isCapitalized )
(wLength, wCaps)

Builder

val n = wordsList.length
val wLength = List.newBuilder[Int]
val wCaps = List.newBuilder[Boolean]
for( word <- wordsList ) {
  wLength += word.length
  wCaps += isCapitalized(word)
}
( wLength.result(), wCaps.result() )
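
The OldSchool baseline that the results are measured against is not reproduced in this post. As a rough sketch only, arrays and a while loop applied to the same two transformations might look like the following. The object name, the method name and this version of isCapitalized are stand-ins I am assuming for illustration, not the benchmark's actual code.

```scala
object OldSchoolSketch {
  // Hypothetical stand-in for the benchmark's helper of the same name.
  def isCapitalized(word: String): Boolean =
    word.nonEmpty && Character.isUpperCase(word.charAt(0))

  // Pre-sized arrays filled by an index-driven while loop:
  // no closures, no builders, no intermediate collections.
  def lengthsAndCaps(words: Array[String]): (Array[Int], Array[Boolean]) = {
    val n = words.length
    val wLength = new Array[Int](n)
    val wCaps = new Array[Boolean](n)
    var i = 0
    while (i < n) {
      val word = words(i)
      wLength(i) = word.length
      wCaps(i) = isCapitalized(word)
      i += 1
    }
    (wLength, wCaps)
  }
}
```

Both output arrays are allocated once at their final size, which is a large part of why this style is hard to beat.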

Results

Each style of iteration was tested with four different sized collections. The graph shows how many times slower each style was versus the OldSchool style (i.e. arrays and while loops).

[Graph: slowdown of each iteration style relative to OldSchool, by collection size]

Conclusions

  • Nothing beats arrays and while loops (i.e. the OldSchool solution)
  • Vector beats out List
  • Vector interestingly gets better with more items
  • In contrast to the original post none of the List optimizations seem like a clear win to me

Decoding strings with an unknown encoding

We’ve all been there: the system is humming along when you bring up the UI only to see “Espa?ol” staring back at you. What is this? Why is there a question mark in there? Well, you’ve been bitten by the mystical world of Unicode. Usually this issue is an easy fix: always use UTF-8 encoding. However, what do you do if you don’t own the whole code path? What if you have to support poorly written third party client libraries? What if you have client libraries in the wild that will never be updated? What if these client libraries outright lie about what encoding they are using? Follow me and we’ll find out.

Bytes To String

On the JVM there are two standard ways of converting bytes to Strings: new String(bytes[], encoding) and the NIO Charset. Since they both accomplish the same feat, my decision came down to performance. Luckily someone else did the heavy lifting and figured out that new String(bytes[], encoding) ekes out a small win over NIO Charset. However, this option poses a challenge. How do we know if the decoding succeeded? If it encounters bytes that cannot be decoded, the NIO option can either throw an Exception (effective but slow) or insert a Unicode Replacement Character (easy to find with a String.contains()). I assumed that new String(bytes[], encoding) does no such thing (but see the Update at the end of this post), and that it blindly decodes characters and outputs random garbage in the resulting string if the decoding chokes on some bad byte values. We need a way to find those garbage characters.
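
To make the NIO route concrete, here is a minimal sketch of decoding with a Charset and then checking for the replacement character. The names NioDecodeSketch, decode and hasReplacement are mine, chosen for illustration. The sample bytes are “Español” encoded as ISO-8859-1, which is not valid UTF-8.

```scala
import java.nio.ByteBuffer
import java.nio.charset.Charset

object NioDecodeSketch {
  val ReplacementChar = '\uFFFD'

  // Charset.decode substitutes the replacement character for malformed
  // input instead of throwing an exception.
  def decode(bytes: Array[Byte], encoding: String): String =
    Charset.forName(encoding).decode(ByteBuffer.wrap(bytes)).toString

  // True when the decoder had to substitute at least one character.
  def hasReplacement(s: String): Boolean = s.indexOf(ReplacementChar) >= 0

  // The n-with-tilde becomes the single byte 0xF1 in ISO-8859-1;
  // on its own that byte is a malformed UTF-8 sequence.
  val latin1Bytes: Array[Byte] = "Espa\u00F1ol".getBytes("ISO-8859-1")
}
```

Decoding latin1Bytes as "ISO-8859-1" round-trips cleanly, while decoding the same bytes as "UTF-8" yields a string for which hasReplacement returns true.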

Regex for non unicode characters

The clue that got me on the right track was an odd looking pattern in the javadoc for Pattern.

[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)

It seemed that \p{L} was some magical regex incantation for any Unicode letter, and after some additional searching it appeared that this was exactly the case. Of course what we really want to find are characters that are not Unicode letters, spaces, punctuation or digits. Lucky for us there are matching character classes for all of these, leading us to this regex:

[^\p{L}\p{Space}\p{Punct}\p{Digit}]
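
As a sketch of how that pattern could be applied, the snippet below compiles it once and reports whether a string contains any character outside the allowed classes. GarbageRegexSketch and looksGarbled are hypothetical names used only for this example.

```scala
import java.util.regex.Pattern

object GarbageRegexSketch {
  // Matches any single character that is not a Unicode letter, whitespace,
  // punctuation or a digit. Compiled once because compilation is costly.
  val Garbage: Pattern =
    Pattern.compile("[^\\p{L}\\p{Space}\\p{Punct}\\p{Digit}]")

  // True when at least one character falls outside the allowed classes.
  def looksGarbled(s: String): Boolean = Garbage.matcher(s).find()
}
```

One caveat worth keeping in mind: by default \p{Space}, \p{Punct} and \p{Digit} are ASCII-only POSIX classes in java.util.regex, so legitimate non-ASCII punctuation such as curly quotes would also be flagged by this check.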

Performance Testing

Since performance is of particular concern, it was important to test the overhead of this regex check. I fired up the Scala REPL and tested new String(bytes[], encoding) with the above regex against NIO decoding with String.contains() to check for the replacement character. After all that work it turns out that the regex was significantly more expensive than String.contains(), so much so that the NIO code was about 2x as fast. So, in the end, I went with the simpler NIO option.

The Code

import io.Codec
import java.nio.ByteBuffer

val UTF8 = "UTF-8"
val ISO8859 = "ISO-8859-1"
val REPLACEMENT_CHAR = '\uFFFD'

def bytesToString(bytes: Array[Byte], encoding: String) = {
  val upper = encoding.toUpperCase
  val codec = if (ISO8859 == upper) Codec.ISO8859 else Codec.UTF8
  val decoded = codec.decode(ByteBuffer.wrap(bytes)).toString
  if (!decoded.contains(REPLACEMENT_CHAR)) {
    decoded
  } else {
    val otherCodec = if (ISO8859 == upper) Codec.UTF8 else Codec.ISO8859
    val otherDecoded = otherCodec.decode(ByteBuffer.wrap(bytes)).toString
    if (!otherDecoded.contains(REPLACEMENT_CHAR)) {
      otherDecoded
    } else {
      val utf8 = if (ISO8859 == upper) otherDecoded else decoded
      utf8.replace(REPLACEMENT_CHAR, '?')
    }
  }
}

Update

Thanks to Joni Salonen for pointing out that I was wrong about new String() not inserting the Unicode replacement character. In light of that info, the following code is just a touch faster.

val UTF8 = "UTF-8"
val ISO8859 = "ISO-8859-1"
val REPLACEMENT_CHAR = '\uFFFD'

// Decode with the claimed encoding first and fall back to the other one.
// If both attempts contain replacement characters, return the UTF-8
// result with each replacement character swapped for '?'.
def bytesToString(bytes: Array[Byte], encoding: String) = {
  val upper = encoding.toUpperCase
  val firstEncoding = if (ISO8859 == upper) ISO8859 else UTF8
  val firstDecoded = new String(bytes, firstEncoding)
  if (!firstDecoded.contains(REPLACEMENT_CHAR)) {
    firstDecoded
  } else {
    val secondEncoding = if (ISO8859 == upper) UTF8 else ISO8859
    val secondDecoded = new String(bytes, secondEncoding)
    if (!secondDecoded.contains(REPLACEMENT_CHAR)) {
      secondDecoded
    } else {
      val utf8 = if (ISO8859 == upper) secondDecoded else firstDecoded
      utf8.replace(REPLACEMENT_CHAR, '?')
    }
  }
}
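
The behavior described in the update is easy to check for yourself. The sketch below decodes ISO-8859-1 bytes as UTF-8 through the String constructor. StringCtorCheck is a hypothetical name, and the result reflects the OpenJDK behavior I am aware of, since the Javadoc leaves the outcome for invalid bytes unspecified for this constructor.

```scala
object StringCtorCheck {
  // "Español" as ISO-8859-1 bytes; the lone 0xF1 byte is malformed UTF-8.
  val latin1Bytes: Array[Byte] = "Espa\u00F1ol".getBytes("ISO-8859-1")

  // Decode the same bytes with the wrong charset via the String constructor.
  def decodedAsUtf8: String = new String(latin1Bytes, "UTF-8")

  // True when the constructor substituted the replacement character.
  def insertedReplacement: Boolean = decodedAsUtf8.indexOf('\uFFFD') >= 0
}
```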

Java Heaps and Garbage Collection with some Zazz

Here is a presentation I put together about Java heaps and garbage collection to go into some more detail than just raising the heap size. I put in links to some great websites and blog posts that are fantastic reads if you are trying to tune your app’s garbage collector.

The best part though is the “wicked awesome” presentation tool Prezi. It is an infinitely zoomable moving twisting camera landscape thingy . . . it’s kinda hard to describe. But it makes PowerPoint look like two cups with a string between them. Just check it out.

You can view it fullscreen, and Prezi lets you copy presentations and modify them. Here is the direct link to the presentation above if you want to copy it. You can download a self-contained offline viewer of a presentation you own (or one you’ve copied).

Software Development is not Just Coding

Throughout my career as a startup software developer I have constantly come across fellow developers who seem to have a confused concept of our common profession. Specifically, they believe coding to be the single activity which comprises software development. Now it’s not hard to see how this misconception comes about. Too frequently developers are interrupted by frivolous meetings, causing a backlash against anything that hints at time away from the desk. Moreover, at the end of the day, the top priority is getting features out the door, and any time spent not coding smells of time lost. However, this narrow view of software development is holding us back and cramping our productivity. Specifically, there are three realms in which we need to spend more time up front so we can run faster overall.

Quality assurance and the rise of unit tests

I have yet to meet a startup that had enough resources for a QA team, and rightly so. The rise of unit tests and continuous integration tools throws into doubt the benefit of a dedicated QA team in all but the most extreme cases. However, I still see far too many developers who pay lip service to the value of tests and then, in the name of speed, proceed to code without any. What they are missing is the insane productivity gains to be found in having a suite of tests running against every checkin. Being able to refactor and hack away without needing weeks of subsequent manual testing or months of sleepless nights fixing live bugs is something that is missed by many, particularly in the startup world, and its absence is a huge drain on productivity.

Ops becomes DevOps

More recently we are beginning to see the influence of tools affecting the operations side as well. These days, through the power of Amazon Web Services, Rackspace, and others, developers no longer need to rely on in-house operations for their hardware needs. However, operations provided more services than simply assembling servers and connecting them to a network. Developers are now in the position where they must take on the responsibility of deploying and monitoring their software. Fortunately, hardware is something no startup can run without so, unlike QA, understanding the necessity of AWS or Rackspace is a no-brainer. Unfortunately, understanding the need to spend time setting up monitoring and automating deployment is still a work in progress. Too often clients and customers are the first ones to discover a service outage, and deploying more servers takes days and not hours. The seemingly never-ending march of minor crises that is operations can destroy productivity if some judicious work and planning is not done up front.

What are we doing and when will we get there?

Finally we come to the most important and also the most contentious issue: planning. Many developers approach any sort of planning with apprehension if not outright disgust. However, these are the same developers that end up taking days if not weeks longer to launch a product or wrap up a milestone because there was not enough coordination and thought ahead of time to plan the multitude of steps that go into such an event. At a higher level I have seen whole companies spin while different developers over-engineer and over-refine products based on vague requirements. Fewer meetings produce a jump in coding “productivity” and yet a serious drop in output, as little is delivered and what is delivered is inevitably not what anyone wanted. If we want to step up our game and get things done we need to stop shying away from all process and begin adopting the right processes. There are any number of options (Scrum, Lean, Kanban, etc.) to learn and borrow from. Make your own process. Start with the simplest thing that works. Add (or remove!) when something isn’t working. Whatever it might be, we need to move beyond the “all process is evil” mentality. Yes, there will be meetings, but you will shock yourself at how much faster you are in the end.

Twitter > RSS: How to keep up with developers

Times are a Changing

I remember walking into a team meeting my first day on the job and overhearing my fellow blogger explaining the RSS feed setup that he used to keep up with developer news. At the time, this was a new idea to a majority of the team, yet here we are, a short four years later, and the landscape is changing again. Today, if you aren’t on Twitter you are missing the conversation. While RSS remains an important tool in the struggle to remain on top of our discipline, the latest news, conversation, and ideas are happening on Twitter.

Open Source

We all know that open source and documentation have a sordid affair, and this extends just as readily to blogs. If you are lucky there is a business selling support for the project and they have a blog, but there is hardly ever a blog for any Apache project, for instance. In stark contrast, many committers are on Twitter and talking about what they love, namely their project. Take Apache ActiveMQ, a community that I’m fairly familiar with. There is no blog. If you do some digging you can find a few developers’ blogs, but these long form posts are few and far between. However, if you join in the conversation over on Twitter you are privy to up-to-the-minute information about decisions, upcoming features, the latest tips and tricks, interesting news, and, not to forget, a willing group of guys to answer questions and help you along in using their product.

Developer Trends

There are two trends that I’m following closely in the development world, namely NoSQL and DevOps. Both of these topics are changing and evolving on a daily basis, and the only way to follow the conversation is via Twitter. Blog posts are often the topic of conversation, but to really be part of the conversation you have to join in where the people are. The DevOps crowd is something I’m only starting to get into, but @mmarschall and @danackerson were nice enough to put together a great list of the who’s who. In the NoSQL camp Mashable has a good list to get you started. However, that’s not the end of it. There are people talking about Scala, pontificating on craftsmanship, and more.

Go Get ‘em

The only way to keep up with the conversation is to join in, so sign up or start following, share what you are reading, respond to others’ ideas, and while you’re at it, check me out @bdarfler.
