Optimization and When To Do It

Optimization is enticing: it gives an engineer a chance to apply some superior knowledge of the language they're writing in. But don't start optimizing unless you really need to.

In the early days of Hadoop, it was common for developers to store data entries as single lines in a file. Every map-reduce job that needed the data would re-parse the file each time it ran, converting ASCII data into usable numbers and words before filtering and aggregating.

Fast-forward a decade: streaming data pipelines have become crucial in tech stacks, and Hadoop now has several efficient means of storing and querying data. If the data sources cannot be updated, it's quite common for the conversion to happen as early as possible in the pipeline, often in a dedicated microservice. The rest of the pipeline can then consume the data in an understandable format, and the data can be pushed in its final form to the data lake.

Envision, if you will, you’re working on this new pipeline with your team of developers in your Java-stacked world:

* * *

Inspecting raw entries, it becomes evident that an entry is a list of key-value pairs: each key is joined to its value with a pipe character, and the pairs are separated by tabs.
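For illustration, a raw entry might look like this (field names invented, with \t standing in for the tab character):

userId|42\tcountry|US\tdevice|mobile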

The first step to converting this line to the final format is to parse each entry into a map of keys and values. The code review for parsing comes into your notification channel:

import java.util.HashMap;
import java.util.Map;

private static void parsePairAndAddToMap(final Map<String, String> parsedMap, final String pair) {
  // Split "key|value" on the pipe character.
  final String[] keyAndValue = pair.split("\\|");
  parsedMap.put(keyAndValue[0], keyAndValue[1]);
}

public static Map<String, String> splitStringToMap(final String string, final int expectedSize) {
  // Oversize the map up front to avoid rehashing as it fills.
  final Map<String, String> parsedMap = new HashMap<>(2 * expectedSize);
  for (final String pair : string.trim().split("\\t")) {
    parsePairAndAddToMap(parsedMap, pair);
  }
  return parsedMap;
}

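To see it in action, here's a hypothetical call using the same invented fields as above:

String entry = "userId|42\tcountry|US\tdevice|mobile";
Map<String, String> parsed = splitStringToMap(entry, 3);
// parsed now holds userId -> 42, country -> US, device -> mobile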
The tests pass, and the data can now be converted further into JSON, Avro, Protobuf, POJOs, or any number of other data formats. The solution is merged and deployed, with 1% of the data traffic flowing through the service to set up the new pipeline.

All systems are nominal. You sit back, smile, and high-five your coworkers on a job well done.

Now the ramping phase begins, and this microservice goes from converting just 3 million entries an hour to 300 million. Almost instantly, the graphs monitoring the pipeline show huge backpressure caused by this microservice. Conversion is simply taking too long.

In a panic, everyone asks you: what's the next step? Faster machines? Multi-thread it somehow? Change JVM settings? Deploy more instances? Rewrite it in Go?

The service already works, and scaling will hold for now, but deploying too many instances could blow the budget. Now is the time to make the service fast.

Java is interesting in what the compiler will optimize for you and what it will not. It turns out that String's split method compiles a new regular expression each time it's called, at runtime, not at compile time. If the program only needed it a few hundred times, the overall performance wouldn't be impacted to a discernible degree. But in this instance, where there's an explosive amount of data and surely more to come, it'd be far better if the regex were created once and reused over and over, especially since the incoming data format is unlikely to change.
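To a first approximation, every call behaves as if it were written like this (the JDK does have a fast path that skips compilation for some trivial patterns, but a two-character escape like "\\t" doesn't take it):

import java.util.regex.Pattern;

// Roughly what line.split("\\t") does on each invocation:
// compile the pattern from scratch, then split with it.
final String[] pairs = Pattern.compile("\\t").split(line);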

What'd be even better is if the algorithm skipped the second regex entirely and instead searched for the index of the pipe delimiter in each key-value pair to create substrings.

With just a few keystrokes, you change the conversion to this:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

private static final String MESSAGE_SPLIT_REGEX = "\\t";
private static final Pattern MESSAGE_PATTERN_FOR_SPLIT = Pattern.compile(MESSAGE_SPLIT_REGEX);

private static void parsePairAndAddToMap(final Map<String, String> parsedMap, final String pair) {
  // No regex at all here: find the pipe and slice around it.
  final int pipeIndex = pair.indexOf('|');
  final String key = pair.substring(0, pipeIndex);
  final String value = pair.substring(pipeIndex + 1);
  parsedMap.put(key, value);
}

public static Map<String, String> splitStringToMap(final String string, final int expectedSize) {
  final Map<String, String> parsedMap = new HashMap<>(2 * expectedSize);
  // The tab pattern is compiled once, at class-load time, and reused.
  for (final String pair : MESSAGE_PATTERN_FOR_SPLIT.split(string.trim())) {
    parsePairAndAddToMap(parsedMap, pair);
  }
  return parsedMap;
}

After running benchmarks between the two versions, you visualize the differences in performance for everyone to see:

[Charts: average parse time vs. number of pairs for the two implementations]

Num Pairs     String.split (s)   Precompiled Pattern (s)   Speed-up
10,000,000    3.731944495        0.645141412               5.78x
1,000,000     0.089111163        0.019189501               4.64x
100,000       0.003726146        0.001639750               2.27x
10,000        0.000392534        0.000158185               2.48x
1,000         0.000064426        0.000024319               2.65x
100           0.000038367        0.000012751               3.01x
10            0.000032541        0.000011158               2.92x
1             0.000026876        0.000011948               2.25x

DANG! Now that's a great improvement. In what is likely the worst-case scenario, an entry with 10 million pairs, the parsing time fell from roughly 3.7 seconds to 0.65 seconds: a 5.8x speed-up! Even at a more realistic ten pairs per entry, there's still a roughly 3x speed-up.

Your team quickly merges and deploys the update. The pipeline starts catching up, and eventually fewer instances will need to be deployed for the current load.
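For the curious, a comparison like this can be run with a small JMH harness along the lines below. This is a minimal sketch, assuming the two versions live in hypothetical OriginalParser and OptimizedParser classes:

import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
public class SplitBenchmark {

  @Param({"10", "10000", "10000000"})
  int numPairs;

  String line;

  @Setup
  public void buildLine() {
    // Synthesize one entry with numPairs tab-separated key|value pairs.
    final StringBuilder sb = new StringBuilder();
    for (int i = 0; i < numPairs; i++) {
      if (i > 0) sb.append('\t');
      sb.append("key").append(i).append('|').append("value").append(i);
    }
    line = sb.toString();
  }

  @Benchmark
  public Map<String, String> stringSplitVersion() {
    // Returning the map keeps the JIT from eliminating the work.
    return OriginalParser.splitStringToMap(line, numPairs);
  }

  @Benchmark
  public Map<String, String> precompiledPatternVersion() {
    return OptimizedParser.splitStringToMap(line, numPairs);
  }
}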

* * *

In that scenario, I’d say you saved the day. Well done!

I had a similar experience to this recently. The difference between the two versions isn't obvious, and the initial approach is completely understandable for any developer, especially one switching over from a high-level language like Python. The algorithm could nearly have been copied from one language to the other, with the obvious exception of syntax.

A word of caution: I would not recommend optimizing code in this manner, or this deep into a service, until after the initial deployment and test phases. The most important part of developing is creating something.

It's easy to start optimizing, and some people feel compelled to do it. But requirements frequently change and priorities shift. The time spent optimizing could have gone toward a new feature or a critical bug fix; even worse, the code you optimized may soon become obsolete, and all that time is wasted.

I find it best to get something working first and to file a low-priority ticket for anything I know could use optimization later. When a period of downtime comes along, a few of those tickets can be hammered out. Better yet, if a scaling problem does arise, there are already tickets describing how to speed the service up.

With the cloud and the ability to scale on demand, implementing features and fixing bugs are usually more important than optimizing every little line of code. Plus, when it's time to find cost savings, your team can knock out a few of those tickets before ever reaching for a profiler, saving the project and the company money. Now you're twice the hero! Everything is developed and running smoothly, and you're under budget. Time for happy hour!

So, my general advice, and a small twist on a commonly heard phrase:

Make it work, scale if needed, and then make it fast.

See the code on GitHub.
