Rotating Videos and Preserving Metadata

So this is a pet peeve that’s been biting me for some time.

Often when you take a picture with your phone or digital camera, the camera’s rotation sensor (gyroscope?) gets it wrong and you end up with a picture that is sideways (i.e., rotated +/- 90 degrees).
Sure enough, you can use your phone’s photo library app to rotate it and job done, sanity restored.

However, the really, really annoying thing is when your camera gets the orientation of a video wrong. Rotating a video is just not something that the photo library offers off the shelf. That’s because rotating a video involves actually re-encoding it from scratch. Some nice video players (e.g., VLC) allow you to rotate while playing the video, but this is fiddly and hard to remember to do every time you play the video.

So in any case, here are a few bash lines that I cobbled together to rotate a batch of videos while trying to preserve as much of the metadata as possible.

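# Rotate every .mov in the current directory 90 degrees clockwise (transpose=1),
# then copy the camera tags across with exiftool and print the resulting metadata.
for i in *.mov; do
  IN="$i" OUT="${IN}.m4v" && \
  ffmpeg -i "${IN}" -c:a copy -map_metadata 0:s:0  -vf "transpose=1"  -c:v libx264 -crf 23 -preset medium "${OUT}" && \
  exiftool -tagsfromfile "${IN}" -makernotes -make -model "${OUT}" && \
  exiftool "${OUT}" && \
  rm *_original
done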

NOTE: To run these you will need to have ffmpeg and exiftool installed on your system.
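
If your file names contain spaces, it can help to preview the exact commands before running anything. Here is a minimal Python 3 sketch that only builds and prints the same ffmpeg and exiftool invocations as the loop above. The function name build_commands is my own, and nothing is executed.

```python
import shlex
from pathlib import Path
from typing import List


def build_commands(src: Path) -> List[List[str]]:
    """Return the ffmpeg and exiftool invocations for one input file."""
    # Same naming as the bash loop: foo.mov becomes foo.mov.m4v
    out = src.with_name(src.name + ".m4v")
    ffmpeg = [
        "ffmpeg", "-i", str(src),
        "-c:a", "copy",
        "-map_metadata", "0:s:0",
        "-vf", "transpose=1",
        "-c:v", "libx264", "-crf", "23", "-preset", "medium",
        str(out),
    ]
    exiftool = [
        "exiftool", "-tagsfromfile", str(src),
        "-makernotes", "-make", "-model", str(out),
    ]
    return [ffmpeg, exiftool]


def preview(directory: Path) -> None:
    """Print shell-quoted commands for every .mov file in the directory."""
    for movie in sorted(directory.glob("*.mov")):
        for cmd in build_commands(movie):
            print(shlex.join(cmd))
```

Calling preview(Path(".")) prints one quoted command per line, which you can eyeball and then paste into a shell.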


Bash Shell $PS1 Configuration

After a recent talk with Jim Meyering I’ve decided to finally tidy up my .bashrc and all my .dot_files in general.

So, the first and most important change: track all your .dot_files in some form of version control system. Right now I’m using Mercurial hosted on BitBucket. Git is also a great choice. Go with whatever you are comfortable with, just make sure you don’t lose your precious configs and that you can effortlessly synchronise all your unix/linux/bsd boxes.

The other big takeaway from this talk was to make sure your $PS1 shell prompt gives you the right information. Two key things that are absolute gold to have:

  1. The branch/bookmark you are currently in if you are in a VCS directory.
  2. The exit code of the previous command if it returned an error (different from 0).

Here’s what my current $PS1 looks like:

.bashrc $PS1

And this is my current .bashrc $PS1 configuration:

# Decorate $PS1
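function __get_vcs {
  # Walk up from the current directory until a repository marker or / is found.
  local path=`pwd`
  while true; do
    if [[ -d "${path}/.hg" ]]; then
      echo "mercurial"
      return
    elif [[ -d "${path}/.git" ]]; then
      echo "git"
      return
    elif [[ "${path}" = "/" ]]; then
      echo "none"
      return
    fi
    path=`cd "${path}/../" && pwd`
  done
}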

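function __get_vcs_branch {
  # Print " (hg:branch)" or " (git:branch)", or nothing outside a repository.
  local vcs=`__get_vcs`
  local out=""
  case "$vcs" in
    mercurial)
      out=`hg id -b`
      out=" (hg:${out})"
      ;;
    git)
      out=`git rev-parse --abbrev-ref HEAD`
      out=" (git:${out})"
      ;;
  esac
  echo "$out"
}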

function __get_exit_code {
  # Print "[code] " only when the previous command failed.
  local code="$?"
  local msg=''
  if [ $code != 0 ]; then
    msg="[${code}] "
  fi
  echo "$msg"
}

export PS1="${red}\$(__get_exit_code)${blue}\t ${green}\W${purple}\$(__get_vcs_branch)${blue} \$${black} "
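
For anyone who finds the directory walk in __get_vcs hard to follow in bash, here is the same idea as a minimal Python 3 sketch. The name find_vcs is mine, and this is only an illustration of the logic, not something to call from your prompt.

```python
from pathlib import Path


def find_vcs(start: Path) -> str:
    """Walk from start up to the filesystem root looking for .hg or .git."""
    current = start.resolve()
    while True:
        if (current / ".hg").is_dir():
            return "mercurial"
        if (current / ".git").is_dir():
            return "git"
        if current.parent == current:
            # The root directory is its own parent, so there is nowhere left to go.
            return "none"
        current = current.parent
```

The closest marker wins, so a git checkout nested inside a Mercurial repository reports "git", exactly as the bash version does.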

Finally, here are some other things that could be interesting to display:

  • \u – Username
  • \h – Hostname
  • \w – Full path of the current working directory

PS: Word of advice, it’s very easy to get carried away and try to cram all the information in the world into your $PS1. Up to you what you value the most.

Hadoop/Hive – Writing a Custom SerDe (Part 1)

(Special thanks to Denny Lee for reviewing this)

Apache Hadoop is an open source project composed of several software solutions for distributed computing. One interesting component is Apache Hive, which lets one leverage the MapReduce powers of Hadoop through the simple interface of the HiveQL language. This language is basically a SQL lookalike that triggers MapReduce jobs for operations that are highly distributable over huge datasets.

A common hurdle once a company decides to use Hadoop and Hive is: “How do we make Hadoop understand our data formats?” This is where the Hadoop SerDe terminology kicks in. SerDe is nothing but a short form for Serialization/Deserialization. Hadoop makes available quite a few Java interfaces in its API to allow users to write their very own data format readers and writers.

Step by step, one can make Hadoop and Hive understand new data formats by:
1) Writing format readers and writers in Java that call Hadoop APIs.
2) Packaging all code in a Java library, e.g., MySerDe.jar.
3) Adding the jar to the Hadoop installation and configuration files.
4) Creating Hive tables and explicitly setting the input format, the output format and the row format.

Before diving into the Hadoop API and Java code it’s important to explain what really needs to be implemented. To keep the terminology consistent, I shall refer to a row as the individual unit of information that will be processed. In the case of good old SQL databases this indeed maps to a table row. However, our data source can be something as simple as Apache logs. In that case, a row would be a single log line. Other storage types might use complex message formats like Protobuf Messages, Thrift Structs, etc… For any of these, think of the top-level struct as our row. What’s common between them all is that inside each row there will be sub-fields (columns), and those will have specific types like integer, string, double, map, …

So going back to our SerDe implementation, the first thing that will be required is the row deserializer/serializer (RowSerDe). This Java class will be in charge of mapping our row structure into Hive’s row structure. Let’s say each of our rows corresponds to a Java class (ExampleCustomRow) with the following three fields:

  1. int id;
  2. string description;
  3. byte[] payload;

The RowSerDe should be able to mirror this row class and its properties into Hive’s ObjectInspector interface. For each of our types it’ll find and return the equivalent type in the Hive API. Here’s the output of our RowSerDe for this example:

  1. int id -> JavaIntObjectInspector
  2. string description -> JavaStringObjectInspector
  3. byte[] payload -> JavaBinaryObjectInspector
  4. class ExampleCustomRow -> StructObjectInspector

In the example above, the row structure is very flat, but for cases where our class contains other classes and so forth, the RowSerDe needs to be able to recursively reflect the whole structure into Hive API objects.
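
To make the recursion concrete, here is a minimal Python 3 sketch of the idea. It is not the Hive API: the inspector names are just the strings from the list above, and describe and the Envelope class are hypothetical names of mine. A real RowSerDe would do the same walk in Java and return actual ObjectInspector instances.

```python
from dataclasses import dataclass, fields, is_dataclass

# Inspector names copied from the list above; plain strings, not Hive objects.
PRIMITIVE_INSPECTORS = {
    int: "JavaIntObjectInspector",
    str: "JavaStringObjectInspector",
    bytes: "JavaBinaryObjectInspector",
}


def describe(row_type):
    """Recursively map a row type to a nested description of its inspectors."""
    if row_type in PRIMITIVE_INSPECTORS:
        return PRIMITIVE_INSPECTORS[row_type]
    if is_dataclass(row_type):
        # A struct is described by the inspectors of each of its fields.
        return {
            "StructObjectInspector": {
                f.name: describe(f.type) for f in fields(row_type)
            }
        }
    raise TypeError("no inspector known for %r" % (row_type,))


@dataclass
class ExampleCustomRow:
    id: int
    description: str
    payload: bytes


@dataclass
class Envelope:
    # Hypothetical wrapper, only here to show a struct nested inside a struct.
    sequence: int
    row: ExampleCustomRow
```

Calling describe(Envelope) returns a struct description whose "row" entry is itself the struct description of ExampleCustomRow.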

Once we have a way of mapping our rows into Hadoop rows, we need to provide a way for Hadoop to read our files or databases that contain multiple rows and extract them one by one. This is done via the Input and Output format APIs. A simple format for storing multiple rows in a file would be separating them by newline characters (like comma separated files do). An Input reader in this case would need to know how to read a byte stream and single out byte arrays of individual lines that would later be fed into our custom SerDe class.

As you can probably imagine by now, the Output writer needs to do exactly the opposite: it receives the bytes that correspond to each line and it knows how to append them and separate them (by newline characters) in the output byte stream.
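
As a rough illustration of that reader/writer pair, here is a minimal Python 3 sketch for the newline-separated format. The function names read_rows and write_rows are mine and this has nothing to do with the actual Hadoop interfaces; it only shows the two directions of the byte-stream handling.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def read_rows(stream: BinaryIO) -> Iterator[bytes]:
    """Yield one row per line of the stream, with the trailing newline removed."""
    for line in stream:
        yield line.rstrip(b"\n")


def write_rows(stream: BinaryIO, rows: Iterable[bytes]) -> None:
    """Append each row to the stream, terminated by a newline."""
    for row in rows:
        stream.write(row + b"\n")


def round_trip(rows):
    """Write the rows to an in-memory stream and read them back."""
    buffer = io.BytesIO()
    write_rows(buffer, rows)
    buffer.seek(0)
    return list(read_rows(buffer))
```

Note that this simple format breaks as soon as a row itself contains a newline byte, which is one reason binary formats need their own reader and writer.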

How a Hadoop MapReduce job interacts with a custom SerDe for Hive.

Summarizing, in order to implement a complete SerDe one needs to implement:
1) The Hive SerDe interface (contains both the Serializer and Deserializer interfaces).
2) The InputFormat interface and the OutputFormat interface.

In the next post I’ll take a deep dive into the actual Hadoop/Hive APIs and Java code.
(Two years have gone by and I unfortunately never got round to writing anything else. Anything I would write now would probably be outdated, so I would encourage anyone who has questions to ping me directly or just ask the Hive community.)

Counting lines in files

Here’s a silly (and long) Python script I wrote a while back to count the number of lines in files recursively through a directory. (I guess I was fed up with ‘find ../ -iname "*cpp" | xargs wc -l’, or just utterly bored, or using Windows…)

# Usage:
#   ./ [path_to_dir_or_file]

__author__ = "Rui Barbosa Martins"

import optparse
import os.path
import sys

def getNumberOfLines(filePath):
  # Close the file as soon as the lines have been counted.
  with open(filePath, "r") as fp:
    numberOfLines = len(fp.readlines())
  return numberOfLines

def visit((fileTypesIncluded, files), dirname, names):
  for f in names:
    filePath = os.path.join(dirname, f)
    # Only count regular files whose name ends with one of the extensions.
    if ((fileTypesIncluded == ["*"] or
         filter(lambda fe: f.endswith(fe), fileTypesIncluded)) and
        os.path.isfile(filePath)):
      files[filePath] = getNumberOfLines(filePath)

def main(startPath, fileTypesIncluded):
  if not os.path.exists(startPath):
    print "Error: Path [%s] does not exist." % (startPath)
    sys.exit(1)
  elif os.path.isfile(startPath):
    files = {startPath: getNumberOfLines(startPath)}
  else:
    fileTypes = "|".join(fileTypesIncluded)
    absPath = os.path.abspath(startPath)
    print "Searching for extensions '%s' in '%s'." % (fileTypes, absPath)
    files = {}
    os.path.walk(startPath, visit, (fileTypesIncluded, files))

  keys = files.keys()
  total = 0
  longestLength = 0
  groups = {}

  for key in keys:
    if longestLength < len(key):
      longestLength = len(key)      

  for key in keys:
    spaces = " " * (longestLength - len(key))
    print "%s%s %d" % (key, spaces , files[key])
    total = total + files[key]

  if len(keys) > 1:
    print "Total lines of code: %d" % (total)

def parseArgs():
  usage = "Usage: %prog [Options] FILE_OR_DIRECTORY"
  parser = optparse.OptionParser(usage=usage)
  parser.add_option("-f", "--file_extensions", dest="file_extensions",
                    help="File extensions to count in. (comma separated)")
  (options, args) = parser.parse_args()
  fe = options.file_extensions
  if not args:
    path = "."
  elif len(args) == 1:
    path = args[0]
  else:
    print "Error: Too many paths provided. Only one expected."
    sys.exit(1)

  if type(fe) == str:
    fe = map(lambda s: s.strip(), fe.split(","))
  if not fe or filter(lambda s: "*" in s, fe):
    fe = ["*"]
  return path, fe

if __name__ == "__main__":
  path, fe = parseArgs()
  main(path, fe)
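
The script above is Python 2 only: print statements, the file() builtin, tuple parameters in function signatures and os.path.walk are all gone in Python 3. If you want the same thing today, here is a minimal Python 3 sketch of the core of it using os.walk. The names count_lines and count_tree are mine, and the argument parsing and pretty printing are left out.

```python
import os


def count_lines(path):
    """Count the lines in one file, reading bytes so odd encodings don't fail."""
    with open(path, "rb") as fp:
        return sum(1 for _ in fp)


def count_tree(start, extensions=("*",)):
    """Map every matching file under start to its number of lines."""
    counts = {}
    for dirname, _dirs, names in os.walk(start):
        for name in names:
            # "*" means every file, otherwise match on the end of the name.
            if "*" in extensions or name.endswith(tuple(extensions)):
                file_path = os.path.join(dirname, name)
                counts[file_path] = count_lines(file_path)
    return counts
```

For example, sum(count_tree(".", ["cpp", "h"]).values()) gives the total for the C++ sources under the current directory.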

SimplePhoto version 1.0.0

Finally, after almost a year battling away with wxWidgets, GraphicsMagick, gcc, Visual Studio, CppUnitLite, … I get to release version 1.0.0 of SimplePhoto.

SimplePhoto is a batch processing application for images. For the time being it lets you change image format and image dimensions, and group images.

My main focus for this application was to make its memory footprint as small as possible. It’s been implemented in C++ and on Windows it takes around 5MB of memory when running. There are still loads of features to implement but I really wanted to get this out there ASAP. Next step is to open source it (probably hosting it at

Have fun!

Finding magic.mgk in a deployed Mac app

After a lot of frustration I decided that the best way to have GraphicsMagick/ImageMagick find the required magic.mgk file inside a deployed Mac app is to patch the image library code.

The change is quite trivial and I’ve pasted the diff below. This patch was applied to GraphicsMagick version 1.3.7. Basically it uses the path provided via magick::InitializeMagick(“executable/path”) to find Contents/Resources. Have fun!!!

diff -r 152043af6bf4 magick/blob.c
--- a/magick/blob.c	Wed Nov 18 00:05:49 2009 +0000
+++ b/magick/blob.c	Wed Nov 18 00:29:25 2009 +0000
@@ -1741,6 +1741,18 @@
 #endif /* !defined(UseInstalledMagick) */
+  {
+		// ruibm added
+		char buffer[2048];
+		sprintf(buffer, "%s/../Resources/", GetClientPath());
+		// printf("Adding [%s] to the search path.\n", buffer);
+		AddConfigurePath(path_map,&path_index,buffer,exception);
+  }
   if (logging)

Mac Application Deployment – dylib

Lately I’ve been battling away with the Mach-O binary file format. I am trying to create a Mac version of an application that depends on dylibs provided by wxWidgets and GraphicsMagick. I started this quest by finding out the hard way that Mac binaries (libs and applications) store the path to their dylib dependencies in the binary itself. As you can imagine this is a big hassle for app deployment, as you force every user wanting to install the app to have exactly the same /usr/lib and /opt/local/lib as your machine does.

After a bit of scavenging in forums I found out that otool allows you to print the list of dependencies of a binary and that install_name_tool allows you to change them. Bearing this in mind, I wanted the simplest possible way to change the list of dependencies both in my binary and inside all of its dependency dylibs. I ended up writing the Python script below for this.

# Uses otool and install_name_tool to change a given path in the list of 
# dylib dependencies to another (hopefully relative) path to ease deployment.
# If this is applied to dylib files it will also change their ID.
# Author: Rui Barbosa Martins

import os.path
import re
import subprocess
import sys

def RunCmd(cmd):
	# Runs cmd in a shell and returns its stdout, or None if it failed.
	obj = subprocess.Popen(cmd, shell=True, bufsize=42000, stdout=subprocess.PIPE)
	out = ""
	while (True) :
		content = obj.stdout.read()
		if content:
			out += content
		else:
			break
	ret_code = obj.wait()
	if ret_code == 0:
		return out
	else:
		return None

def GetDependencies(file):
	cmd = "otool -L " + file
	output = RunCmd(cmd)
	if not output:
		raise Exception("Problem running otool. [%s]" % (cmd))
	output = output.split("\n")
	deps = list()
	for line in output:
		# Dependency lines look like "<path> (compatibility version ...)".
		m = re.match("(.*)\(compatibility version.*", line)
		if not m:
			continue
		dylib = m.group(1).strip()
		deps.append(dylib)
	return deps

def ChangeDependecies(file, dependencies, replaceFrom, replaceTo):
	print "Changing dependencies in %s" % (file)
	fname = os.path.basename(file)
	for d in dependencies:
		if not replaceFrom in d:
			continue
		new = d.replace(replaceFrom, replaceTo)
		if fname == os.path.basename(d):
			# The entry that names the file itself is the dylib's own ID.
			cmd = "install_name_tool -id %s %s" % (new, file)
			print "ID: %s -> %s" % (d, new)
		else:
			cmd = "install_name_tool -change %s %s %s" % (d, new, file)
			print "Change: %s -> %s" % (d, new)
		RunCmd(cmd)

def main(argv) :
	if len(argv) != 4:
		print "Usage: %s [file] [replaceFrom] [replaceTo]" % (argv[0])
		sys.exit(1)
	file = argv[1]
	replaceFrom = argv[2]
	replaceTo = argv[3]
	deps = GetDependencies(file)
	ChangeDependecies(file, deps, replaceFrom, replaceTo)

if __name__ == "__main__":
	main(sys.argv)
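
If you want to try the parsing and command-building logic without a Mac at hand, here is a minimal Python 3 sketch of just those two pieces, working on a string instead of calling otool. The function names are mine, and it assumes the same "path (compatibility version ...)" line layout that the regular expression in the script above relies on.

```python
import os.path
import re

# Same idea as the regex in the script: capture the path before the version info.
DEP_PATTERN = re.compile(r"\s*(.*?)\s*\(compatibility version.*")


def parse_dependencies(otool_output):
    """Extract the dylib paths from the text printed by 'otool -L'."""
    deps = []
    for line in otool_output.splitlines():
        m = DEP_PATTERN.match(line)
        if m:
            deps.append(m.group(1))
    return deps


def build_command(target, dep, replace_from, replace_to):
    """Return the install_name_tool argument list for one dependency, or None."""
    if replace_from not in dep:
        return None
    new = dep.replace(replace_from, replace_to)
    if os.path.basename(dep) == os.path.basename(target):
        # The entry naming the file itself is the dylib's own ID.
        return ["install_name_tool", "-id", new, target]
    return ["install_name_tool", "-change", dep, new, target]
```

Returning an argument list rather than a formatted string means the command can be passed to subprocess without shell=True, which sidesteps quoting problems with paths that contain spaces.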