Nov 7, 2009

Late to the Party

I just worked my way through this fantastic game. Great stuff! Could it be that you haven't played Portal yet?

Wow, Mac Wins

I was just cleaning up my pile of computer bits, and I found a USB hard drive that I've not used in a while. After browsing the contents I felt it was time to wipe it clean. I wanted to format it to be Windows friendly just in case. Linux felt like the best tool for the job there, and a combo of fdisk and mkfs.vfat accomplished that pretty quickly. I wanted to find a good tool to test the disk for bad sectors (testdisk?) and that's when I noticed:

OMG. I haven't even installed Firefox in Ubuntu since installing 9.04 back in April! I just don't use Linux any more. Mac has won me over that completely.

Oct 23, 2009

Distributed Solutions: What is out there?

I've been reading the Hadoop book mainly to learn more about the MapReduce approach to scaling solutions. Pig is interesting but not update-oriented. Hive sounds like the closest Hadoop tool but that's not right either:

"Hive is based on Hadoop which is a batch processing system. ... For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours." [more]

I'm looking for something that can distribute a large set of data across a number of machines, and then let me coordinate the processing so that each machine works on whatever portion of the data it holds locally. That's why Hadoop was sounding so promising. The other requirement I have is that the data can be supplemented on the fly. Some lag is acceptable, but ultimately I think this eliminates Hadoop too, as its append-only approach is incompatible here.

A basic Distributed Hash Table that supports partitioning of the data would be perfect if it has some way of knowing what data it "owned" locally. Infinispan looks like it might be a good fit once it matures. Unfortunately "the ability to move the code to where the data is and execute it there" [issue] is part of the last milestone on their road map horizon.

Hmm. Infinispan, Project Voldemort, and Riak. They will all know what data they have cached locally, and that's halfway to being able to execute a job in a partitioned way. The other half is either modeling some concept of which node is the "primary" cache of some data, or having a way of resolving any duplication of work between nodes. Hadoop solves this problem with the HDFS namenode, which keeps track of which nodes hold which blocks and where everything has been replicated to.
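
To make the "primary" idea concrete, here is a minimal sketch in shell. It assumes the simplest possible scheme: hash each key and take it modulo the node count, so every node can work out for itself whether it is the primary for a key it holds. The function names (owner_of, is_primary), the keys and the node numbering are all made up for illustration, and this is not how any of the three stores above necessarily assigns ownership (Dynamo-style stores tend to use consistent hashing, so that adding a node doesn't reshuffle every key).

```shell
#!/bin/sh
# Hypothetical helper: map a key to the index of its "primary" node.
# With input on stdin, cksum prints "<crc> <byte count>"; keep only the CRC.
owner_of() {
    key=$1
    node_count=$2
    crc=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
    echo $((crc % node_count))
}

# Hypothetical helper: succeed if node $2 (of $3 nodes) is primary for key $1.
is_primary() {
    [ "$(owner_of "$1" "$3")" -eq "$2" ]
}

# Node 2 of 4 decides which of its locally cached keys it should process.
for key in movie:1 movie:2 movie:3 movie:4 movie:5; do
    if is_primary "$key" 2 4; then
        echo "node 2 processes $key"
    fi
done
```

Because every node computes the same answer for the same key, replicas of a key can sit on several nodes and still only be processed once.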

What's the solution for a technology that hasn't necessarily been designed with this consideration in mind? What do people do when they have outgrown their RDBMS and still want to be able to process large volumes of data for quick ad-hoc queries?

Oct 17, 2009

Firefox 3.5: Firefox doesn't know how to open this address, because the protocol (foo) isn't associated with any program.

Humph. I'm not having any success creating a new protocol that maps to an external application in Firefox. I have these in my prefs.js:

user_pref("network.protocol-handler.app.foo", "/Applications/TextEdit.app/Contents/MacOS/TextEdit");
user_pref("network.protocol-handler.expose.foo", true);
user_pref("network.protocol-handler.external.foo", true);
user_pref("network.protocol-handler.warn-external.foo", false);

Have you got it working?

I've seen these: knowledge base entry, registering a protocol, and path not found error. This one I didn't get to the bottom of: whitelisting protocols.

Here's my about:config which shows my default settings as well:

Oct 16, 2009

Hadoop Streaming: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127

So, my most recent problem was that my hadoop streaming job was failing, and the tracking url was listing this exception under "Failed/Killed Task Attempts":

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:540)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Digging around in the logs directory I found a fuller explanation:
env: groovy: No such file or directory
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
My groovy script was failing to launch, presumably because groovy wasn't on the PATH that the Hadoop task runs with. My hacked solution was to hardcode the path to groovy in my script:
#!/usr/bin/env /Users/user/Applications/groovy-1.6-RC-2/bin/groovy
And then, voila! It worked!
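
The 127 is a clue in itself: it's the exit status that env (and the shell) use for "command not found". A quick sketch to see the same thing outside Hadoop, with a made-up command name standing in for the missing groovy:

```shell
#!/bin/sh
# env exits with status 127 when it can't find the command on PATH.
# "no-such-groovy" is a made-up name standing in for the missing groovy.
status=0
env no-such-groovy 2>/dev/null || status=$?
echo "exit status: $status"
```

So a streaming subprocess failing with code 127 most likely means the interpreter named in the script's shebang line couldn't be found, rather than anything going wrong inside the script itself.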

Hadoop Streaming: Cannot run program

I'm working from the Hadoop book, and I'm noticing some gaps. When I ran hadoop using streaming like this, my job hung:

$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-0.20.1-streaming.jar \
-input movies.input.txt -output output -mapper movies.map.groovy \
-reducer movies.reduce.groovy
packageJobJar: [/tmp/hadoop-user/hadoop-unjar2417163781720808364/] [] /var/folders/Ht/HtruzsCeGAukVrRT16Q4+k+++TI/-Tmp-/streamjob7394763723023597419.jar tmpDir=null
09/10/16 21:13:46 INFO mapred.FileInputFormat: Total input paths to process : 1
09/10/16 21:13:47 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-user/mapred/local]
09/10/16 21:13:47 INFO streaming.StreamJob: Running job: job_200910162029_0008
09/10/16 21:13:47 INFO streaming.StreamJob: To kill this job, run:
09/10/16 21:13:47 INFO streaming.StreamJob: /Users/user/Applications/hadoop-0.20.1/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_200910162029_0008
09/10/16 21:13:47 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_200910162029_0008
09/10/16 21:13:48 INFO streaming.StreamJob: map 0% reduce 0%
^C
The tracking URL is handy: from there I could see that hadoop streaming wasn't able to run my groovy mapper:
Caused by: java.io.IOException: Cannot run program "movies.map.groovy": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:166)
... 19 more
I was missing the -file parameter, which ships a local script along with the job so the tasks can find it. This now gets me to my next error...
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-0.20.1-streaming.jar \
-input movies.input.txt -output output -mapper movies.map.groovy \
-reducer movies.reduce.groovy -file movies.map.groovy -file movies.reduce.groovy
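
Since streaming scripts just read stdin and write stdout, the same map, sort, reduce flow can be tried as a plain Unix pipeline before submitting anything. A minimal sketch follows. The awk one-liners are stand-ins rather than the groovy scripts above, and the tab-separated "title, year" sample data is made up:

```shell
#!/bin/sh
# Stand-in mapper: emit "<year><TAB>1" for each "title<TAB>year" input line.
map() {
    awk -F '\t' '{ print $2 "\t1" }'
}

# Stand-in reducer: sum the counts per key (input must arrive sorted by key).
reduce() {
    awk -F '\t' '
        $1 != key { if (key != "") print key "\t" total; key = $1; total = 0 }
        { total += $2 }
        END { if (key != "") print key "\t" total }'
}

# Made-up sample data; sort plays the role of Hadoop's shuffle phase.
result=$(printf 'Alien\t1979\nManhattan\t1979\nAliens\t1986\n' | map | sort | reduce)
echo "$result"
```

With the real scripts the equivalent would be along the lines of: cat movies.input.txt | ./movies.map.groovy | sort | ./movies.reduce.groovy. That assumes both scripts are executable and have a working shebang line.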

Hadoop: HDFS in a bad state

I managed to get to the position where I could not copy files to my HDFS:

$ hadoop fs -copyFromLocal movies.input.txt movies.input.txt 
09/10/16 21:20:43 WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
File /user/user/movies.input.txt could only be replicated to
0 nodes, instead of 1
Formatting my HDFS didn't help. http://localhost:50070/dfshealth.jsp was reporting that I had no live nodes. In the end I blew away my /tmp/hadoop-user directory and followed up with a hadoop namenode -format for good measure.

Hadoop: Error: JAVA_HOME is not set.

If you get this when trying to use Hadoop:

$ start-dfs.sh 
starting namenode, logging to /Users/user/Applications/hadoop-0.20.1/bin/../logs/hadoop-user-namenode-merlyn.local.out
localhost: starting datanode, logging to /Users/user/Applications/hadoop-0.20.1/bin/../logs/hadoop-user-datanode-merlyn.local.out
localhost: Error: JAVA_HOME is not set.
localhost: starting secondarynamenode, logging to /Users/user/Applications/hadoop-0.20.1/bin/../logs/hadoop-user-secondarynamenode-merlyn.local.out
localhost: Error: JAVA_HOME is not set.
Then set JAVA_HOME in conf/hadoop-env.sh (I couldn't easily google this info).

Edit: I missed this: "Unpack the downloaded Hadoop distribution. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation."

The file is helpful once you know it is relevant:
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
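
A small pre-flight check can catch a wrong or missing value before the daemons do. This is only a sketch: check_java_home is a made-up helper, not something Hadoop provides, and it assumes that a usable JAVA_HOME is one with an executable bin/java underneath it.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of Hadoop: succeed only if
# JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable found at $JAVA_HOME/bin/java"
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Report the result without aborting the calling script.
check_java_home || true
```

Note that a value exported only in your login shell may not be seen by the datanode and secondarynamenode, which is presumably why the namenode started fine above while the other two complained.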

Oct 4, 2009

P2V: Hassles Cloning a Linux Installation as a Virtual Machine

After some reading around, I settled on these tools to get the job done:

  • A Knoppix boot disc
  • An 8GB USB pen drive
  • VirtualBox
  • dd and fdisk
All free to use, all robust.

I took a look at my partition table on my physical Ubuntu box:
$ fdisk -l /dev/hda
Disk /dev/hda: 100.0 GB, 100030242816 bytes
255 heads, 63 sectors/track, 12161 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1         522     4192933+  83  Linux
Next, I copied the contents of hda1 to my pen drive:
$ dd if=/dev/hda1 of=/media/usb/hda1.raw bs=4096
I also copied the MBR:
$ dd if=/dev/hda of=/media/usb/mbr.raw bs=512 count=1
Back to my host computer and my fresh VirtualBox VM. I booted the VM using Knoppix, and then created the same partition layout (using fdisk expert mode to change the number of heads and sectors to match). I rebooted for good measure.

After booting again in Knoppix I first copied hda1 into place:
$ dd if=/media/usb/hda1.raw of=/dev/hda1 bs=4096
Then I copied the MBR:
$ dd if=/media/usb/mbr.raw of=/dev/hda bs=512 count=1
I rebooted, and my cloned system wouldn't start. Restarting with Knoppix, I found the following problem:
$ sudo mount /dev/hda1 ~/temp
mount: /dev/hda1: can't read superblock
Now, I've tried this a number of different ways: without the MBR-writing step, without tweaking the heads and sectors to match, with a partition larger than the original one - all to no avail.

I'm not finding other people with this same problem. Does anyone have any ideas?
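
One check that would narrow this down is whether the bytes survived each hop: a checksum of the source partition should match a checksum of the image on the pen drive, and of the restored partition. Here is a sketch of the idea, using ordinary temp files in place of /dev/hda1 and the pen drive so that it is safe to run anywhere. The file names and the 1 MiB size are arbitrary.

```shell
#!/bin/sh
# Temp files stand in for the source partition and the image on the pen drive.
workdir=$(mktemp -d)
src="$workdir/partition.raw"
img="$workdir/hda1.raw"

# Fill the stand-in partition with 1 MiB of arbitrary data.
dd if=/dev/urandom of="$src" bs=4096 count=256 2>/dev/null

# Same shape of copy command as used for the real partition.
dd if="$src" of="$img" bs=4096 2>/dev/null

# With input on stdin, cksum prints "<crc> <size>"; both must agree.
src_sum=$(cksum < "$src")
img_sum=$(cksum < "$img")
if [ "$src_sum" = "$img_sum" ]; then
    verdict="copy matches"
else
    verdict="copy differs"
fi
echo "$verdict ($img_sum)"
rm -rf "$workdir"
```

On the real machines this would mean running cksum against /dev/hda1 on the Ubuntu box, against hda1.raw on the pen drive, and against /dev/hda1 inside the VM. A mismatch, or an image shorter than the partition, would point at the copy itself rather than at the partition table or MBR.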

Aug 29, 2009

Ruby Warrior

Here's an interesting programming puzzle (thanks Nige). The challenge is to write a (stateless?) control function for a warrior, and have it fight its way up the stairs to the next level. This is the level I'm stuck on:

Level 6

The wall behind you feels a bit further away in this room. And you hear more cries for help.

Tip: You can walk backward by passing ':backward' as an argument to walk!. Same goes for feel, rescue! and attack!.

 --------
|C @ S aa|
 --------

> = Stairs
@ = groo (20 HP)
C = Captive (1 HP)
S = Thick Sludge (24 HP)
a = Archer (7 HP)


Available Abilities:

warrior.walk!
Move in given direction (forward by default).

warrior.rest!
Gain 10% of max health back, but do nothing more.

warrior.feel
Returns a Space for the given direction (forward by default).

warrior.health
Returns an integer representing your health.

warrior.rescue!
Rescue a captive from his chains (earning 20 points) in given direction (forward by default).

warrior.attack!
Attack the unit in given direction (forward by default).


Give it a shot.