I've been reading the Hadoop book mainly to learn more about the MapReduce approach to scaling solutions. Pig is interesting but not update-oriented. Hive sounds like the closest Hadoop tool but that's not right either:
"Hive is based on Hadoop which is a batch processing system. ... For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours." [more]
I'm looking for something that can distribute a large set of data across a number of machines and then let me coordinate processing so that each machine works on whatever portion of the data it holds locally. That's why Hadoop sounded so promising. The other requirement is that the data can be supplemented on the fly. Some lag is acceptable, but ultimately I think this eliminates Hadoop too, as its append-only approach is incompatible with that.
A basic Distributed Hash Table that supports partitioning of the data would be perfect if it has some way of knowing what data it "owned" locally. Infinispan looks like it might be a good fit once it matures. Unfortunately "the ability to move the code to where the data is and execute it there" [issue] is part of the last milestone on their road map horizon.
Hmm. Infinispan, Project Voldemort, and Riak all know what data they have cached locally, and that's halfway to being able to execute a job in a partitioned way. The other half is either modeling some concept of which node is the "primary" cache of some data, or having a way of resolving any duplication of work between nodes. Hadoop solves this problem by having an indexer that keeps track of which nodes own what and where everything has been replicated to.
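To make the "primary" idea concrete, here is a minimal sketch in Python. It assumes each node can see which nodes hold a copy of each key; the `replicas` map and the names `primary_for` and `local_work` are hypothetical and not part of Infinispan, Voldemort, or Riak. Every node applies the same deterministic rule, so exactly one holder claims each key and no coordination is needed to avoid duplicate work.

```python
import hashlib
from typing import Dict, List


def primary_for(key: str, holders: List[str]) -> str:
    """Pick one holder of `key` deterministically, so all nodes agree."""
    ranked = sorted(holders)
    # Use a stable hash (not Python's per-process randomized hash()) so that
    # every machine computes the same index for the same key.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return ranked[int.from_bytes(digest[:8], "big") % len(ranked)]


def local_work(node: str, replicas: Dict[str, List[str]]) -> List[str]:
    """Keys that `node` should process: held locally and `node` is primary."""
    return sorted(
        key
        for key, holders in replicas.items()
        if node in holders and primary_for(key, holders) == node
    )


# Hypothetical replica placement: key -> nodes holding a copy.
replicas = {
    "user:1": ["n1", "n2"],
    "user:2": ["n2", "n3"],
    "user:3": ["n1", "n3"],
}
for node in ("n1", "n2", "n3"):
    print(node, local_work(node, replicas))
```

The trade-off is that the rule only works while all nodes share the same view of who holds what; if that view can diverge, some way of resolving duplicated work is still needed.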
What's the solution for a technology that hasn't necessarily been designed with this consideration in mind? What do people do when they have outgrown their RDBMS and still want to be able to process large volumes of data for quick ad-hoc queries?

1 comment:
http://riak.basho.com/scale.html
Start with "consistent hashing" (http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing). With a GMS (group membership service) publishing an ordered list of members, any client that knows the algorithm can insert data, and target code execution, directly at the proper node with little data transfer or effort. (I can't find a good description of a GMS; that book I linked is the only source I know of.)
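A rough sketch of the consistent hashing idea, in Python. It assumes the membership list has already been obtained from some membership service; the `Ring` class, the `vnodes` parameter, and the node names are illustrative and not taken from Riak or any other product. Any client holding the same member list computes the same owner for a key, which is what lets it send data or code straight to the right node.

```python
import bisect
import hashlib
from typing import List


def _point(value: str) -> int:
    """Map a string to a stable position on the ring."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Consistent hash ring built from a list of member names."""

    def __init__(self, members: List[str], vnodes: int = 64) -> None:
        if not members:
            raise ValueError("ring needs at least one member")
        # Each member is placed at several points to even out the key ranges.
        self._points = sorted(
            (_point(f"{member}#{i}"), member)
            for member in members
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def owner(self, key: str) -> str:
        """Return the member at the first ring point after the key's hash."""
        i = bisect.bisect_right(self._hashes, _point(key)) % len(self._points)
        return self._points[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))
```

When a member leaves, only the keys that member owned are reassigned; keys owned by the remaining members stay where they are, which keeps data movement small as the cluster changes.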
Are you specifically trying to execute Java on the data's 'primary' node? If so, if you choose an appropriate consistency level, the node's 'primaryness' becomes largely irrelevant.