Allows data processing from different sources
Browse the web interface for the NameNode and the JobTracker; by default they are available at:
Data processing often happens in three steps. First the current data is queried, then it is processed in RAM, and finally the result is written back to a storage system. Where and how the result is stored depends on its structure and on how it should be queried afterwards. So for storing results it is again possible to choose one of the storage systems.
For the thesis I want to evaluate how different centrality indexes can be calculated with the given storage systems.
Degree centrality simply describes how many connections a node in the network has. In the case of the useKit data this gets a little more complex because there are three different types of nodes: Context, Content and User. A user that has 100 content items (100 connections) is not necessarily more important than a user with 3 contexts and only 1 content item. The reason is that a user can be connected to many other users through a context, whereas if the other user has only private content items, that user is not connected to anyone at all.
For the degree centrality I try to define a reasonable weight for each of the different types of connections. The weight of a connection can be between 0 and 10.
The degree centrality of a node is calculated as the sum of all these weighted values. To compare the results, the values have to be normalized to the range 0-1.
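The weighted sum and the normalization described above can be sketched as follows. This is only an illustration under assumptions: the weight table reuses the factors from the view script in these notes, the `contents`/`contexts`/`users` arrays follow the document fields used there, and dividing by the largest degree is one assumed way to normalize to 0-1 (the notes do not fix a method).

```javascript
// Weights per node type and connection type (0-10).
// The numbers are taken from the view script in these notes.
const WEIGHTS = {
  user: { contents: 4, contexts: 5 },
  context: { users: 7, contents: 2 },
  content: { users: 1, contexts: 1 },
};

// Weighted degree: sum over connection types of (count * weight).
function weightedDegree(node) {
  const weights = WEIGHTS[node.type];
  return Object.keys(weights).reduce(
    (sum, key) => sum + (node[key] || []).length * weights[key],
    0
  );
}

// Normalize to 0-1 by dividing by the largest degree.
// ASSUMPTION: max-scaling; min-max scaling would be another option.
function normalize(degrees) {
  const max = Math.max(...degrees);
  return degrees.map((d) => (max === 0 ? 0 : d / max));
}

// The two users from the example: 100 content items vs. 3 contexts + 1 content item.
const heavyUser = { type: "user", contents: new Array(100).fill("c"), contexts: [] };
const connectedUser = { type: "user", contents: ["c1"], contexts: ["x1", "x2", "x3"] };

const degrees = [heavyUser, connectedUser].map(weightedDegree);
const normalized = normalize(degrees);
console.log(degrees, normalized);
```

With these weights the user with 100 content items still scores 400 against 19, so the choice of weights alone does not remove the imbalance described in the example.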
In general I think the degree centrality is too limited for this kind of data, because for the value of a node it is important to know how well connected its neighbouring nodes are (example: how many users there are in a context the user belongs to). So I will probably get better results with the eigenvector centrality, where the importance of a node also depends on the importance of its connected nodes.
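To illustrate the difference, here is a minimal sketch of eigenvector centrality computed by power iteration. It is not the implementation used in the thesis: the function name, the adjacency matrix and the small example graph are hypothetical, and the iteration uses A + I instead of A (an assumption made here so that the iteration also settles on bipartite user-context graphs; the dominant eigenvector is the same).

```javascript
// Power iteration for eigenvector centrality on an undirected graph.
// adjacency[i][j] is the (possibly weighted) link between node i and node j.
function eigenvectorCentrality(adjacency, iterations = 100, tolerance = 1e-9) {
  let scores = new Array(adjacency.length).fill(1);
  for (let step = 0; step < iterations; step++) {
    // New score = own score + weighted sum of neighbour scores (A + I shift).
    const next = adjacency.map((row, i) =>
      row.reduce((sum, w, j) => sum + w * scores[j], scores[i])
    );
    const max = Math.max(...next);
    if (max === 0) return next;
    // Rescale so that the most central node has score 1.
    const scaled = next.map((v) => v / max);
    const delta = scaled.reduce((d, v, i) => Math.max(d, Math.abs(v - scores[i])), 0);
    scores = scaled;
    if (delta < tolerance) break;
  }
  return scores;
}

// Hypothetical graph: context c1 has four users, context c2 has two.
// Node order: u1, c1, u2, u3, u4, c2, u5
const edges = [[0, 1], [2, 1], [3, 1], [4, 1], [4, 5], [6, 5]];
const adjacency = Array.from({ length: 7 }, () => new Array(7).fill(0));
edges.forEach(([a, b]) => {
  adjacency[a][b] = 1;
  adjacency[b][a] = 1;
});

const scores = eigenvectorCentrality(adjacency);
console.log(scores);
```

In this graph u1 and u5 both have exactly one connection, so their degree centrality is identical, but u1 receives the higher eigenvector score because its context has more users.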
Implemented with views
CREATE VIEW content_degree AS
SELECT
    id,
    (
        ((SELECT COUNT(*) FROM content_context WHERE content_id = content.id) * 1)
      + ((SELECT COUNT(*) FROM content_user WHERE content_id = content.id) * 1)
    ) AS degree
FROM content
ORDER BY degree DESC
View script
{
  "_id": "_design/usekit",
  "language": "javascript",
  "views": {
    "contentDegree": {
      "map": "function(doc) {
        if (doc.type == 'content') {
          var contextCounter = doc.contexts.length;
          var userCounter = doc.users.length;
          var degree = userCounter * 1 + contextCounter * 1;
          emit(degree, doc.content_id);
        }
      }"
    },
    "userDegree": {
      "map": "function(doc) {
        if (doc.type == 'user') {
          var contextCounter = doc.contexts.length;
          var contentCounter = doc.contents.length;
          var degree = contentCounter * 4 + contextCounter * 5;
          emit(degree, doc.user_id);
        }
      }"
    },
    "contextDegree": {
      "map": "function(doc) {
        if (doc.type == 'context') {
          var contentCounter = doc.contents.length;
          var userCounter = doc.users.length;
          var degree = userCounter * 7 + contentCounter * 2;
          emit(degree, doc.context_id);
        }
      }"
    }
  }
}
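To check what such a view emits without a running CouchDB, the map function can be run locally against a few sample documents. The sketch below is only an approximation: `runView` is a hypothetical helper, `emit` is passed in as a parameter because CouchDB's global `emit()` only exists inside the server, and the sample documents are made up.

```javascript
// Made-up sample documents; field names follow the map functions above.
const docs = [
  { type: "user", user_id: "u1", contexts: ["x1", "x2", "x3"], contents: ["c1"] },
  { type: "user", user_id: "u2", contexts: [], contents: ["c1", "c2"] },
  { type: "content", content_id: "c1", contexts: ["x1"], users: ["u1", "u2"] },
];

// Same body as the userDegree map function, with emit as an explicit parameter.
function userDegreeMap(doc, emit) {
  if (doc.type == "user") {
    var contextCounter = doc.contexts.length;
    var contentCounter = doc.contents.length;
    var degree = contentCounter * 4 + contextCounter * 5;
    emit(degree, doc.user_id);
  }
}

// Hypothetical helper: collect the emitted rows and sort them by key in
// descending order, roughly what querying the view with descending=true returns.
function runView(mapFn, documents) {
  const rows = [];
  documents.forEach((doc) => mapFn(doc, (key, value) => rows.push({ key, value })));
  return rows.sort((a, b) => b.key - a.key);
}

const rows = runView(userDegreeMap, docs);
console.log(rows);
```

Because the degree is emitted as the key, the view is already sorted by degree, and the highest-ranked nodes can be read from the end of the index with a descending query.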