Data Processing

Map-Reduce

Hadoop

http://hadoop.apache.org/

Allows data processing from different sources

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

Links

Visualization

Software

Store processed data

Data processing often happens in three steps: first the current data is queried, then it is processed in RAM, and finally the result is written back to a storage system. Where and how the result is stored depends on its structure and on how it should be queried afterwards, so any of the storage systems can be chosen for storing results.
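The three steps can be sketched roughly as follows. This is only an illustration: `sourceStore` and `resultStore` are hypothetical in-memory stand-ins for real storage systems, and the documents in them are made up.

```javascript
// Hypothetical in-memory stand-ins for a source store and a result store.
const sourceStore = [
  { id: "u1", type: "user", contexts: ["c1", "c2"] },
  { id: "u2", type: "user", contexts: ["c1"] },
  { id: "d1", type: "content", contexts: ["c1"] },
];
const resultStore = new Map();

// Step 1: query the current data.
const users = sourceStore.filter((doc) => doc.type === "user");

// Step 2: process it in RAM.
const results = users.map((doc) => ({
  id: doc.id,
  contextCount: doc.contexts.length,
}));

// Step 3: write the result back to a storage system.
for (const result of results) {
  resultStore.set(result.id, result);
}
```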

Centrality

For the thesis I want to evaluate how different centrality indexes can be calculated with the given storage system.

Degree centrality

Degree centrality simply describes how many connections a node in the network has. In the case of the useKit data this gets a little more complex because there are three different types of nodes: Context, Content and User. A user that has 100 content items (100 connections) is not necessarily more important than a user with 3 contexts and only 1 content item. The reason is that a user can be connected to many other users through a context, whereas a user who only has private content items is not connected to other users at all.

For the degree centrality I try to define a reasonable weight for each type of connection. The weight of a connection can be between 0 and 10.

  • Content → User: 1
  • Content → Context: 1
  • User → Context: 5
  • User → Content: 4
  • Context → User: 7
  • Context → Content: 2

The degree centrality of a node is calculated as the sum of the weights of all its connections. To compare the results, the values have to be normalized to the range 0-1.
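A minimal sketch of this calculation, using the weights listed above. The node data is hypothetical, and the normalization (dividing by the largest value so the top node gets 1) is an assumption, since the method of normalization is not fixed here.

```javascript
// Weights from the list above: WEIGHTS[sourceType][targetType].
const WEIGHTS = {
  content: { user: 1, context: 1 },
  user: { context: 5, content: 4 },
  context: { user: 7, content: 2 },
};

// Weighted degree: for each connection type, number of connections * weight.
function weightedDegree(type, counts) {
  const weights = WEIGHTS[type];
  let degree = 0;
  for (const target of Object.keys(weights)) {
    degree += (counts[target] || 0) * weights[target];
  }
  return degree;
}

// Assumption: normalize to 0-1 by dividing by the largest degree.
function normalize(degrees) {
  const max = Math.max(...degrees);
  return degrees.map((d) => (max === 0 ? 0 : d / max));
}

// Hypothetical nodes with the number of connections per target type.
const nodes = [
  { id: "u1", type: "user", counts: { context: 3, content: 1 } },
  { id: "c1", type: "context", counts: { user: 2, content: 5 } },
  { id: "d1", type: "content", counts: { user: 1, context: 1 } },
];

const degrees = nodes.map((n) => weightedDegree(n.type, n.counts));
const normalized = normalize(degrees);
```

With this data the user gets 3 * 5 + 1 * 4 = 19, the context 2 * 7 + 5 * 2 = 24 and the content item 1 + 1 = 2, so the context ends up with the normalized value 1.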

In general I think degree centrality is too limited for this kind of data, because for the value of a node it is important to know how well connected its neighbouring nodes are (example: how many users there are in a context the user belongs to). So I will probably get better results with eigenvector centrality, where the importance of a node also depends on its connected nodes.
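To illustrate the idea that a node's importance depends on its neighbours, here is a rough sketch of eigenvector centrality computed by power iteration. It is not tied to any of the storage systems: the graph is a hypothetical, unweighted, undirected adjacency list, and adding a node's own score in every round is a choice made here so that the iteration also settles on bipartite-like graphs (it does not change the resulting ranking).

```javascript
// Power iteration on an undirected graph given as adjacency lists.
function eigenvectorCentrality(adjacency, iterations = 100) {
  const ids = Object.keys(adjacency);
  let score = {};
  for (const id of ids) score[id] = 1;

  for (let i = 0; i < iterations; i++) {
    const next = {};
    for (const id of ids) {
      // New score = own score + sum of the neighbours' scores.
      next[id] = score[id];
      for (const neighbour of adjacency[id]) next[id] += score[neighbour];
    }
    // Rescale so that the most central node has the value 1.
    const max = Math.max(...Object.values(next));
    for (const id of ids) next[id] /= max;
    score = next;
  }
  return score;
}

// Hypothetical graph: context c1 has three users, user u1 also has content d1.
const graph = {
  c1: ["u1", "u2", "u3"],
  u1: ["c1", "d1"],
  u2: ["c1"],
  u3: ["c1"],
  d1: ["u1"],
};

const centrality = eigenvectorCentrality(graph);
```

In this graph u2 and d1 both have exactly one connection, so their plain degree is the same, but u2 scores higher because its single neighbour (the context c1) is itself well connected.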

MySQL

Implemented with views

-- Weighted degree of every content item (Content → Context: 1, Content → User: 1)
CREATE VIEW content_degree AS
SELECT id,
	(
		((SELECT COUNT(*) FROM content_context WHERE content_id = content.id) * 1)
		+
		((SELECT COUNT(*) FROM content_user WHERE content_id = content.id) * 1)
	) AS degree
FROM content
ORDER BY degree DESC;

CouchDB

View script

{
    "_id":"_design/usekit",
    "language": "javascript",
    "views":
    {
        "contentDegree": {
            "map": "function(doc) {
				if (doc.type == 'content') {
					var contextCounter = doc.contexts.length;
					var userCounter = doc.users.length;

					var degree = userCounter * 1 + contextCounter * 1;
					emit(degree, doc.content_id);
				}
            }"
        },
        "userDegree": {
            "map": "function(doc) {
				if (doc.type == 'user') {
					var contextCounter = doc.contexts.length;
					var contentCounter = doc.contents.length;

					var degree = contentCounter * 4 + contextCounter * 5;
					emit(degree, doc.user_id);
				}
            }"
        },
        "contextDegree": {
            "map": "function(doc) {
				if (doc.type == 'context') {
					var contentCounter = doc.contents.length;
					var userCounter = doc.users.length;

					var degree = userCounter * 7 + contentCounter * 2;
					emit(degree, doc.context_id);
				}
            }"
        }
    }
}
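To try out the logic of such a map function without a running CouchDB, it can be run against a few sample documents. The sketch below is only a local stand-in: `runView` is a hypothetical helper, `emit` is passed in as a parameter (in CouchDB it is a global), and the documents are made up.

```javascript
// Hypothetical helper: applies a map function to all docs and collects the
// emitted rows, sorted by key in descending order (highest degree first).
function runView(mapFn, docs) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  for (const doc of docs) mapFn(doc, emit);
  return rows.sort((a, b) => b.key - a.key);
}

// Same logic as the userDegree map function above, with emit as a parameter.
function userDegree(doc, emit) {
  if (doc.type == "user") {
    const contextCounter = doc.contexts.length;
    const contentCounter = doc.contents.length;
    emit(contentCounter * 4 + contextCounter * 5, doc.user_id);
  }
}

// Hypothetical sample documents.
const docs = [
  { type: "user", user_id: "u1", contexts: ["c1"], contents: ["d1", "d2"] },
  { type: "user", user_id: "u2", contexts: ["c1", "c2", "c3"], contents: ["d3"] },
  { type: "content", content_id: "d1", contexts: ["c1"], users: ["u1"] },
];

const rows = runView(userDegree, docs);
```

On a real database the view would be read over HTTP from `/<database>/_design/usekit/_view/userDegree`, with `descending=true` to get the highest degree first. Note that strict JSON does not allow raw line breaks inside strings, so the map functions in the design document may have to be written on one line (or with escaped `\n`) when it is uploaded as plain JSON.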

Eigenvector centrality

master-thesis/data_processing/index.txt · Last modified: 2010/10/18 16:59 by ruflin
 
Unless otherwise noted, the content of this wiki is published under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported