Using Elastica with multiple Elasticsearch Nodes
Written by Nicolas Ruflin   
Monday, 21 November 2011 17:30

Elasticsearch was built with the cloud and multiple distributed servers in mind. It is quite easy to start an Elasticsearch cluster: simply start multiple instances of Elasticsearch on one server or on multiple servers. Every Elasticsearch instance is called a node. To start multiple instances on your local machine, run the following command in the Elasticsearch folder twice (the -f flag keeps the process in the foreground, so use a separate terminal for each):

./bin/elasticsearch -f
./bin/elasticsearch -f

As you will see, the first node starts on port 9200 and the second on port 9201. Elasticsearch automatically discovers the other node and forms a cluster. Elastica can retrieve all node and cluster information. In the following example, the cluster object (Elastica_Cluster) is first retrieved from the client and the cluster state is read. Then all cluster nodes (Elastica_Node) are retrieved and the name of every node is printed. Every cluster has at least one node and every node has a specific name.

$client = new Elastica_Client();

// Retrieve an Elastica_Cluster object
$cluster = $client->getCluster();

// Returns the cluster state
$state = $cluster->getState();

// Gets all cluster nodes
$nodes = $cluster->getNodes();

foreach ($nodes as $node) {
    echo $node->getName();
}

Client to multiple servers

Because Elasticsearch is a distributed search engine that can run on multiple servers, search keeps working as expected even when some servers fail, since the data is stored redundantly (replicas). The number of shards and replicas can be chosen for every index at creation time. Of course, this can also be set with Elastica through the index settings passed at creation, as can be seen in the Elastica_Index test. More details on this perhaps in a later blog post.
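To make the shard and replica settings concrete, here is a minimal sketch of the settings body Elasticsearch accepts when an index is created. The keys number_of_shards and number_of_replicas are Elasticsearch's own setting names; the helper buildIndexSettings is hypothetical and not part of Elastica.

```php
<?php
// Hypothetical helper, not part of Elastica: builds the settings body
// that is sent to Elasticsearch when an index is created.
function buildIndexSettings($shards, $replicas)
{
    if ($shards < 1 || $replicas < 0) {
        throw new InvalidArgumentException('shards must be >= 1 and replicas >= 0');
    }

    return array(
        'settings' => array(
            'number_of_shards'   => (int) $shards,
            'number_of_replicas' => (int) $replicas,
        ),
    );
}

// Three shards, each with one replica copy.
echo json_encode(buildIndexSettings(3, 1)), PHP_EOL;
// prints {"settings":{"number_of_shards":3,"number_of_replicas":1}}
```

With one replica per shard, every shard exists twice in the cluster, so a single node can fail without data becoming unavailable, provided the copies live on different nodes.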

One of the goals of a distributed search index is availability. If one server goes down, search results should still be served. But if the client connects only to the server that just went down, no results are returned. Because of this, Elastica_Client supports multiple servers, which are accessed in round-robin order. This is the only and also the most basic option at the moment. So if we start nodes on ports 9200 and 9201 as above, we pass the following arguments to Elastica_Client to access both servers.

$client = new Elastica_Client(array(
	'servers' => array(
		array('host' => 'localhost', 'port' => 9200),
		array('host' => 'localhost', 'port' => 9201)
	)
));

From now on, every request is sent to one of these servers in round-robin fashion. Instead of localhost, external servers can be used as well. I'm aware that this is still quite a basic implementation. As some of you have probably already realized, this is not a safe failover method: with two servers, every second request still goes to the server that is down. One idea is to define a response-time threshold for every server; if a server does not answer within it, the query goes to the next server. In addition, it would be useful to store the information about unavailable servers somewhere so it can be used for the next request. That way, only one client has to wait for the unavailable server. Storing this information is an issue, though, since Elastica does not have any storage backend.
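The failover idea could look roughly like the following sketch. It is not Elastica's implementation: the class FailoverPool, its methods and the 30-second cooldown are all hypothetical. The caller is assumed to call markDown() whenever a request fails or exceeds its response-time threshold. The state lives in memory only, so it is lost at the end of the PHP request, which is exactly the storage problem described above.

```php
<?php
// Hypothetical sketch, not Elastica's implementation: round robin that
// skips servers which were recently marked as down.
class FailoverPool
{
    private $servers;
    private $downUntil = array(); // server index => timestamp until which it is skipped
    private $position = 0;
    private $cooldown;

    public function __construct(array $servers, $cooldownSeconds = 30)
    {
        $this->servers  = array_values($servers);
        $this->cooldown = $cooldownSeconds;
    }

    // Marks a server as unavailable for the cooldown period.
    public function markDown($index, $now)
    {
        $this->downUntil[$index] = $now + $this->cooldown;
    }

    // Returns the index of the next server that is not marked down,
    // or null if every server is currently marked down.
    public function nextIndex($now)
    {
        $count = count($this->servers);
        for ($tries = 0; $tries < $count; $tries++) {
            $index = $this->position;
            $this->position = ($this->position + 1) % $count;
            if (!isset($this->downUntil[$index]) || $this->downUntil[$index] <= $now) {
                return $index;
            }
        }

        return null;
    }

    public function getServer($index)
    {
        return $this->servers[$index];
    }
}

$pool = new FailoverPool(array(
    array('host' => 'localhost', 'port' => 9200),
    array('host' => 'localhost', 'port' => 9201),
), 30);

$now = time();
$pool->markDown(0, $now); // pretend the node on port 9200 just timed out

$server = $pool->getServer($pool->nextIndex($now));
echo $server['port'], PHP_EOL; // prints 9201
```

Passing the current time in as an argument keeps the sketch easy to test. Once the cooldown has passed, the server is tried again automatically.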

Load Distribution

This client implementation also makes it possible to distribute the load across multiple nodes. As far as I know, Elasticsearch already does this quite well on its own, but it helps if more than one node can answer HTTP requests. Therefore, the method above is really useful if you run more than one Elasticsearch node in a cluster and want to send your requests to all servers.
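To illustrate how round robin spreads HTTP requests across nodes, here is a small simulation. It does not talk to Elasticsearch and is not Elastica code; the function distributeRoundRobin is hypothetical and only counts which server each request would be sent to.

```php
<?php
// Hypothetical simulation, not Elastica code: counts how many requests
// each server receives when requests are assigned in round-robin order.
function distributeRoundRobin(array $servers, $requests)
{
    $servers = array_values($servers);
    $total   = count($servers);
    if ($total === 0) {
        throw new InvalidArgumentException('At least one server is required');
    }

    $counts = array();
    foreach ($servers as $server) {
        $counts[$server['host'] . ':' . $server['port']] = 0;
    }

    for ($i = 0; $i < $requests; $i++) {
        $server = $servers[$i % $total]; // request i goes to server i mod N
        $counts[$server['host'] . ':' . $server['port']]++;
    }

    return $counts;
}

$counts = distributeRoundRobin(array(
    array('host' => 'localhost', 'port' => 9200),
    array('host' => 'localhost', 'port' => 9201),
), 10);

foreach ($counts as $server => $count) {
    echo $server, ' => ', $count, PHP_EOL;
}
// prints:
// localhost:9200 => 5
// localhost:9201 => 5
```

The split is even by request count only. Round robin does not take into account how expensive each query is or how busy a node already is.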

I plan to enhance this multiple-server implementation in the future with additional parameters, such as a priority for each server, and some other ideas. Please feel free to write down your ideas in the comment section or directly create a pull request on GitHub.

 

Comments  

 
0 #4 Nicolas Ruflin 2011-11-21 21:06
Yeah, there are no persistent objects in PHP. The only possibility is to store it in some kind of storage backend (APC, Redis, MySQL, file). As the only thing that is probably available on every box is the file system, I will implement an option to write the info to a temporary file.

I will probably also add additional params like the ones you have, such as max requests. This will all make much more sense as soon as I can store it somewhere.
 
 
0 #3 Clinton Gormley 2011-11-21 20:13
Hiya

So you're saying that you have to recreate the Elastica object on each page request? I'm not terribly familiar with PHP, but you can't have persistent objects that retain state?

I agree that it doesn't make sense to sniff nodes on every page request.

If persistence is out, then just look at the code path for when the parameter no_refresh is true - that does no sniffing, but still enables round robin and the (temporary) removal of dead nodes.

clint
 
 
0 #2 Nicolas Ruflin 2011-11-21 20:07
Hi clint

Thanks for the link. This is quite interesting. I think I should be able to copy over some of this stuff to Elastica (or all ;-) ). I see you also query information from the cluster about which nodes to access and which not.

One main issue I have is that I can't store the information about the available nodes. Most requests in PHP are quite short, so retrieving all the information directly from the server every time is not the best idea. How do you deal with that? Perhaps I should include a tmp directory to write some information to disk; this would already help.

As soon as I find some time I will study your implementation in more detail.
 
 
0 #1 Clinton Gormley 2011-11-21 19:33
Hiya Nicolas

You may want to take a look at my Perl API which supports live node sniffing and auto-failover and retry. Should be pretty easy to convert to PHP.

https://metacpan.org/source/DRTECH/ElasticSearch-0.47/lib/ElasticSearch/Transport.pm

clint