Getting Netifi Brokers to use Consul for discovery

I have a 4-server cluster with Netifi brokers working with a static cluster config (passing in netifi.broker.seed.address with the 4 IPs works). I want to expand this to an environment with an unknown number of servers, so I am trying to incorporate Consul for service discovery.
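For context, the static config I'm replacing is just that seed property passed to each broker, something along these lines (placeholder IPs, and the exact list format may differ):

    -Dnetifi.broker.seed.address=10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4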

The official docs (https://docs.netifi.com/1.6.6/clusters/) are not very helpful in getting Consul working. I set both properties (netifi.discovery.environment=consul, netifi.discovery.consul.address=http://aconsulip:8500) and nothing works. I dug into the netifi-discovery-consul Java source and additionally tried setting netifi.discovery.consul.url and netifi.discovery.consul.serviceName, still to no avail.

My Consul service definition looks like (netifi-broker.json, placed in /etc/consul.d, which is set as -config-dir for consul):
{
  "service": {
    "name": "netifi-broker",
    "tags": [
      "netifi-broker"
    ],
    "port": 7001
  }
}

Consul is set up on 2 of the servers and operating correctly. It is loading the netifi-broker service, electing a leader, and responding to HTTP requests (per the Consul docs). The Netifi broker does not appear to be trying to connect to Consul to query anything.
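For reference, a quick way to confirm that from the Consul HTTP API is something like this (aconsulip is the same placeholder as above):

    # list the catalog entries Consul has for the service defined in /etc/consul.d
    curl http://aconsulip:8500/v1/catalog/service/netifi-broker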

Any ideas?

Hey @rgrayson, I'll be troubleshooting this later today. I'll keep you posted with any results as soon as possible, and sorry for any frustration this might have caused you. We'll be sure to update the documentation once we find the answers you need.

Thanks! Let me know if you need me to provide any more information to assist in your investigation.

So I’m on an Ubuntu machine and here’s what I’m using to start consul:

consul agent -server -dev

And here’s how I’m starting my broker locally:

sudo docker run --net=host -e LOG4J_NETIFI_LEVEL=debug -e BROKER_SERVER_OPTS=" \
'-Dnetifi.discovery.environment=consul' \
'-Dnetifi.discovery.consul.address=http://localhost:8500' \
'-Dnetifi.discovery.consul.serviceName=netifi-broker-dev' \
'-Dnetifi.broker.ssl.disabled=true' \
'-Dnetifi.authentication.0.accessKey=9007199254740991'  \
'-Dnetifi.authentication.0.accessToken=kTBDVtfRBO4tHOnZzSyY5ym2kfY=' \
'-Dnetifi.broker.admin.accessKey=9007199254740991' \
'-Dnetifi.broker.admin.accessToken=kTBDVtfRBO4tHOnZzSyY5ym2kfY='" netifi/broker:1.6.7

Running that shows us some disappointing realities:

2019-08-08 03:31:47,992 DEBUG c.n.b.d.ConsulDiscoveryStrategy [main] Using consul url http://localhost:8500
2019-08-08 03:31:47,992 DEBUG c.n.b.d.ConsulDiscoveryStrategy [main] Using consul service name netifi-broker-dev
[...omitting irrelevant lines...]
 2019-08-08 03:31:48,478 DEBUG c.n.b.d.ConsulDiscoveryStrategy [OkHttp http://localhost:8500/...] no brokers found in consul for netifi-broker-dev
2019-08-08 03:32:03,482 DEBUG c.n.b.d.ConsulDiscoveryStrategy [OkHttp http://localhost:8500/...] no brokers found in consul for netifi-broker-dev

So this shows that the DiscoveryStrategy is being used, but it's still on the user to register the service (in this case the brokers) in Consul. Once the brokers are registered by "you", the discovery code will consume the data from Consul. We can validate that more or less from the netifi-java code, and from the fact that Consul records the following API usage:

    2019/08/07 22:30:56 [DEBUG] http: Request GET /v1/health/service/netifi-broker-dev?passing=true (723.488µs) from=127.0.0.1:49596
    2019/08/07 22:31:11 [DEBUG] http: Request GET /v1/health/service/netifi-broker-dev?passing=true (204.934µs) from=127.0.0.1:49596
    2019/08/07 22:31:26 [DEBUG] http: Request GET /v1/health/service/netifi-broker-dev?passing=true (129.406µs) from=127.0.0.1:49596
    2019/08/07 22:31:41 [DEBUG] http: Request GET /v1/health/service/netifi-broker-dev?passing=true (125.995µs) from=127.0.0.1:49596

Historically we've never had to manage the discovery registration ourselves because we rely on Nomad to do the registration for us automatically. We use a simple TCP check in Nomad:

service {
    name = "broker"
    port = "cluster"
    check {
        type     = "tcp"
        interval = "10s"
        timeout  = "2s"
    }
}

You could also use an HTTP check via GET /health on the websocket address and port.
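If you aren't on Nomad, the equivalent registration can be done directly against the local Consul agent's API; a sketch along these lines should work (the address, port, and ID are placeholders, and the TCP check just mirrors the Nomad check above):

    # hypothetical values: register one broker's cluster address/port with a TCP check
    curl -X PUT http://localhost:8500/v1/agent/service/register -d '{
      "Name": "netifi-broker-dev",
      "ID": "netifi-broker-dev-1",
      "Address": "10.0.0.1",
      "Port": 7001,
      "Check": {
        "TCP": "10.0.0.1:7001",
        "Interval": "10s",
        "Timeout": "2s"
      }
    }'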

Would having the Brokers manage their registration in Consul be helpful? What version of Consul would you expect support for, and how important is ACL Token support?

@Alan Thanks for the detailed write-up! It helped me resolve the issue. I did not realize that the Netifi brokers did not handle registering themselves as a service. Once I registered the brokers, things started working. In case someone else comes across this thread, I was able to register the brokers by running the following (changing the ID for each unique broker):
curl -X PUT -d'{"Name":"netifi-broker","ID":"netifi-broker-2","address":"10.19.276.2","port":7001}' http://myconsulclusterip:8500/v1/agent/service/register
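And in case it helps anyone cleaning up afterwards, the agent API has a matching deregister call (same ID as above, sent to the same agent you registered against):

    curl -X PUT http://myconsulclusterip:8500/v1/agent/service/deregister/netifi-broker-2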

I do think it would be beneficial for the docs to mention this. In fact, it would definitely help to expand the Consul discovery section to include all of the required steps, with examples.

Thanks again!

Edit: I do think it would help to have the Netifi brokers register themselves. They are passed the service name and can generate a unique ID to use. For the address/port, I would imagine you'd just use netifi.broker.cluster.{publicAddress,port}. BUT, if a user wants to provide more information than the basics, it may make sense to leave the registration external to the broker. Maybe make it an option?
¯\_(ツ)_/¯

Awesome, I’ll pull both of these requests (docs and registration) into tickets.

I imagine we'd want to register the Cluster, TCP, and WS addresses and ports as different services, and mirror into Consul the data that the broker advertises to clients. Similarly, we'd want knobs to turn the registrations on and off; for example, maybe folks don't want to advertise the WS or TCP services.

Then there's the question of using different Service Names vs. the same Service Name with different Service Tags to differentiate between clusters. I don't know if you have any opinions about this, @rgrayson?
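To make the tag idea concrete, I'm picturing registrations that share a service Name and differ only in Tags, something like this (hypothetical values):

    # hypothetical: brokers from two different clusters, same service name, different tags
    curl -X PUT http://localhost:8500/v1/agent/service/register \
      -d '{"Name":"netifi-broker","ID":"cluster-a-broker-1","Address":"10.0.0.1","Port":7001,"Tags":["cluster-a"]}'
    curl -X PUT http://localhost:8500/v1/agent/service/register \
      -d '{"Name":"netifi-broker","ID":"cluster-b-broker-1","Address":"10.0.1.1","Port":7001,"Tags":["cluster-b"]}'

Discovery could then filter with the tag query parameter on /v1/health/service, so the broker-side service name would stay the same across clusters.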

I agree with everything you just said. Different service tags to differentiate between clusters would also be helpful!