All good. I’m rooting for you guys. I think this whole setup is beyond awesome. Let me know where I can help.
While we’re troubleshooting this, out of curiosity are you planning to run and scale the Broker in Docker Swarm? in production?
I think I might know what the problem is - it looks like the lambda and the service are in the same group name. So I think what is happening is the broker is sending the requests back to the lambda. I would suggest setting the lambda up with a different group name so the broker won’t try to route messages to it - so in your lambda environment variables set P_GROUP to something like “rest.lambda.deviceService” so that the lambda Proteus builder users that for its group, and then when you use netifi.groupServiceSocket(“rest.deviceService”) it will point to the service group.
Yep - we setup a Docker Swarm broker. In Proteus 1.6 we’re working on adding a better Discovery strategy. We have an EC2 Tag one that would could work with this and we can definitely look into a DockerSwarm one.
Doh. You are right, makes total sense, and good catch. I’ve been sitting here try to reproduce the issue and I can’t after that client side group name change. I even tried the dual apps setup as mentioned before and shut one down and everything is still humming along.
However, in my original sessions (when I first started the post) I know for a fact that my P_GROUP wasn’t the same as the service side group but I don’t remember what I used for the groupServiceSocket(…) value (hence why it’s hard coded in my code). So maybe that was the culprit at the time.
I’m going to continue to play around and if I run into this again, I’ll post a reply with more details. Thank you so much for the time and for looking into this. Hopefully, I don’t run into this issue again and I can move on implementing this stack in my projects.
I am curious though, I would think that if the lambda code doesn’t actually implement the service, that it would not be considered for routing?
Right now we only look at the group, and destination - we’re adding arbitrary tags nice release. Once that’s done than we can add in the ability to add include this routing information. The broker doesn’t do any routing based on RSocket-RPC - the Proteus client wraps the RSocket metadata in an envelope, and it gets unwrapped on the client so the broker doesn’t actually know what your sending.
Got it. Thank you sir.
- When I finished my last round of testing last night, I wrote my last post and left the system alone.
- This morning, I tried to execute a few lambdas to see if I was still up and running and…nada.
- I have not restarted anything and will not touch anything just in case we want to try something in this “state”.
Back to a similar scenario:
- I no longer see the rest.deviceService (the SB app) registered in the broker.
- I’m assuming something happened in the middle of the night
- I do see the lambda client pop in and out during it’s lifecycles in the logs and the broker ui.
- Lambda request eventually fails.
- In the broker logs, this morning’s tests start @ 2019-01-29 06:44:09,52
- My lambda P_GROUP is “rest.deviceServiceLambdaClient”
- My service group is “rest.deviceService”
- I’ve also included the lambda logs which essentially shows that the Lambda service cuts execution at my 1 minute time out setting.
- I also tried it with a 2 and 5 minute timeout but same result (last two runs in the broker log)
Note: these are from last night’s “fresh start” test where I was stating that everything was good till my last runs shortly before writing this reply.
This looks like an error with the setup payload that some other people run into. We have a hotfix proteus 1.5.4 - Can you try with that?
Do I need to update both the client dependency and the broker version?
You only need to update your client
Got it. Thanks. I will try this tonight and report back in.
Morning. Hope you guys are staying warm. Here’s what I got:
- Updated my client version to 1.5.4 in the service-idl sub project. cleaned, and rebuilt for both the lambda and the spring service.
- Restarted the broker.
- Updated all code on AWS.
- Ran a few lambda tests successfully.
- Left it overnight.
- Tried a few lambda runs this morning with fails.
Again, in the middle of the night, the Spring service client went silent forever. This time, I got this in the broker logs:
2019-01-30 05:15:46,229 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-8] removing destination 39544665-d340-4be2-afe3-d645afec8c5d-1 from group rest.deviceService 2019-01-30 05:15:46,229 INFO i.n.p.b.a.ClusterSocketAcceptor [reactor-tcp-server-epoll-8] removing destination 39544665-d340-4be2-afe3-d645afec8c5d-1 to group rest.deviceService 2019-01-30 05:15:46,229 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-6] removing destination pairRequestLambdaClient-3 from group rest.deviceServiceLambdaClient 2019-01-30 05:15:46,229 INFO i.n.p.b.a.ClusterSocketAcceptor [reactor-tcp-server-epoll-6] removing destination pairRequestLambdaClient-3 to group rest.deviceServiceLambdaClient 2019-01-30 05:15:46,229 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-6] removing destination pairRequestLambdaClient-3 from destination map 2019-01-30 05:15:46,229 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-6] removing destination pairRequestLambdaClient-3 from group rest.deviceServiceLambdaClient load balancer 2019-01-30 05:15:46,230 INFO i.n.p.b.b.RemoteLoadBalancedRSocket [reactor-tcp-server-epoll-8] all members removed from remote load balancer for group rest.deviceService 2019-01-30 05:15:46,230 INFO i.n.p.b.b.RemoteLoadBalancedRSocket [reactor-tcp-server-epoll-8] removing remote load balancer for group rest.deviceService 2019-01-30 05:15:46,230 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-8] removing destination 39544665-d340-4be2-afe3-d645afec8c5d-1 from destination map 2019-01-30 05:15:46,230 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-8] removing destination 39544665-d340-4be2-afe3-d645afec8c5d-1 from group rest.deviceService load balancer 2019-01-30 12:40:06,871 ERROR i.n.u.ResourceLeakDetector [reactor-http-server-epoll-12] LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information. Recent access records: Created at: io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:331) io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185) io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176) io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:113) io.netty.buffer.ByteBufUtil.readBytes(ByteBufUtil.java:433) io.netty.handler.codec.http.websocketx.WebSocket08FrameDecoder.decode(WebSocket08FrameDecoder.java:304) io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489) io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428) io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808) io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:410) io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:310) io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) java.base/java.lang.Thread.run(Thread.java:834) 2019-01-30 14:21:15,446 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-7] received new connection from destination pairRequestLambdaClient-0, group rest.deviceServiceLambdaClient 2019-01-30 14:21:15,447 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-7] new destination connection from access key 9007199254740991, destination pairRequestLambdaClient-0 and group rest.deviceServiceLambdaClient 2019-01-30 14:21:15,458 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-7] adding destination pairRequestLambdaClient-0 to group rest.deviceServiceLambdaClient 2019-01-30 14:21:15,458 INFO i.n.p.b.a.ClusterSocketAcceptor [reactor-tcp-server-epoll-7] adding destination pairRequestLambdaClient-0 to group rest.deviceServiceLambdaClient 2019-01-30 14:21:16,447 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-8] received new connection from destination pairRequestLambdaClient-1, group rest.deviceServiceLambdaClient 2019-01-30 14:21:16,447 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-8] new destination connection from access key 9007199254740991, destination pairRequestLambdaClient-1 and group rest.deviceServiceLambdaClient 2019-01-30 14:21:16,462 INFO i.n.p.b.a.BrokerSocketAcceptor [reactor-tcp-server-epoll-8] adding destination pairRequestLambdaClient-1 to group rest.deviceServiceLambdaClient 2019-01-30 14:21:16,462 INFO i.n.p.b.a.ClusterSocketAcceptor [reactor-tcp-server-epoll-8] adding destination pairRequestLambdaClient-1 to group rest.deviceServiceLambdaClient
Is something happening to the rest.deviceService service? Because if it was printing this means all the connections from the broker to that service are closed and its removing information about it.
From what I can tell at 05:15:46,230 the spring service leaves the broker and then at 14:21 the lambda connects - which I bet is timing out because there is no service to handle the request.
If the connection was getting closed on the spring application it should automatically reconnect. Maybe something is causing the connection between the broker and the Spring service to be killed, but there is a keep-alive so there should be errors on the broker is it’s running. Is proteus.client.keepalive.enable set to false on your Spring service? Can you think of any reason why the Spring service would shutdown?
I didn’t know there was a “proteus.client.keepalive.enable” so I imagine it’s set to whatever the default is. Would that be “netifi.proteus.client.keepalive.enable” in my application.yml? I’m using the “proteus-spring-boot-starter” btw.
The service isn’t shutting down. This app actually has a scheduled task that sends a message to another AWS service every 30 mins.
As far as connectivity, I thought about this myself. I mean, I have seen strange things with Docker Swarm networks before but even if there is some interruption, I would think there would be some recovery mechanism? Also, I’ve turned off the broker before and see exceptions on the Spring side but in this case, there’s nothing. Tonight, I can try removing Swarm from the equation by just running the broker and app locally and just going through the same exercise.
Now, does the broker try to reach out to the Spring app outside of its client side initiated connections? Meaning, the Spring app’s container doesn’t have any exposed port other than the single port that is used for the Spring Web side of the app. I don’t have to expose 8001 (or some other port) so the broker can talk to it do I?
Depending on what I find out tonight, I’ll also try to work on an entire sample project that I can easily repeat the issue but be able to hand it over so we are on the same page.
Also, is there a minimum Protobuf and Spring Boot version that’s needed for this?
The client will attempt to reconnect anytime the connection is broken. The client should tear the connection and try to reconnect if the keep alive failed. The broker is still taking traffic because I can the lambda connecting.
[quote=“hsteidel, post:35, topic:66”]
As far as connectivity, I thought about this myself. I mean, I have seen strange things with Docker Swarm networks before but even if there is some interruption, I would think there would be some recovery mechanism?
[/quote] There is - it should try to reconnect to the broker when a connection breaks - you should see messages in the Spring applications log files (maybe console out if it’s docker) about it trying to reconnect to the broker and printing error messages if it can’t. I looked at the code keepAlive is to true so if it isn’t getting keep alives it should time out the connection and then try to reconnect if the connection is half open.
All applications connect to the broker - the only thing you need to expose is the broker.
Is it printing anything in the Spring application logs about failed connections?
Protobuf 3.6.x and Spring 2.1.x - the current version imported proteus-spring is protobuf 3.6.1 and spring-boot 2.1.1
Just wanted to drop a quick update:
I’ve ditched my code and I’m currently only playing with the Proteus quickstart: https://github.com/netifi/proteus-spring-quickstart/tree/master
One caveat is that I have set the proteusVersion to 1.5.4. I want to eliminate anything “different” about my own code and I’m now focusing on the “server” side client of the quickstart as I think that’s where my problem is.
My current hypothesis is that, given:
- A stable network connection with all the right ports open, etc.
- No requests are made to the service over a period of time (not yet determined)
- The service doesn’t shutdown or reboot during the evaluation period of time
- The broker doesn’t shutdown or reboot during the valuation period of time
- The server is removed from the Proteus broker (reason unknown)
- The service will not be seen again until the service app is rebooted (meaning it doesn’t self heal)
In all my tests I’ve seen that, in this state:
- The service side code doesn’t seem to think it’s disconnected
- Any new client requests are not able to communicate with the target service
- Rebooting the service “fixes” the issue
Since I’m seeing this from a “black-box” approach, I’m trying out a few scenarios over the weekend to see if there is anything environmental about what I’m seeing but I am going to continue to use the single Docker deployment of the broker. Ah, also, when I start a round of tests, I’m always making the client perform a few requests just to ensure everything is good to go.
- Last night, the quickstart service did disappear on me at some point in the night. (So I was able to reproduce the issue with the quickstart but it was a shot in the dark / impromptu test so I’m going to do it again more formally and collect logs)
- This morning at around 8:45am, I restarted everything but I put the client in a “loop” to make requests every 10 seconds. So far, everything is humming along perfectly and I suspect that it will be fine until tonight around 9pm when I stop it (I don’t think I’ll see the issue).
I am trying to POC a scenario where I know for a fact certain clients are not going to generate any data after 5pm and on the weekends (business hours only). I anticipate other services (different idls) will send data all around the clock but still, I need all services to be available every morning and not disappear on me. I’m still hoping it’s something super simple and incredibly dumb on my end that’s the culprit, I’m just trying to corner the problem so I can discover what it is.
If you have any tips or anything I can enable on the broker or clients to help produce logs that are useful, I welcome them.
Can you enable rsocket frame debugging in your spring client - basically sent the logger
io.rsocket.FrameLogger to debug. I would like to see if any keep alive frames are being sent.
If there aren’t any exceptions in the spring application it feels like a half open connection - the spring app will try to reconnect as soon as it doesn’t have a connection. Here’s the code where it does it for reference:
The keep-alives should kick in and detect the connection died and reset. I will run a test on my side with a broker that just there and see if it does the same thing.