Spinnaker Summit 2019 Preview: Debugging production issues in any environment can be challenging, and Spinnaker has its own production learning curve. Problems aren’t always replicable in a smaller environment, and debug messages can be verbose, which makes it hard to triage what’s happening.
Our DevOps Chat guest Chuck Lane, Salesforce lead software engineer, is giving a talk on “Debugging and Profiling Spinnaker Applications Live” at the Spinnaker Summit 2019. In Chuck’s talk, you’ll learn skills like remote JVM debugging, custom profiling builds and the magic of figuring out what’s going on with a multithreaded microservice using htop.
Chuck’s talk is on Saturday, Nov. 16, at 3:40 PM PT. Spinnaker Summit 2019 is Nov. 15-19 in San Diego.
As usual, the streaming audio is immediately below, followed by the transcript of our conversation.
Transcript
Mitch Ashley: Hi, everyone. This is Mitch Ashley with DevOps.com and you’re listening to another DevOps Chat podcast. Today, I’m joined by Chuck Lane. He’s a lead software engineer with Salesforce.com. He works on Sales Cloud, which is the Salesforce you and I use, know and love, or our CRM capabilities. Chuck is talking at the Spinnaker Summit 2019 in San Diego, and his talk is about debugging and profiling Spinnaker applications live. And I think live means, like, live while they’re running, in production, and see what’s going on. We’ll explore what that is. This talk is on Saturday, November 16th, at 3:45 p.m. in San Diego. Chuck, welcome to DevOps Chat.
Chuck Lane: Thank you. It’s nice to be here.
Ashley: Excellent. Awesome to have you here. Would you start by just introducing yourself? Tell us a little about you and what you do at Salesforce.
Lane: Yeah. So, my name is Chuck Lane. I am, as you mentioned, a lead software engineer. And basically, my chief responsibility at Salesforce is to help bring Salesforce into using Spinnaker for our public cloud based deployments. So, as we make a transition to try to move away from first party architecture and move over into the public cloud, one of the technologies that we’re using to do that is Spinnaker and I’ve established myself as a subject matter expert in Spinnaker. And so, I kinda help to bridge the gap between the traditional development and deployment cycle and what that looks like using Spinnaker in a containerized world.
Ashley: Okay, great. You used a term there I don’t know if I’m familiar with. You mentioned something about moving from first party architecture to the public cloud. What’s that mean about the environment you’ve been in that you’re, and what you’re moving to?
Lane: Yeah. So, basically, what I mean there is that, historically, Salesforce has owned their own data centers, and we’ve deployed to those data centers. We are moving, embracing in a big way public cloud architecture, whether that’s through AWS, whether that’s through GCP, or any of the number, any other number of cloud offerings.
And so, as opposed to doing what some people would call a lift and shift where you just take your software that’s meant for, to be run on hardware that you own and move it into the cloud, we’re doing a fundamental re-architecture of the software to make use of all of the great things that cloud architecture allows us to take advantage of.
Ashley: Mm-hmm. Fantastic. That’s interesting to hear a little bit about your own evolution at Salesforce. So, your talk is about using Spinnaker, you know, really about debugging and finding out problems that are happening in production, but sometimes you might struggle with trying to replicate it into a smaller environment to get the same problems working.
Can you tell us a little bit about what sort of led you down this path to figure out how to do this? Was it a big problem that was happening or something you saw repeated over and over that there wasn’t a good solution to and you figured out how to do that? How did this come about?
Lane: Yes. So, basically, as you’re taking workloads and migrating them over, I mean, when--you know, when you just start with a small subset of workloads, you know, the path is relatively straightforward. And, you know--hopefully, anyways--and you don’t run into too many issues that can’t kind of easily be solved.
But the reality is, as you transfer more and more workloads over to the public cloud, that’s--you know, the devil’s in the details, right? So, that’s really where things can pop up where you’re hitting various limits or software isn’t performing in the way that you would expect it to perform. And these are the situations where we--where it can be very beneficial to jump into production software and really, you know, sometimes slap a debugger on it and see exactly what it is that it’s doing that differs from kinda what you expect. You know, and that can help you kinda tailor your workloads in such a way that you can make things run more smoothly.
Ashley: Mm-hmm. I know sometimes using debuggers, enabling them--that can be helpful, but it also can be too much information, trying to sort through what all is happening, trying to find what should you be looking at. What are some of the challenges you’ve found by turning on debugging or using debuggers?
Lane: Yeah. Well, so, I mean, there definitely is that problem that you’re saying just as far as too much information that’s coming out. One of the nice things about Spinnaker and the way the services are written is, you have very fine-grained tuning over which libraries inside of Spinnaker you tell to print out debug information.
Ashley: Mm-hmm.
Lane: So, if we know that there is a hiccup with one of the data binding layers or a hiccup with one of the authorization layers, then we can, through the config files, we can--and Java settings--we can really target that explicit directory, or I’m sorry, that explicit library and say, “Give me all the debug information that you have.”
Ashley: Mm-hmm.
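The idea of raising the log level for one library while leaving the rest of a service quiet can be sketched in a few lines. This is a language-neutral illustration using Python's standard `logging` module, not Spinnaker's actual configuration: the logger names (`service`, `service.authorization`, `service.databinding`) and the helper `configure_logging` are hypothetical, and the JVM-based Spinnaker services have their own configuration mechanism that this sketch does not reproduce.

```python
import logging

# Hypothetical logger names standing in for libraries inside one service.
ROOT = "service"
AUTH = "service.authorization"
BINDING = "service.databinding"


def configure_logging(debug_targets):
    """Keep the service at INFO, but switch the named libraries to DEBUG."""
    logging.getLogger(ROOT).setLevel(logging.INFO)
    for name in debug_targets:
        logging.getLogger(name).setLevel(logging.DEBUG)


# Attach a console handler that lets everything through; the per-logger
# levels set below decide what is actually emitted.
logging.basicConfig(level=logging.DEBUG,
                    format="%(name)s %(levelname)s %(message)s")

# Suppose the suspected hiccup is in the authorization layer only.
configure_logging([AUTH])

auth_log = logging.getLogger(AUTH)
binding_log = logging.getLogger(BINDING)

auth_log.debug("checking permissions")    # emitted: this library is at DEBUG
binding_log.debug("binding request body")  # suppressed: inherits INFO from "service"
```

Because child loggers inherit the level of their nearest configured ancestor, only the targeted library becomes verbose, which is the kind of narrowing described above.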
Lane: But, you know, failing that, I mean, there are definitely times when it’s been advantageous and the best course of action is just to go down and actually look at the code and see what it’s doing. And, again there, it’s best to do that under--you can’t do it in production, usually, but what we can do is simulate a load that’s similar to what we would see in production and really, really take a look at what’s going on to help us identify those algorithms that might be O(n squared) instead of O(log n) or something like that.
Ashley: Mm-hmm. Now, are there some specific techniques--I know in your description of your talk, when you talked about using remote JVM debugging, custom profile builds, even using htop, you know, a UNIX command or a Linux command to help you figure out what’s happening with multi-threaded microservices. It sounds like a variety of different approaches that you’ve kind of uncovered and learned that you can use to figure out what’s happening.
Lane: Yeah. And so, I mean, in general, we like to run as closely to the open source build as we can, the ones that are provided by Spinnaker, which is ultimately provided by Google. But there definitely are circumstances where we need an extra tool, be it htop, be it something like Glowroot, which is a Java application profiler where what we’ll do, then, is we’ll go in and we’ll build a custom image using the Spinnaker images as a base and then tack on those additional libraries and put them into the Spinnaker ecosystem and then launch them up just to see what additional information we can get out of there.
And so, you know, one scenario where that was really useful to us, we ran into a scenario where Clouddriver, which is--Clouddriver is the main tool that talks back and forth to all of our different cloud infrastructures.
Ashley: Mm-hmm, mm-hmm.
Lane: Sometimes calls to Clouddriver were taking upwards of, well, two to three minutes to respond. Now, you know, timeouts are set in such a way that if it doesn’t hear anything back in about 30 or 60 seconds, then it just disregards that load and, you know, so that created quite a bit of problems for us.
By building a custom Clouddriver image that had htop in it, we were able to see that the majority of the processes that were running were actually running basically commands to reach out to the Kubernetes clusters and get a list of all of the namespaces that are in the Kubernetes cluster. And, talking it over with some of the Spinnaker developers, what we found is that if you have 15 or so clusters that you’re connecting to Spinnaker, then that’s not so much of a problem to go and query each of those to get the list of namespaces. But as you scale up, on the order of 400 or 500 different clusters the way that we do, then a lot of the delays and a lot of your time can be just essentially kubectl calls waiting to return to you the lists of namespaces that you need to go and scan.
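A rough back-of-the-envelope calculation shows why this only hurts at scale. The sketch below assumes the namespace-listing calls run one after another and that each takes about 0.3 seconds; both are illustrative assumptions, not figures from the conversation, and the function name `total_scan_seconds` is hypothetical.

```python
def total_scan_seconds(clusters, seconds_per_call):
    """Time to list namespaces on every cluster, one call after another."""
    return clusters * seconds_per_call


SECONDS_PER_CALL = 0.3  # assumed average latency per call, not a measured figure
TIMEOUT_SECONDS = 60    # upper end of the "30 or 60 seconds" timeout mentioned

for clusters in (15, 500):
    total = total_scan_seconds(clusters, SECONDS_PER_CALL)
    verdict = "exceeds" if total > TIMEOUT_SECONDS else "fits within"
    print(f"{clusters} clusters: {total:.1f} s, {verdict} the {TIMEOUT_SECONDS} s timeout")
```

Under these assumptions, 15 clusters cost a few seconds while 500 clusters cost about two and a half minutes, which is in the same range as the two-to-three-minute responses described and well past the timeout.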
So, we implemented a workflow-based solution for that, basically, where we let our teams know that when they’re creating their clusters, they should use Terraform to go ahead and create the clusters--I’m sorry, the namespaces that they’ll be using. And then, if they need to use a namespace after the fact, then we provide a pipeline that they can use that will dynamically add in an additional namespace that their cluster will start to scan. So, that saves us from scanning all the clusters all the time, which was causing quite a bit of performance bottlenecks.
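The shape of that workflow, declaring namespaces up front and adding to the list later instead of discovering them by querying every cluster, can be modeled with a small registry. This is a conceptual sketch only: the class `NamespaceRegistry`, its methods, and the cluster and namespace names are all hypothetical and do not represent Salesforce's pipeline or any Spinnaker API.

```python
class NamespaceRegistry:
    """Tracks the namespaces each cluster is expected to use."""

    def __init__(self):
        self._by_cluster = {}

    def declare(self, cluster, namespaces):
        """Record namespaces created up front, alongside the cluster."""
        self._by_cluster.setdefault(cluster, set()).update(namespaces)

    def add_namespace(self, cluster, namespace):
        """Add one namespace after the fact, as a follow-up pipeline would."""
        self.declare(cluster, [namespace])

    def namespaces_for(self, cluster):
        """Return the known namespaces without querying the cluster."""
        return sorted(self._by_cluster.get(cluster, set()))


registry = NamespaceRegistry()
registry.declare("cluster-a", ["payments", "billing"])  # created with the cluster
registry.add_namespace("cluster-a", "reports")          # added later on request
print(registry.namespaces_for("cluster-a"))
```

The lookup is a local dictionary read, so its cost does not grow with the number of remote clusters the way a discovery scan does.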
Ashley: Yeah, I would imagine that would have some overhead, maybe a lot of overhead if you’re doing that frequently. So, it sounds like a way to both reduce overhead, but also to get the information faster.
Lane: Exactly, exactly--yep, yep.
Ashley: Mm-hmm. Very cool. So, in your talk, are you going to be doing any demos? I know there’s always the demo Gremlin, or are you gonna be just showing, you know, talking about what some of these techniques are?
Lane: So, I plan on doing some demos. As far as--
Ashley: That’s really cool.
Lane: Yeah. You know, I may do the, what’s the--the Easy Bake Oven a little bit as far as, “Here’s the behavior and here’s what we’ve found in code.” But from what I understand, a lot of that stuff can kinda be dependent on Internet connectivity at the actual site. So, if it’s something where we have good Internet connectivity, then by all means, I plan on walking through a couple of debugging scenarios, as close to what we would do in real time as possible.
Ashley: Yep. Well, you know, someone somewhere--it’s been a while ago--gave me the great advice of, you know, “Have your live demo and have your disconnected demo ready in the background.” [Laughter] So, you can always at least show something locally if you can’t get on the net. So, if you’re depending upon the network--I’m not sure if you are for your demo, but always a good lesson, right?
Lane: Sure, exactly.
Ashley: Great. Are there any other kind of lessons learned, common mistakes, or mistakes that you or others might have made along the way, kinda so there’s hard things you learned by trial and error that you plan on sharing?
Lane: I mean, well, there are--whew. You know, I’ve been working on Spinnaker for a couple of years. So, there’s definitely been a lot of hard lessons learned, here. But honestly, what I would do is, I would encourage anybody who wants to get into Spinnaker to not just hang around the Slack channel, because the Slack channel does have a tendency to get overrun with, you know, just kinda people posting their stack traces and just saying, “Hey, has anybody ever seen this before?”
And, you know, I’ve gotten a lot more success by going through the commits, looking at the people who actually authored the code, and then reaching out to them directly with more than, “Hey, can you explain this to me?” but rather, you know, “Hey, I see what you did here and I see what you did here. I’m running into problems with these lines. Do you have a different approach, or is there something that I can be doing differently?”
And the other thing that I just can’t overstate is the value of being a member of one of the special interest groups.
Ashley: Hmm, interesting.
Lane: So--yeah. So, I’m a member of the Kubernetes V2 special interest group that’s led by Eric Semene and Ethan Rogers from Armory, Eric’s from Google. And it has, you know, it has been just an absolute wealth of information and, you know, honestly, I don’t know if we ever would’ve gotten nearly as far as we had without those two people.
So, you know, yeah, I would just say that the community is really friendly and we always welcome new members who are ready to learn.
Ashley: You know, I think both of those are great suggestions, and I really appreciate that you made those. Because, one, things like the Slack channels on projects, open source, those can be a bit intimidating. Sometimes they’re not approachable, because there’s just so much noise, so much happening on it and people reaching out for help like you’ve talked about, “Here’s a stack trace I’m trying to figure out.”
But also, your recommendation of reaching out to the code authors, you know what, it’s--people like to help each other. And, you know, it might seem like, “Hey, the people who wrote this aren’t gonna have time to bother with me”--they love to hear from people that are using their stuff to talk about it and help them out, but also hear about how they’re using it, and of course, they’re always looking for ideas and feedback and stuff like that. It sounds like you’ve had that kind of an experience.
Lane: Yeah, absolutely. I mean, and you know, the big thing is just, you know, as a coder, it’s easy to tell the people who are coming to you and just kinda wanting you to fix it for them and the people that are coming to you that have really tried to tackle it themselves. And, you know, I can’t speak highly enough about the latter rather than the former. I mean, you know, just give it your best shot and when you get stuck, reach out to somebody and it’s, you know, it can be immensely valuable.
Ashley: It’s kinda like going to a foreign country. Wherever you are, if you’re American going somewhere else or vice versa, everybody appreciates you trying to speak the native language, and at some point, when they see you struggle enough and how far you can go, they’re glad to help you and, you know, speak in your language.
So, same kind of thing with helping people with code. If you’re gonna just fob your problem off onto the developer of it--not appreciated so much. But they appreciate that you tried to take it as far as you could. As you mentioned, go look at the code and figure out what’s going on.
Lane: Yeah.
Ashley: You’ll get immense respect, you know, even if you aren’t a coder at that level, at that level of software developer, it’ll mean a lot to the developers.
Lane: Absolutely, absolutely.
Ashley: Well, hey, I think you’re gonna have a fascinating talk, and I love that, you know, this idea of trying to figure out what’s happening in production and some of the techniques that you’ve come up with and developed and have experienced and the fact that you’re sharing those with others. I’m curious, do you contribute any code to the Spinnaker project in any areas? Are you primarily a practitioner user of it?
Lane: So, I have. I’ve got a few small PRs that have been pushed through, but really, the bulk of my commits right now have actually been to the Spinnaker website, the documentation. So, you know, I don’t know Java or Groovy or Kotlin, maybe, as well. Some of the patterns they use are a little bit--I come from a .NET world, so they’re a little bit foreign to me.
But yeah, I’ve definitely written a number of different documentation pages and I found that that’s a great way to kinda get in and I’ve even got some PRs that are coming up soon that aren’t doc related. So, yeah, hopefully, you’ll see my name more.
Ashley: Excellent. Well, you know what, documentation is important, too. There’s some folks doing talks at Spinnaker Summit that, you know, kinda ran into that point where, “Hey, there’s documentation for doing it one way, but not under this set of configurations or software.” So, that’s a contribution, too, so congratulations for being a part of the community and for sharing, also, your experience at the Summit.
So, Chuck, appreciate you being on the podcast today.
Lane: Oh, thank you. It’s been an absolute privilege. Thank you so much.
Ashley: Absolutely. Fantastic. I wish you all the best with your talk, I’m sure it’ll be great, and hopefully folks listening to this podcast will draw some more interest and bring folks to listen to you.
So, I’d like to thank our guest today, Chuck Lane. He’s lead software engineer at Salesforce.com, so you can imagine the environment he’s working in. There’s some super good lessons that Chuck’s bringing to the table. He’s gonna be talking at Spinnaker Summit 2019, which is in San Diego, November 15th through the 19th. His talk is debugging and profiling Spinnaker applications live, and his talk is on Saturday, November 16th at 3:45 p.m.
I’d also like to thank you--you, our listeners--for joining us today. This is Mitch Ashley with DevOps.com. Have a great day and be careful out there.
-- Mitchell Ashley



