
Keynote

Managing Dev Environment with Magic

Learn how Chef helped Prodege rebuild their applications


At Prodege, our development team works on features and updates across our multiple consumer sites. As we acquired new sites and brought on more developers, it became difficult to ensure that everyone's environment was up to date. Using Chef shortened our new-laptop setup from a multi-day process based on docs that were quickly out of date to a simple script that takes moments to run.

Prodege, the brand, is an internet media company with 125 million members across our different brands. They have different ways of earning different variations of points, which they can redeem for gift cards and cash for free. We've given out just under $1.8 billion since we started. One thing I will warn you: this is a lot about process. A lot of these slides are images, and we'll get to the code a little later on in the session.

About four years ago, we were looking to move all of our servers from our data center into AWS, and we needed a way to rebuild our servers before we moved them to Amazon. One of the things we ran into, though, was that we wanted to first rebuild our production environment in our data center before rebuilding everything in Amazon, so we could tell whether a problem was an issue with rebuilding the environment versus an issue with being in Amazon.

Before we could do that, though, we first had to figure out a bit about how our actual systems were built. The application had been built about three years earlier, the folks who had done it had moved on to other companies, and the documentation we had was by that point a little outdated. As you folks know from working with manual systems, every server was considered slightly different, and there was a lot of mixing and matching to see what needed to happen. So we chose Chef as our tool of choice to rebuild our applications, because we needed something iterative. We knew we could not do a golden image and roll out new updates that way. The process in the data center took about three to six weeks to reset a server back to a fresh Windows install -- not exactly a kitchen destroy command. So we had to do it step by step by step, and it took us a few months to figure out all the different parts of our application.

Now, this was also a great project for me at the time, because I was the senior manager for the quality assurance team, and we were looking at how we would get our QA environments to match production more than they were matching our dev environment. One notable difference between the two was that QA did not run a web server the way we did in production. We were trying to figure out how we could replicate the production environment into QA so that we would find issues in the lower environment rather than in production.

Once we were able to rebuild everything in our data center, we then rebuilt it all on Amazon and moved over. One nice note: we used to have four production servers and three QA systems. Now that we have Chef and the automation within it, we've actually built and destroyed a couple hundred servers across production and QA, and in test kitchen we've spun up and down a little over 5,000 systems.
That's a lot easier than the much slower process we had then. So that's why we went with Chef, and that's actually what leads us to our next part. Our development team here at Prodege has been growing for the last couple of years. We've hired, and we've also acquired companies that had hired. And we found that as everyone joined and set up their systems, they had to go and get access to one tool or another.

We had documentation, but it was getting woefully out of date. There weren't a lot of steps, but they were quite specific, and we ran into a lot of sharp edges where, depending on whether you ran a step in PowerShell versus Command Prompt, or where you ran it from, different things would happen. We spent a lot of time debugging these issues for every new developer. And since I had a lot of knowledge of how our applications were built, having first moved them into Chef, I spent a lot of time debugging with these new developers, trying to figure out how we could fix their issues -- oh, if you're missing this permission you probably missed that step, or you can configure this, and so on. We did have documentation, but it was spread out across about a dozen docs, and there wasn't one person in charge of updating it. We also ended up with a broken telephone, where someone would join at the beginning of the year and set up their environment, and near the end of the year they'd help the next person set up their new environment -- but the doc had changed, and they were setting it up the old way, which is what they knew.

Another thing that made this a bit of a problem: as existing developers set up new systems -- hardware refreshes and so on -- they'd bring their configs over from the previous box. On a three-year refresh cycle, that meant configs roughly three years out of date, and looking through those configurations, some settings had been around since before I joined the company.

Last year is when this really became a bigger issue, at the beginning of 2020 when COVID hit. While Prodege is a remote-friendly company, a lot of folks had desktops in the office for working on, and laptops for working from home, meetings, and so on. Everyone transitioned quickly over to developing on the laptop instead of the desktop. We spent a lot of time setting up new systems and trying to track down issues. I myself set up my development environment three times in about an eight-month span due to preemptive hardware refreshes and updates. There has to be a better way, right?

So in order to build this, we first started by grouping all the documentation together and seeing which of those steps we already had recipes for from our application servers. We install Java, and we need it both in our development environment and in our server environments, so we could just pull that recipe in. We were able to pull a decent amount of the recipes in together and start running that in test kitchen on its own policy; each time we got to one error or another, we were able to tweak things. Our applications run on AWS, so we ran test kitchen there for a while, until we ran into a bunch of issues specific to the developers -- that is, Windows 10-specific issues. Windows Server 2016, which is what we use in most of our production environment, has different APIs than Windows 10 does. We had to take a step back and figure out which of those steps were different and how to treat those as well.

We also put in a number of tools for developers -- debugging tools, an IDE, an IDE config. For the most part, a lot of these configurations weren't intended to be automated; they expect you, if you're using a GUI tool, to go through the GUI. We had to spend some time figuring out where the config files were, setting those up directly, and going back and forth.
In the end we also spun up a VMware server just to run the Windows 10 ISOs, and we were able to run test kitchen there too. So test kitchen definitely gave us the huge benefit of being able to run tests against all sorts of systems, sometimes even in parallel -- we could do a smoke test in Amazon, which covered most of the settings, and then the VMware server could handle the more specific Windows 10 issues as we ran into those.
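A setup like that can be sketched in a kitchen.yml. The driver names, platform labels, and suite below are assumptions for illustration (using the community kitchen-ec2 and kitchen-vcenter drivers), not our actual config:

```yaml
# Illustrative .kitchen.yml: Server 2016 suites on EC2, Windows 10 on VMware.
driver:
  name: ec2

provisioner:
  name: chef_zero

platforms:
  - name: windows-2016
  - name: windows-10
    driver:
      name: vcenter   # VMware-backed instances for the Windows 10-specific issues

suites:
  - name: dev-setup
    run_list:
      - recipe[dev_setup::default]
```

Per-platform driver overrides like this are what let one suite smoke-test in Amazon while Windows 10 runs land on the VMware side.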

So the other thing that we ran into -- and this was actually a nice feature we bumped into on the Chef side -- is that we were able to use the edit_resource helper quite a bit. The top of the screen here is from the app cookbook, where we configure log4j: we have an XML file that we push out as a template, and it has properties that we pass in. In the dev cookbook we take the exact same thing -- we already have those properties about the host for the logging, and we already had a dev Graylog versus our production Graylog -- pass those variables in, and then just tweak which cookbook the template was coming from. That allowed us to specify where it needed to be pulled from, because for the dev-specific parts it wasn't worth templating out the variables in the production config. We also added support for AWS WorkSpaces.
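As a rough sketch of that pattern -- the cookbook name, file path, and variable here are made up for illustration -- the dev cookbook can look up the template resource the app cookbook declared and override pieces of it with Chef's edit_resource helper:

```ruby
# In the dev cookbook, after including the app cookbook's recipe.
# Path, cookbook name, and variable are hypothetical.
edit_resource!(:template, 'C:/app/config/log4j2.xml') do
  cookbook 'dev_setup'   # serve the same template file from the dev cookbook instead
  variables graylog_host: 'graylog.dev.example.com'   # point at the dev Graylog
end
```

Note that `variables` here replaces the resource's variable hash wholesale; in a real recipe you'd merge in only the keys that differ between environments.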

We have a number of contractors that we prefer to have using Prodege hardware on Prodege networks, and instead of having them use virtual machines in our offices, we looked at locating the machines close to them, using WorkSpaces in the regions where they were available next to those users. There was one very strange thing with WorkSpaces, which is that the user's directory is on the D drive instead of the C drive as it is elsewhere. So we had to go in and add, in a bunch of places, some logic that says if you're on a workspace. Right now it is using the [INAUDIBLE] to check that; long term, we're going to see if that's among the things we can pull out via Ohai. With that, we were able to switch the user path to be the D drive's Users directory.

One really amazing thing about WorkSpaces, though, even with all the shenanigans in getting it set up: because the user directory is separate from the actual host, WorkSpaces has, as part of the client and the API, a way to say "please rebuild my workspace." What it'll do is build your new workspace, bring along your D drive -- your user directories -- to the new system, install the OS, and so on. Once it started back up, if you had put your dev setup script (I'll get to exactly how that runs a little bit later) on your D drive, you already had a config file, and you could just rerun the install and run commands and be up and running in a little bit.
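A minimal sketch of that path switch -- the detection flag and directory layout here are assumptions, since the real check is the one mentioned above:

```ruby
# Hypothetical sketch: on AWS WorkSpaces, user profiles live under D:\Users
# instead of C:\Users, so path logic has to branch on where it's running.
def user_root(workspace:)
  workspace ? 'D:/Users' : 'C:/Users'
end

def user_home(username, workspace: false)
  File.join(user_root(workspace: workspace), username)
end

puts user_home('mendy')                    # C:/Users/mendy
puts user_home('mendy', workspace: true)   # D:/Users/mendy
```

In a cookbook, the `workspace:` flag would come from whatever the detection check (or eventually an Ohai plugin) reports.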

So from seeing that your machine was completely broken -- or whenever you wanted to restore back to the OS level -- you could be back to developing within an hour and a half to two hours, after the Chef code had run and the AWS side had been updated.

So now we have the majority of what we need to actually get this running; the question is, how do we run it? We were building this, like I said, in test kitchen. Test kitchen handles all the create and destroy work for us: we just said we need a server, we don't need a server, and it would spin one up, destroy it, and so on, for both the VMware and the AWS code. So we simply asked: OK, this process works in test kitchen -- how do we run what test kitchen runs, on nodes, on our own?

Our Chef flow relies heavily on policies and policy archives. There's a link here at the bottom to our repo that has the demo I gave last year. Everything we do gets deployed first as a policy archive into our Artifactory, and then we do a chef push to upload the policy, using that archive. The main reason we do that is so we can have one policy archive that gets pushed to, say, a lower QA environment, left there for a week or two, and then pushed to production while QA moves up as well -- but we're using those same configurations. And with all the Chef and kitchen wording jokes, there was no test kitchen image that worked well for this. So this is what we ended up with as our final product for the script.

The first half of the script installs chef-client. Pretty straightforward. Our systems currently support Windows exclusively; that will likely change in the next year or two, as we're adding support for Linux as well. Right now we expect Windows, so we're using PowerShell's transcript option to log all the steps into a nice little text file.
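The policy flow can be sketched like this -- the policy name and cookbook layout are assumptions, not our actual repo:

```ruby
# Hypothetical Policyfile.rb for the dev setup policy.
name 'dev_setup'

# Where to resolve any community cookbook dependencies from.
default_source :supermarket

# What a chef-client run under this policy executes.
run_list 'dev_setup::default'

# The local cookbook that provides the run list above.
cookbook 'dev_setup', path: '.'
```

From there, `chef install` resolves the lock, `chef export --archive` produces the policy archive tarball (the artifact pushed to Artifactory), and `chef push-archive <policy_group> <archive>.tgz` publishes that same archive to each environment's policy group in turn.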

So if there's anything that needs to be debugged, you can just remote into the desktop, or they can just send you the entire log file. We install Chef, reload the PATH, and the next step is to actually run the chef-client command, with the transcript capturing it. We accept the Chef license -- we have licensing for all our servers via Chef and OpsWorks -- and we call the chef-client command, which I'll walk through.

The chef-client command took some tweaking. We found that chef-client takes the --local-mode option, which is an alias for -z, for chef-zero, so it'll run chef-client without needing a Chef server. We have a config.json file that the developers write before they run it. This includes things that Ohai can't pick up on its own, including the developer's name and their specific environment and server configuration. For example, on most of our systems we just use initials to specify the machine, so I put server name equals MB. Then the recipe pulls the entire code base out of our Artifactory; we push a compiled tgz via our tooling every time there's a new update, and we keep several of the latest, so we can also have an alpha version at times, or a more extensive version while debugging, or whatever else we need -- we can push multiple versions into that folder. We also pair this with a Terraform repository that lets folks put in their server initials -- it can be, like I said, either their initials or their computer name -- and generate all their DNS records for them as well.
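The wrapper logic around that run might look roughly like this -- the config.json keys and file layout are assumptions for illustration, not our actual schema:

```ruby
require 'json'

# Hypothetical sketch: read the developer-written config.json and build
# the local-mode (chef-zero) chef-client invocation described above.
def chef_client_command(config_path)
  JSON.parse(File.read(config_path))   # fail fast if the config is malformed
  [
    'chef-client',
    '--local-mode',                    # alias for -z: runs against chef-zero, no Chef server
    '--chef-license', 'accept',
    '--json-attributes', config_path,  # developer name, server initials, etc.
  ].join(' ')
end

# e.g. config.json: {"developer": "mendy", "server_name": "MB"}
```

The `--json-attributes` file is how per-developer values that Ohai can't discover get into the run.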

So that gets approved and automatically deployed. We've also moved a lot more of our actual authentication steps into Okta and similar tooling. When you run this, the chef-client run will set up everything for your environment and all the tools that you need, using Chocolatey, to make sure you have all the debugging tools and so on. It actually talks to the Bitbucket Server API: it will generate an SSH key for your instance, add it to Bitbucket, and pull the source code locally. So you really can just go ahead. There are a few manual steps in IntelliJ, because we just cannot get around those -- or at least we haven't gotten around them yet. But you run this, come back thirty-ish minutes later once everything is downloaded and configured -- there are a couple of reboots, because Windows likes to have things rebooted -- and once that's all done, you can start developing within a few moments.

So that's what we built, but there are some nice benefits here. First of all, it's a declarative process. We have a very specific this step, then that step, then this step, and this depends on that. It's a lot easier for us to figure out what needs to happen when. And since it's in Git, we have a lot more history, and we allow anyone to make a pull request for updates. If someone notices a bug, or wants to add another resource to one of our property files and so on, they can make that pull request if they like. We also have identical configuration and tools for everybody, so there's no longer this person using this text editor and that person using that text editor.

The editors we're encouraging people to use are on all the machines. We added some logging tools on all the machines. We use tools like Redis and our logging stack, and we include those kinds of databases locally as well, if you want them. We also have identical bugs. One of the things I found out, after about three months of having this out there, was that when I had initially pushed my configuration for IntelliJ, I had actually left all the debug breakpoints in the code, enabled. So new developers would open up the source code, hit Run, and it would stop -- it would hit a breakpoint and wait for the next command. After that was pointed out, we were able to pull those out, turn it off, and refresh. We also had an issue where we had accidentally disabled Git symlinks for Windows.

We handle them a little bit differently, and we didn't know that because of how we were set up. But because there were so many people, we were able to say, "OK, wait, we gave everyone the wrong configuration without symlinks; here's the recommended way to flip it back to include symlinks now that we need them." And we moved most of our packages to Chocolatey. So it's a lot easier for us to say, "Hey, we want everyone on this specific point version of Node.js, run choco upgrade nodejs --version" whatever it is -- or simply say, "Listen, we want to add this tool, we're now encouraging everyone to install VS Code: choco install vscode." And everyone gets all that they need to get there.

And of course, the main thing we did this for is the timing. We went from a day or two down to about an hour of runtime. Figuring across the 20 or so devs who'd done this in the last couple of months, that's about eight weeks of dev time that we've been able to save, in addition to all the other benefits above. One nice thing, though: we added a testimonials page in the repo, and we use it as a good way of showing that we should continue development on this project. So here are some of the folks who have said some things about the dev setup script.
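In recipe form, that kind of tool management is a line or two per package with Chef's chocolatey_package resource -- the package names and pinned version below are just examples:

```ruby
# Hypothetical recipe fragment: developer tooling via Chocolatey.
chocolatey_package 'vscode'        # everyone gets VS Code

chocolatey_package 'nodejs' do
  version '14.17.0'                # pin the point version we want everywhere
end
```

Changing the pinned version in Git and letting the next chef-client run converge is the "everyone upgrades together" mechanism described above.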

My favorite, of course, is from Kaleena, because when she joined, it was a bit of a broken-telephone process between her and the main dev team, and it took her weeks to get configured. She was one of my beta testers for this; it took about a day of debugging with her. The configuration we reached was probably a little bit different, but it was definitely better than it was before. So that was a nice part of being able to work like that, and it's definitely encouraged more people to use it and allowed us to get more out there. So that's what we built. And I'm happy to take any questions, if anyone has any, about what we've put together.

Awesome. That was amazing. As a matter of fact, I'm coming from a developer background -- I was a .NET dev for a decade prior to coming into DevRel recently -- so I honestly thought initially you were talking about another tool called Magic. But now I'm realizing, "OK, so you use Chef to set up development environments." I think that's very innovative; I haven't seen that use case before. And as someone who has set up many a development environment, that's pretty awesome.

Yeah, we were setting up dozens of developers each year, and we ran into all sorts of issues. The most random things would pop up. One thing we ran into was a developer who decided that, instead of running his commands in Command Prompt, he would run them in Cygwin, and things broke entirely differently for no good reason. We had to reimage the computer and start over.

Wow. Yeah. Yeah, that sounds about right. So tell me -- OK, so we thought this was a very innovative way to use Chef.

What are the next steps for you all in deepening your relationship, I guess, with Chef in this way? So there are a couple of parts here. There's definitely the constant updating and making sure everything stays up to date -- we keep making changes on the application side as well, and we've been working on keeping up with new configurations; we're bringing on some new folks in the next few weeks too. We're also considering how we could run this more often, or more on demand. We'd have to figure some things out -- one of the things it does is check out source code for you, and it expects you to be checking out the latest master or main, to be able to say, "This is the current place you should be developing from." We'd need a lot of logic to figure that out.

If someone's currently working in that environment, they'll have made changes to it. In an application environment, Chef is the only thing that should be changing anything, but in a development environment we need to make allowances for the developer to use and configure the machine. The other thing we're looking into is what Chef Desktop can do -- I think that came out last year from Chef -- as another way of allowing these sorts of ad hoc runs, and as a supported way for users to do that, especially as the Chef Desktop team works to include a lot more features for Windows, Linux, and Mac, which we're looking to expand to as we start supporting them. That's great. Great stuff.

Does anyone in the audience have any questions for Mendy? I think this is actually really amazing -- I never even imagined that use case. It definitely isn't a strongly supported use case, and there's actually code inside of Chef that we ran into. For example, in PowerShell we wanted to see about getting user input on the fly, and Chef has code that prevents that. So we had to work out different tweaks, including writing that config file ahead of time rather than doing things during the run, because it's not really meant to be used this way -- but it has definitely worked out. OK. That's awesome.