
Keynote

How we Deployed Chef and Migrated from Red Hat Satellite

Learn how Discount Tire migrated from Red Hat Satellite to Chef, and more!


After almost 20 years of using Red Hat Satellite to manage our Linux systems, Discount Tire had deployed over 700 systems in AWS using Chef as the primary deployment tool. We reviewed what was left on premises and chose to work with Chef to convert all of our Red Hat Satellite systems over to management and patching with Chef.

Our technology has evolved over the years, and it was a little over 20 years ago that we finally decided to get our presence on the internet. And the way we ended up doing that was we hired out to a guy that did web development in North Scottsdale, who worked out of his garage.

And those of you who have been in IT for more than 35 seconds probably realized that a garage does not make a great data center. So in about 2002, we decided to bring that web server in-house. At the time he was running it on an SDI machine. We brought it in-house, and that was our first Red Hat Linux server in our data center.

So management was a little reluctant to go down the path of Linux. At the time, Linux was still relatively new on the scene and had not been as heavily adopted. Plus there was the fact that Discount Tire was very much an IBM-centric company at the time. A lot of our stuff ran on iSeries, and most of our IT developers worked on the iSeries platform. So that was the direction we had been going, and going in the direction of Linux was met with some reluctance.

But eventually our Red Hat footprint started to gain some ground, and we started deploying various other technologies beyond just the website. We installed Lotus Notes Domino, because, yes, we were a Lotus Notes Domino shop. We had our ESP software running on Red Hat. Our business intelligence tools like MicroStrategy were running on Red Hat. And then we also had some servers for developer tools like Subversion, Git, Jenkins, and that kind of thing.

And then as things continued to grow and security became a little bit more of a concern, we bought the Tenable suite of scanning and security products, and all of those are running on our Red Hat platform as well.

So as that grew, one of the things we realized was that we needed some way of managing all these systems, and that's when we implemented Red Hat Satellite, in about 2008. When I took over in 2012, one of the first things I noticed was that the patching was way behind.

And at the time there also wasn't much of a process in place, no method for patching all of our Linux servers and keeping them up to date. So we started working on that, taking a look at what we could do, and eventually we developed a standard for our patching processes with a monthly rotation, and that served us really, really well for a number of years.

But then we started moving into AWS, and because of the technologies we were using, we had to start experimenting with other flavors of Linux, like SUSE and Amazon Linux. And Satellite wasn't really an option for managing those systems for a couple of reasons. One was that Satellite can only manage RHEL; it can't manage other types of Linux.

And the second reason was that Satellite didn't really have the capability of managing ephemeral systems. As things got stood up and torn down, Satellite wasn't good at handling that, at identifying when a new server was put in place and when one was being decommissioned. There wasn't really an automated way to do that.

And so as our presence in AWS continued to grow, we had some challenges ahead of us. I'm going to let Michael go into a little bit more detail about how that process evolved, and then we'll get back into how we found our way to Chef.

Thank you, Dan. So we were tasked with moving our website over to a new platform in 2015, and at the time we decided to host it out in AWS. That presented some challenges with how we spin up infrastructure, automate AWS configurations, and deploy and manage applications.

We set the standard from the beginning that we wouldn't perform any manual operations in preprod or prod environments. To accommodate that, we looked at a multitude of tools like Ansible, Chef, and Puppet, and that evaluation led us to AWS OpsWorks, which utilized Chef Solo. We felt it was the fastest way to dip our toes into the Amazon world with the smallest learning curve and still get some of the benefits of Satellite.

We chose to follow our standard SCM practice, and we started by creating a monolithic repository with long-lived dev, QA, stage, and prod branches. OpsWorks allowed us to point directly at those branches for its lifecycle events. It made iterative development really easy, and we didn't have to package or build Cookbook archives. But it took us a couple of years to realize the pitfalls of that SCM decision.

When you have multiple people writing Cookbooks in different stages of the lifecycle, it's hard not to step on each other at times. Let's say Dan was ready to push a Cookbook from dev to QA and I wasn't. Couple that with our level of knowledge during that period, and we'd just push the whole dev branch to QA and go on with it. Fortunately, we didn't break any of the applications during that time frame.

After we came to the realization of our choices, we set out to solve that particular issue and started researching solutions on the internet. We found an article called The Berkshelf Way, and that became the basis for our next generation. The decision was made to migrate our SCM process to micro-repositories using Gitflow and Berkshelf to package our Cookbooks.

With those changes in place, it was easy for us to create an automated Jenkins pipeline to iterate Cookbook versions and create tar archives for use with our OpsWorks stacks. It's more ideal, but it's more of an administrative headache when you want to do things quickly. It traded the ease and quickness of pointing stacks directly at branches for separating Cookbooks out into individually packaged artifacts.

So with that, we also made a little progression and started to play with Chef Automate out of necessity. We had certain applications that required us to use a certain version of RHEL, and we also wanted to use Amazon EFS, and those particular items were not supported in OpsWorks at the time. But through that progression, the more we used Chef Automate, the more we realized that was the direction we wanted to go.

And like Dan said earlier, we were unable to manage Amazon instances with Satellite, so software patching was really hard to do. We had to do a manual update, or we had to create a new AMI, destroy the system, and recreate it.

Tying that in here, we had to find a better way. With Chef Automate, it was our vision that we could unite our on-premises and cloud systems under one pane of glass, negating the need for Satellite. We just had to figure out the patching dilemma and determine the scope of the project.

At its inception, the scope included all of our Linux servers, which numbered in the 600s, and their applications, and we were also running out of support time on our Satellite contract. So with that race against time, we engaged Chef, and they walked into the picture and helped us pare down the initial scope to just the on-prem instances and their configuration policies.

And thankfully, with our OpsWorks stacks, we already had an existing foundation we could use to build our on-premises Cookbooks. So we copied everything that was related to our configuration, and after some tweaking, testing, and various trial and error, we eventually came up with a working set of Cookbooks that we could use for our on-prem Chef environment. And with that, I'm going to hand it back to Dan to talk about The Nexus.

Thanks, Mike. So as we've been talking about, the whole plan here was to migrate from Satellite onto Chef. And to do that, we actually made heavy use of Satellite itself, since it was already in place. One of the nice things about Satellite was that it was able to report on config files, so we could see if config files had drifted from what they were supposed to be.

So what we were able to do, as we were building out all of our Cookbooks and configurations in Chef, was start running those Cookbooks in some of our development environments. Then we could look at Satellite and see whether the config files had drifted too much, or whether Chef was actually configuring the systems the way they were supposed to be. That was actually pretty useful.

Then when it was time to actually start the regular deployment, we used Satellite to push out config files. Some of those included the validator PEM that we needed to contact our Chef Infra Server and the client config file, and we also pushed out a JSON file for bootstrapping that we could use with the Chef Client launch.

And then we pushed out a script, just a basic bash shell script, that would download the Chef Client, install the RPM, and then run it and point it to our bootstrap JSON file, which basically told it this is the policy that you're going to use and the policy group that you're going to be in.
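
To make those moving pieces a bit more concrete, here is a minimal sketch of what a pushed-out client config and bootstrap JSON might look like. The server URL, organization name, and file paths are placeholders, not Discount Tire's actual values.

```ruby
# /etc/chef/client.rb -- hedged sketch; server URL, org name, and paths are placeholders.
chef_server_url        'https://chef.example.com/organizations/exampleorg'
validation_client_name 'exampleorg-validator'
validation_key         '/etc/chef/exampleorg-validator.pem'
log_location           '/var/log/chef/client.log'

# The separate bootstrap JSON (passed to `chef-client -j first-boot.json` by the
# shell script) would carry the policy assignment, roughly:
#   { "policy_name": "onboarding", "policy_group": "dev" }
```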

So even with all of our testing, we wanted to take things in steps so that we weren't doing too much at once, especially if anything were to break; we wanted to limit the blast radius, so to speak.

So what we did is, the policy that we put in that bootstrap JSON file was just a simple onboarding policy that had nothing in the run list other than Chef Client. We left all of our other Cookbooks out of it for the time being, just so that we could get everything bootstrapped and registered with Chef, have all the nodes reporting in to Chef, and have Chef Client running every 30 minutes.
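
As a rough illustration of that onboarding policy, here is a minimal Policyfile sketch. It assumes the community chef-client cookbook is what keeps Chef Client running on an interval; the names and source are illustrative.

```ruby
# Policyfile.rb -- hypothetical onboarding policy; names are illustrative.
name 'onboarding'
default_source :supermarket

# Nothing in the run list except keeping chef-client itself running on a
# schedule, to limit the blast radius while nodes register and report in.
run_list 'chef-client::default'
```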

Once we had all of that set up, we would take smaller chunks and assign them over to the appropriate policies that actually had all the various Cookbooks running on them. That way we could watch them and make sure that nothing bad was happening before we moved on to the next chunk. So we did it a little bit at a time, which was nice.

So one of the things we wanted to make sure we did was document all of our processes: how we were authoring and updating Cookbooks, and how we were testing everything before it went out. We documented our entire process so that everybody involved would be able to refer to that document and know that when you modify or create a Cookbook, before you push anything, you run Cookstyle and make sure you clean up any syntax and formatting issues.

And then we have a kitchen.yml set up that will allow you to run Test Kitchen locally. Test Kitchen runs locally, but it actually spins up EC2 instances for RHEL 6 and RHEL 7.

And then we do still have some RHEL 5 in our environment, and obviously you can't use EC2 for that, because there is no RHEL 5 in EC2. So for that, we had to run a local Docker image of CentOS 5 and at least get what we hoped would be good-enough tests through that methodology.

So with that, let's take a look at some of the challenges that we had to overcome in order to get everything rolled out and get this project done.

So some of the things that we had to come up with to make this environment work included building a pipeline, because we wanted to automate as much of this stuff as possible. Currently, we use the Atlassian tools, Jira and Bitbucket and things like that, and then we also have Jenkins, which is basically the engine of our pipeline.

So everything starts with Jira. We create a story in Jira, and from that story we can link it into Bitbucket and either create a new repo or create branches on an existing repo, that sort of thing. Then once we pull that repo to our local workstation, we do our work, we run our code, we do all the tests, and then we push it back up to Bitbucket and create a pull request.

And then once it's merged into the parent branch, there's a polling job on Jenkins that sees that you've merged your branch, and it runs a pipeline job. Based on the name of the repo, it will identify whether that is a Cookbook, or a policy, or an InSpec profile, or what have you.

So if it's a Cookbook, all it does is run that one Cookbook through Test Kitchen, also using EC2 instances. We don't have a way to run tests for RHEL 5 there, because on Jenkins everything is running in the cloud. So for RHEL 6 and 7, it'll run Test Kitchen, make sure that Cookbook passes, and then go ahead and publish that Cookbook to our Supermarket.

If it's a policy, it will also run it through Test Kitchen on RHEL 6 and 7 in EC2, but it will run the entire run list so that we can make sure everything in that policy runs from beginning to end without any issues. Then once we get a successful Test Kitchen run, it pushes that policy up into Chef Infra and assigns it to the dev policy group.

And finally, at the very end of that Jenkins job, it creates a new Jira story, a sub-task under the main story. Once we have allowed the policy to soak in dev and we're comfortable moving forward, we can take that Jira story, assign it to Jenkins, and Jenkins will pick it up and deploy that same policy over to the QA policy group, and so on and so forth until you get all the way through production. If it's an InSpec profile, the job grabs that profile and just publishes it directly up to Automate.

So one challenge that we had to overcome when we were setting everything up was that we have a whole bunch of different networks, and subnets, and locations, and things like that. And some of these have different Active Directory servers, different DNS servers, and things like that.

So for instance, we have our corporate network, which is our main network that everybody in-house works on; we also have a DMZ; we have a firewalled-off area for PCI stuff; and we have an alternate data center that mimics all of these various configurations, so there's a DMZ at our alternate data center, there's a PCI network at our alternate data center, and so on and so forth.

So all of these different things required us to have some way to programmatically identify, hey, where does this server point for DNS, where does it point for LDAP, where does it point for NTP, and all these other services? With some great help from Colin from professional services, what we came up with was a custom resource.

We built a hash that would take the first three octets of the node's address and use that to identify what network it was on, so that based on the network we knew you're going to point to this Active Directory server and this DNS server. We could build that all out and push those values into our config files through our templates.
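
A stripped-down sketch of that lookup idea, not the actual custom resource: map the first three octets of the node's IP to per-site service endpoints. The networks, hostnames, and helper name below are invented for illustration.

```ruby
# Hypothetical site lookup -- networks and endpoints are made up.
SITE_SERVICES = {
  '10.10.1' => { dns: %w(10.10.1.53 10.10.2.53), ad: 'dc01.corp.example.com', ntp: 'ntp01.corp.example.com' },
  '10.50.9' => { dns: %w(10.50.9.53),            ad: 'dc01.dmz.example.com',  ntp: 'ntp01.dmz.example.com' }
}.freeze

def site_services_for(node)
  prefix = node['ipaddress'].split('.').first(3).join('.')   # first three octets
  SITE_SERVICES.fetch(prefix) { raise "No site mapping for network #{prefix}.0/24" }
end

# In a recipe, those values feed the config templates, for example:
# template '/etc/resolv.conf' do
#   source 'resolv.conf.erb'
#   variables(nameservers: site_services_for(node)[:dns])
# end
```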

So with that challenge met and overcome, the next big thing was patching. I know we've talked a lot about patching; patching is a big deal. One of the things we wanted to make sure of was that we didn't have untested patches making it into our production environment.

The scenario being, of course, once you push out patches to your dev environment, you have to give it a little bit of time to make sure that nothing breaks, and then you move into QA, and you let that soak, and so on until you get all the way through production.

But the problem is, if you deploy all the latest patches to dev, and then it's not until two, three, or four weeks later that you're deploying to production, new patches could have become available in the meantime. And if you're simply updating all of your packages directly, chances are you're going to get untested patches in your production environment, and obviously we don't want that.

So what we came up with was a Jenkins job that runs on the first day of every month. That job connects to the Red Hat package repositories, takes an inventory of every single package and what version it is at that time, and then writes all of that information into a data bag.

Then we have another data bag that handles all of the scheduling. It's basically just a JSON-formatted data bag that breaks things up into the various policy groups that we have, and for each policy group it specifies a date and a time window that is your patching window.

With that established, we include our patching Cookbook in our base policy, so it runs every time Chef Client runs. But every time it runs, it checks to see if it is inside its patching window. It looks at that scheduling data bag, and if it isn't in its patching window, it just skips on to the next thing.

If it is in its patching window when it runs, then it goes through that big data bag that we stored all of the packages in, looks at every single package in there, and checks whether it's installed on the system. If it is, it compares the versions to make sure it's the same version we already have installed; if it's not, it goes ahead and updates it to the version specified in that data bag.
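
A hedged sketch of that core loop, assuming illustrative data bag names like patching/schedule and patching/packages and a simple package-to-version hash; the real Cookbook may be structured quite differently.

```ruby
# Hypothetical core of the patching recipe -- data bag names, fields, and the
# schedule layout are illustrative, not the real ones.
require 'time'

schedule = data_bag_item('patching', 'schedule')[node.policy_group]
# e.g. schedule = { 'start' => '2020-06-07 02:00', 'end' => '2020-06-07 06:00' }

if Time.now.between?(Time.parse(schedule['start']), Time.parse(schedule['end']))
  pinned = data_bag_item('patching', 'packages')['versions']  # e.g. 'bash' => '4.2.46-34.el7'

  pinned.each do |pkg, ver|
    # Upgrade only packages that are already installed, and only to the version
    # inventoried on the first of the month -- never to whatever is newest today.
    yum_package pkg do
      version ver
      action :upgrade
      only_if "rpm -q #{pkg}"
    end
  end
end
```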

The first Sunday of the month, for example, that would be the window for the dev group, and they would run through, they would get that data bag, they would apply all those patches. And then the following week, QA systems would do the very same thing. By the time you get to production, since you're looking at that data bag and the versions that are locked in there, now you know that you're not getting untested patches that have been released since dev was deployed. So that worked out really, really well for us.

One of the things we realized, too, is that we had to generate a Red Hat token for Jenkins to be able to authenticate and build that data bag. But that token expires after 30 days of not being used, which is not ideal when so many months have 31 days. We didn't want to have to keep generating new tokens every month, so what we did is we created another Jenkins job whose whole purpose in life is to establish a connection with Subscription Manager using that token every week. That just keeps our token alive, and so far that's been working really well.

So one of the obstacles that we have run into with Chef is that Chef doesn't really have a great way right now to run ad hoc commands on a selected group of servers. So if you wanted to grab all of your web servers and maybe run a particular command on them to make a change or to look at something or do some troubleshooting or whatever it is, you can't really do that very well in Chef.

So one of the things that we decided to do is we are going to register all of our on-prem servers with Amazon Systems Manager, and that way we have a way to run ad hoc commands remotely, which will hopefully work really well. But at the moment, that's not actually fully functional. But we're working on that and making some progress in that area.

So with that done, let's go ahead and take a look at some after the fact things that we ran into once the actual implementation of Chef was complete.

So I mentioned that we have a DMZ, and unfortunately for us the DMZ is not able to route to our Amazon environment, which is where our Chef servers are. So having our DMZ servers talk to Chef Infra was going to be a challenge if they couldn't communicate with it.

What we did is we created a proxy server in our corporate network that would stand in the gap and get those servers connected. We used an HAProxy server running on a Red Hat system on-premises. Eventually, we also ended up using that same HAProxy server to help us proxy a Nexus Pro repository that is also up in Amazon, where we have an RPM repository hosting a lot of our custom packages and RPMs that aren't part of the standard Red Hat repos.

So after a couple of weeks of being fully on Chef, we started to find little things that we had overlooked in our deployment, which obviously always happens. One of them was a call I got from our Domino admin, who had been doing some maintenance on his Domino servers. When he restarted a server, the Domino services didn't come back up.

So we jumped on the machine, started doing some troubleshooting, and realized that the problem was that the kernel parameters in the sysctl config file were not correct for those particular servers.

We were using a fairly generic template that we were pushing out to all of the servers across the board, and we had overlooked the fact that the Domino servers required different kernel settings. So we basically went into our Cookbooks and added some conditions so that if it was this particular host name, or this group of host names, or what have you, it would use those kernel parameters instead of the standard ones. And so we were able to get that remediated.
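
In spirit, that conditional looked something like the sketch below; the hostnames, attribute names, and kernel values are placeholders, not the real Domino settings.

```ruby
# Hedged sketch -- hostnames, attributes, and parameter values are invented.
domino_hosts = %w(domino01 domino02)

kernel_params = if domino_hosts.include?(node['hostname'])
                  node['base']['sysctl']['domino']    # Domino-specific overrides
                else
                  node['base']['sysctl']['default']   # generic baseline
                end

template '/etc/sysctl.d/99-chef.conf' do
  source 'sysctl.conf.erb'
  variables(params: kernel_params)
  notifies :run, 'execute[reload-sysctl]', :immediately
end

execute 'reload-sysctl' do
  command 'sysctl --system'
  action :nothing
end
```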

Something else we found, and it took a little while to figure this one out, was an NTP issue. We started getting a lot of authentication problems where suddenly our monitoring tools weren't able to authenticate, and support personnel were unable to authenticate. They had been using cached credentials for a while and nobody really noticed, so eventually, once those caches expired, suddenly people weren't able to log in.

So we ended up, of course, doing some more troubleshooting and determined that it was because our system time had drifted too far. We use Active Directory for our authentication, and Kerberos does not like time drift, so it basically locked the door and kept you out.

What we found is that for certain networks, the NTP server information we were pushing out apparently was not correct. So we were able to correct that and get a new NTP template put out: we updated our NTP templates, and we also changed the custom resource for the sites that I talked about before, giving it new values for which NTP servers to point to based on the network. Once we got that deployed, the problem went away. So that was good.

Another authentication issue we ran into ended up having to do with our sshd_config and access.conf, because we were specifying which groups and users were allowed to connect via SSH into those servers using the AllowGroups parameter in sshd_config. And in access.conf, the very last line was to deny all access, and ahead of that we would put the groups that were allowed to connect.

But again, we were using a very standard config file for those, and we had overlooked the fact that some servers actually had other groups logging into them to do various work. So what we did is we created a hash in the attributes file to list out all the groups that should be in there, and then we put conditionals on certain hosts so that if it were one of those hosts, it would add specific groups to that list. Then we would iterate through that list to build those config files. And so we were able to overcome that challenge.
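
A hedged sketch of that allowed-groups idea; the attribute names, group names, host patterns, and template names are invented for illustration.

```ruby
# Hedged sketch -- attributes, groups, and host patterns are placeholders.
allowed_groups = node['base']['ssh_allowed_groups'].to_a   # e.g. %w(linuxadmins monitoring)
allowed_groups |= %w(webdevs) if node['hostname'].start_with?('web')

template '/etc/ssh/sshd_config' do
  source 'sshd_config.erb'
  variables(allow_groups: allowed_groups)   # rendered into the AllowGroups line
  notifies :restart, 'service[sshd]'
end

template '/etc/security/access.conf' do
  source 'access.conf.erb'
  # Template emits "+ : <group> : ALL" for each entry, with "- : ALL : ALL" last.
  variables(allow_groups: allowed_groups)
end

service 'sshd' do
  action :nothing
end
```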

The other one that we had was sudoers. Basically a similar scenario: we had a very standard sudoers configuration that we were pushing out for our system admins, our application administrators, and our network folks, and a lot of our services, monitoring tools, and things like that use service accounts to log in.

But we, once again, overlooked the fact that on some servers, other people need to run certain commands with elevated privileges. Thankfully, we were using the modular sudoers.d files. So what we had to do there was again create a hash with a list of sudoers files, and then put a conditional based on host name to say, if it's this host name, also include these additional sudoers files in that list so that those people could run with escalated privileges.
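
Roughly, that per-host sudoers.d selection could look like the sketch below; the fragment names and host patterns are illustrative.

```ruby
# Hedged sketch -- sudoers fragment names and host patterns are made up.
sudoers_files = %w(sysadmins netadmins svc-monitoring)
sudoers_files << 'appteam-dba' if node['hostname'].start_with?('db')

sudoers_files.each do |name|
  template "/etc/sudoers.d/#{name}" do
    source "sudoers/#{name}.erb"
    owner 'root'
    group 'root'
    mode  '0440'
    verify 'visudo -cf %{path}'   # refuse to ship a fragment that fails the syntax check
  end
end
```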

So the last thing that we were trying to do, in the interest of tightening things up, was standardize on the CIS benchmarks. And of course, InSpec has built-in profiles for the CIS benchmarks. So one of the things we are going through is trying to remediate the checks where the CIS tests were failing.

One of the things we noticed is that, of course, we would eventually get to a check that said, hey, it's recommended that HTTP services are not running on your server. Well, that's great for most servers, but if you're running a web server, you need those services running, and I don't want to fail that check every time.

So we went back and forth on a couple of ideas based on the history of some of the previous problems that we overcame. Obviously, the first thing we thought of was to create a hash with a list of all of the web servers, and then say, hey, if you're one of these servers, skip over that check. But what wasn't ideal about that was that every time we added a web server, we would have to add it to the list and rebuild and push out new code for our InSpec profile, and we didn't want to have to do that.

So what we came up with, again thanks to Colin, was a tagging procedure. We tag systems with the knife command, we just tag a system as a web server, and then we made a one-time modification to our custom profile that says if you have this tag, skip over this check. That way, if we add a web server now, I don't have to go back in and modify custom profiles or push any code; all I've got to do is run a knife command, tag that system as a web server, and it'll automatically skip that check on that system.
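
A hedged wrapper-profile sketch of that tag-based skip. The benchmark profile name, control ID, and the node_tags input are all illustrative; in practice something on the Chef side has to hand the node's tags to the profile as an input.

```ruby
# Hedged InSpec wrapper sketch -- profile name, control ID, and input name are
# placeholders, not the real ones.
node_tags = input('node_tags', value: [])

include_controls 'cis-rhel7-benchmark' do
  # If the node was tagged as a web server (e.g. `knife tag create <node> web_server`),
  # skip the "no HTTP services" check instead of failing it.
  skip_control 'cis-2.2.x-http-not-enabled' if node_tags.include?('web_server')
end
```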

So I really like that solution. I think that'll work for us going long term. So that's pretty much how we got to where we are, and now I'm going to send this back over to Michael to talk about our next steps and then close us out.

Thanks again, Dan. So right now, our Linux landscape consists of around 560 servers between on-prem and the cloud, and we're mostly running Amazon Linux 2 with a small mix of RHEL in there. It is our intention to utilize Habitat in the future for our applications, but before we go there, we would like to move the rest of our AWS instances out of OpsWorks into our Chef landscape so we can get the benefits of the patching that Dan spoke about earlier.

And with that, we'd like to thank Chef, the support services that helped us get to here, and all of the other companies that helped us along the way. So thank you everyone. And does anyone have any questions?

I know we've got at least one. John and Dan, if you want to turn your cameras on, you can. Mike, you can stop sharing, and then we can have everybody's faces.

There we go.

Magic. So, Dan Webb suggested, have you tried using chef-run for ad hoc commands?

We have not. I'm not even sure we've really explored that. I know that because we're using policies, one option was to use named run lists to do that. But the only issue with named run lists is that you have to know ahead of time that you're going to want to run those particular commands.

So that particular thing wasn't an option. I don't know if chef-run is the same thing as what you're talking about, but that was what we had come up with. But we're always open to more ideas.

Yeah. For sure. Chef is massive, so I think the ecosystem is massive. I'm sure there's more stuff to explore.