A Windows View into DevSecOps Success at Bluestem Brands
Best practices in optimizing management of their Windows-based systems
With roughly $1.5 billion in annual sales, Bluestem Brands is the parent company to seven well-known eCommerce brands, including Fingerhut and Appleseed's. In 2020, the company faced increasing financial pressures and sought out ways to streamline operations. The Bluestem Windows DevOps team set their sights high, proving they were as strong, determined, and adaptive as the prairie grass for which their company is named. An agile shop already, the Bluestem Windows DevOps team knew the answer to their current challenges was to address operational inefficiencies by normalizing their systems and automating everything they could. To do this, they made the decision to migrate off Microsoft SCCM (which they had been using for OS updates and compliance), upgrade from Chef 15 to Chef 17, and implement a policy-as-code approach.
During this session you'll learn how the Bluestem Windows DevOps team streamlined Windows patching, saved hundreds of hours related to compliance checks, completed two Chef Infra Client upgrades, and implemented a fully automated Chef Infra cookbook pipeline, including deployment of true pre-prod environments for testing, all in less than a year! The team will also share lessons learned and best practices for those interested in optimizing management of their Windows-based systems.
A quick background of Bluestem Brands. Bluestem Brands is headquartered in suburban Minneapolis, and is the parent to five dynamic ecommerce retail brands. Bluestem offers a unique mix of retail and payment options for a diverse range of customers with a wide range of financial needs. Whether it's through our unique retail business model or diversity of brands or our company culture, we are forging our own path. With roughly $1.5 billion in annual sales, we at Bluestem have our sights set high. With so many distinct retail brands and a national footprint, there is always something exciting happening at Bluestem Brands.
Our Northstar brand, which includes our Fingerhut brand, has a distinctive value proposition, offering great national brand merchandise with unique payment options to a wide range of customers. That provides a unique and better shopping experience.
Our Orchard brands, which include Haband, Draper's & Damon's, Appleseed's, and Blair, offer both uniquely designed and national brand fashion. We focus on serving each brand's customers with courtesy and respect, providing them with the products and services that they want and need, along with outstanding customer service. We're continually innovating and improving.
There is a passionate commitment to the company and to our customers that you just won't find in most other organizations. Our values of trust, teamwork, and transparency are apparent in our words and actions every business day. There is something different here at Bluestem Brands. And also, we are [INAUDIBLE] hiring. So go ahead and hit the link here if you would like to start at Bluestem careers.
So I'll just quickly give some background on our Chef environment, and we'll get into more of the details later in the presentation. Our environment consists of Microsoft Windows Server 2003, 2008 R2, 2012 R2, 2016, and 2019. When we started our journey, the Linux team already had an existing Chef infrastructure to get us started. With that, we built our Jenkins pipeline with GitHub Enterprise. And finally, we built our own servers in the cloud using Packer and Terraform. Here are just some of the challenges that we ran into while we were building out our Chef environment.
Some of them were SCCM limitations: the UI-based compliance processes can be cumbersome, and there was limited ability to automate flows as part of CI/CD pipelines. There were change management bottlenecks, with limited ability for multiple people to make changes given our cookbook structure at the time. We had cookbook dependency issues with policies that were hard to rebuild in an automated fashion, and orphaned processes. And another big challenge for us when we started building out our Chef environment was that our company had actually filed for bankruptcy, and shortly after, COVID-19 hit and the world checked out. And so with that, I'm going to go ahead and hand it over to John Haggerty, who's going to talk about Chef adoption. Thanks, Tyler.
So I'll be touching a little bit on how we went through the process of adopting Chef. You can go ahead and go to the next slide. I'll be touching on some key points here: stability, patching, application deployments, compliance, and reporting. I'll be talking about our Chef journey, where we were, where we're at now, and the future plans we have for Chef. So where we were: we were using an application called SCCM, System Center Configuration Manager. We were using that for all of our patching, application deployments, compliance, and reporting. And the issue that we had with SCCM is stability. The program itself can be challenging to use, especially if you haven't had any formal training on the product.
At times, it can be difficult to figure out why SCCM lists certain machines as active endpoints when, in fact, they're not. And it's really hard to figure out why the client can't be pushed down to an endpoint. What we've found is that moving away from SCCM and to Chef, at least for servers, has really been rock solid. So that's been a great help for us. Going into the patching process with SCCM, we literally had eyes on glass during patching. We couldn't rely on the system to make sure that our servers got patched within their assigned patch window. We would actually log on to every server to ensure that patches were applying, which really is not feasible long term. And this goes back to the stability issue, as we would run into client problems.
And most of the time, we didn't actually catch those until we were in the middle of our patch window. So how moving to Chef helped us with patching is that we actually just moved right to WSUS: we use Chef to configure WSUS on the client side using the WSUS cookbook.
The Chef client cookbook we're using to configure the client side, basically just to tell it which WSUS server to talk to, along with any other client-side configurations. And that's really all we're using Chef for there. Then we actually use PowerShell scripting to configure the WSUS server side: approving updates, syncing between our WSUS servers, managing our patch group maintenance, and so on and so forth. What we saw was an increase in patching stability, to the point where we no longer need to be watching every patch window. So that's obviously a great improvement. Next, I'll be talking a little bit about our application deployments. Again, we used SCCM to deploy our monitoring and security applications.
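As a rough illustration of what that client-side configuration looks like, here is a minimal sketch in the attribute style of the community `wsus-client` cookbook. The attribute names, server URL, and target group below are illustrative placeholders, not our production values:

```ruby
# attributes/default.rb -- illustrative placeholders, not production values
default['wsus_client']['wsus_server']  = 'http://wsus.example.local:8530'  # which WSUS server to talk to
default['wsus_client']['update_group'] = 'NonProd-Week1'                   # target group used for patch scheduling

# recipes/default.rb -- let the community cookbook write the registry settings
include_recipe 'wsus-client::configure'
```

The point is that the Chef side stays small: it only pins each node to the right WSUS server and group, while PowerShell handles the server-side approval and sync work.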
And with that, we really needed someone who was an expert in SCCM to do this. We did lose some of that expertise and tribal knowledge over time. Another drawback is that it didn't have a CI/CD pipeline. This means there may or may not have been any local dev testing, and it may or may not have been tested in some lower environment. Chef gives us the ability to test our deployments using Test Kitchen, and we always deploy to non-production first as part of our CI/CD pipeline, which Jason will be going over later on.
Next, I'll go over how we do compliance. Again, we were doing compliance with SCCM. We didn't have consistent behavior, and we encountered unwanted behavior due to the lack of a CI/CD pipeline and the ability to develop locally. Chef InSpec and Ohai attributes, combined with some PowerShell scripting, give us the ability to completely automate our audits. That really is a huge timesaver: going into these audits, we used to have to log on to each server and take screenshots of the configuration. That's a thing of the past.
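To give a flavor of those automated checks, here is a minimal InSpec control sketch. The control name is made up and this is not one of Bluestem's actual audits, though the registry key is the real location Windows uses for the Network Level Authentication setting:

```ruby
# controls/rdp_nla.rb -- illustrative control, not one of the actual audits
control 'windows-rdp-nla' do
  impact 1.0
  title 'RDP connections must require Network Level Authentication'
  describe registry_key('HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server\WinStations\RDP-Tcp') do
    its('UserAuthentication') { should eq 1 }
  end
end
```

Running a profile of controls like this with `inspec exec` (or through the Chef compliance tooling) replaces the log-on-and-screenshot routine with a generated report.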
Next is the reporting aspect. We were using SCCM reporting. SCCM is great, but it comes with a whole slew of canned reports that we didn't need or use, and it was collecting a bunch of information we didn't need either.
And so it was hard to find the report we really wanted. And there was limited ability to automate the reports. Chef Automate is easy to use, and we leverage the API wherever it makes sense. You can go to the next slide. Here I'll be talking about what we went through to get where we are today, and some of the mistakes and successes we made along the way. Back in 2015, we started seeing that Chef and Microsoft were working together to make Chef a better solution on Windows. We saw Jeffrey Snover, the father of PowerShell, speaking at ChefConf that year, introducing Windows PowerShell DSC and how it's supported in Chef.
At that time, we were interested, but not really ready to make a move to get off SCCM and move to Chef. Fast forward to 2017: we went through some intro Chef training and kind of got our feet wet, and that's where we started experimenting with Chef. In 2018, we started a vendor-led initiative to try to get Chef on Windows. This wasn't a complete success. The problem is, we tried to bite off too much, really. What we were trying to do was get everything off SCCM (patching, compliance, app deployments, reporting), along with trying to do server builds, SQL configuration, and IIS configuration. So it was really too much, and it really wasn't a success. But what we did do is engage with Chef professional consulting services, and we scoped things down to a smaller engagement. Basically, we just focused on getting patching and compliance off SCCM and into Chef. That was a huge success, and after that engagement, we had Chef running on everything.
So over the year 2020, we were able to build cookbooks for base server configuration: configuring Chocolatey, configuring PowerShell execution policies and repositories, configuring .NET and firewall policies, and implementing a bunch of CIS configurations. We even went as far as using Chef as part of our Packer imaging process, and we use it to deploy updated code to our existing servers as well. Where we're at today is we're focusing on removing the remaining functionality from our current configuration management tool, which is managing antivirus. We're going to try to do that with Chef and manage antivirus as code. Along with that, we're going to try to standardize our application deployment processes.
Right now we have a mix between Chocolatey, Chef, and SCCM. We are going to be taking a look and seeing what Habitat may have to offer there. And we're going to choose something and make that the standard. And then we're also going to try and tackle SQL Server baseline configurations. So that is all I have. And with that, I'll hand it off to Jason Arvin to go through our CI/CD pipeline. Hey, thanks, John. Hi, everyone. I'm Jason Arvin, senior DevOps engineer with Bluestem. I'm going to share a little bit about our CI/CD pipeline, starting with a quick overview of its components, then walking through the entire flow, and lastly, talking about some challenges that we overcame, and some that we're still working on.
So on the first slide here: we're using a Jenkins pipeline, and we have Jenkins installed on Windows Server 2016. Being that we are Windows DevOps, our Groovy script leans heavily on PowerShell Core and Windows batch scripting. We're using GitHub as our code repository, and we require feature branches, pull requests, and peer reviews before we can merge to master and deploy the code to our environment. For Test Kitchen, we're using AWS to deploy two EC2 instances when we run Test Kitchen: one Server 2016, another Server 2019. Go ahead and move the slide there. Here is the flow, in swim lanes, of our Chef pipeline for most of the changes that happen within it. The cookbook pipeline is represented in blue, and the policy files, or the policy pipeline, are represented in orange. The top swim lane there is our local development.
We start by developing new recipes locally on our laptops and run Test Kitchen with our AWS EC2 instances. Once we're happy with our local code, we'll push that to a feature branch. As we work through to the next swim lane, we run Cookstyle, we clean up policy files, and we run Test Kitchen on our feature branch through Jenkins. Once that comes back successful, we'll have a peer review and a pull request, and then we'll merge to master.
Once we merge to master, we'll run Cookstyle again, clean up, and then trigger our policy pipeline. The policy pipeline starts with cleanup by removing all the lock files in each policy directory, runs `chef install` to install the new policy files, and runs Cookstyle for typos. Then we run Test Kitchen again for each policy. Once all is happy and good, we'll deploy to non-prod. After non-prod, we'll let that bake for a day or two. Then we'll put in a change request and push those changes to our prod environment by manually running the Jenkins job with our prod deploy parameter, just a checkbox to deploy straight to prod. Now, in that second-to-last swim lane, and I guess the last swim lane as well, there's a loop over policy directories. That's kind of crucial to how we designed our policy directories and our pipeline. We're putting all of our policies into one Git repo, so we loop through each policy and deploy it.
Go ahead and move to the next slide. When we loop through each policy, it was originally a challenge because Test Kitchen, when deploying new EC2 instances into AWS, would take 5 to 10 minutes per directory. As our Chef environment grew, we had more and more policies, so our pipeline runs were growing from 10 minutes to 20, 25, 30 minutes, which just became a little too long and kind of annoying. Especially if you fat-finger or typo something and need to change the code again real quick, it's another 30-minute wait before that policy is actually pushed out. So what we did, in one of the stages in our Jenkins pipeline, is use PowerShell Core instead of regular PowerShell, because PowerShell Core has the option to run a foreach loop in parallel. As shown in the code in the bottom right there, we loop through each directory in parallel and run Test Kitchen. That took our pipeline runs from 25, 30 minutes down to 5 to 10 minutes, which is much more manageable when trying to deploy code to our non-prod environment to let it bake before we push to prod.
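The slide shows the PowerShell Core `ForEach-Object -Parallel` version; for readers who want the shape of the idea without the slide, here is an illustrative analogue in Ruby. It is a sketch of the fan-out pattern, not the actual pipeline script, and it assumes whatever command you pass (here, `kitchen test`) is on the PATH:

```ruby
require 'open3'

# Fan out one `kitchen test` (or any command) per policy directory, in parallel,
# and collect [directory, success?, combined output] for each run.
def run_in_parallel(policy_dirs, command: 'kitchen test')
  policy_dirs.map do |dir|
    Thread.new do
      output, status = Open3.capture2e(command, chdir: dir)
      [dir, status.success?, output]
    end
  end.map(&:value) # joins every thread and gathers its result
end

# The pipeline stage passes only if every policy's run succeeded.
def all_green?(results)
  results.all? { |_dir, ok, _output| ok }
end
```

With several policy directories, this turns a serial stage into roughly the duration of the slowest single run, which is the same effect the team saw going from 25-30 minutes down to 5-10.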
Next slide. One other challenge we're still working on is that most of our servers are naturally domain-joined. So there's some configuration that we want to test that is specific to our domain servers. When we launch our Test Kitchen instances, we don't have them join the domain, because we don't want to go through the trouble of removing them cleanly from the domain after a Test Kitchen run that would only last 10 or 15 minutes. So when we need to test something domain-specific outside Test Kitchen, we do have a couple of servers in our VMware environment that are domain-joined and fully configured with Chef. We'll log into one and build out the folder structure you'll see on the right there, in a temp drive or temp directory: the cookbooks directory, and then the test directory and the recipes directory.
That allows us to run chef-client locally and test just the one recipe we want before we run it through the pipeline, because there are some instances where we have recipes that don't run during Test Kitchen: they'll always fail because they're looking for a domain-specific component, and the EC2 instances are not domain-joined. So on the left there is the one-liner we run after building out the file structure. It's just `chef-client -z` for local mode and `-o` to override the run list; then you put in the name of your cookbook and the specific recipe you want to test, and the output below will show whether the test was successful or not.
Once we're happy with that, we'll commit it to our feature branch, but we'll have to add an exclude in Test Kitchen so that that recipe won't run in Test Kitchen. Again, because the instance isn't domain-joined, it would just fail the pipeline. Once that exclude is in, we'll push to non-prod, wait a couple of days, and push to prod. And then we'll finally see the results of that new recipe we've developed in our environment. And that's all I have. Thanks.
Now we'll have Dustin Giles talking about our ServiceNow integration. Hello, how is everyone doing? Thanks, Jason, for introducing me. My name is Dustin Giles, lead engineer here at Bluestem. I'll be going over our journey with the Chef CMDB integration with ServiceNow. I do want to start by saying I'm not a ServiceNow expert, but I do play one on TV. So enjoy. Before the Chef CMDB integration, we started with ServiceNow Express as a POC. We slowly started building out our CMDB over two to three years. We had limited buy-in and a limited budget for the project. It wasn't until last year that we had the buy-in and the desire to build a more robust CMDB. Prior to the Chef integration, we would build our own integrations.
Although they were a lot of fun, building, testing, and figuring out how to hit different tools' APIs or CLIs just took a long time. We also found that with some of the tools we were utilizing alongside SCCM, we had to build custom integrations; we were using WMI calls per server. It was good, but we had to hit every server a couple of times, and it was just a little clunky. And then with SCCM, we lost some of our engineers, so some of the talent left with them. As well, in general we're looking to get rid of SCCM, and this would be a good opportunity to move off of it. We were also seeing data inconsistency. For example, we'd pull from VMware, and the OS data might say Windows 2016, but in reality it was 2019. So we couldn't rely on some of our data sources, and that can begin to be a big problem. Finding the right source was about half of our battle. Another thing we were doing before the Chef integration was using a lot of SCCM reports, especially for things like what software is installed on our systems. We later found we were building too many reports; it was getting very littered with all different kinds of reports that didn't need to be there, and we didn't like all the clicks you had to do with the reports.
So you can go to the next slide. As we started pushing through with this integration, the integration was new, so that was one of the challenges. We actually found out about the integration right around ChefConf last year, and then we started almost immediately. The problem there is, you usually don't have a lot of places to google when something's new, or any community docs. So we struggled a little bit there. We jumped right in and started hammering away at what this Chef integration could do for us. One of the things that we found (and it's probably a miss on our side, because we jumped right into it) is that the Chef integration does populate quite a few fields out of the box. These would be default fields that you would not have customized or made yourself.
One field that did catch us out, although we had a lot of the ServiceNow information backed up, was the description field. In the graphic to the right, you can see a before description and an after description. Before, we would describe our server and what it does in that description field. Once we implemented the integration, we found that it was describing the OS and the build number, so it overwrote what we had. The other challenge we had is that there's very little ServiceNow expertise in-house. So we didn't write any background scripts, which you could do in ServiceNow; we heavily utilized our existing skill set with PowerShell. Go ahead to the next slide. So, the benefits of the Chef plus ServiceNow integration. We made ServiceNow our one-stop shop for installed software, client runs, client reports, and attributes. This data appears to be pulled every four hours, in batches of 50, by default. In the screenshot, you can see in that picture the Chef version and that the client run was a success.
Our client report does show a failed status, so we will have to look into that. The other thing you can see is the attributes table. I'm going to go into that a little further in the next slide or two, but go ahead and go to the next slide. So that attributes table we're using, x_chef_automate_attributes, has a lot of data, but maybe some people don't know how to utilize it. So we went to our known tools, the ServiceNow API and PowerShell, to pull that information. On the right-hand side, you can see the different data we had in the attributes table. It's all in there, it's JSON, and it's hard to read. So how do you pull that? We went to PowerShell and the ServiceNow API, and down below is an example of the PowerShell code to do this. I'm actually going to demo this so you can see it live. Go ahead to the next slide. It's always fun doing a demo; I always refer back to my buddy Jason. So here you can see our URL. You're able to hit the attributes; I'm searching for the server name TESTSNCHEFCONF. We do an Invoke-RestMethod to get the results, and then I use ConvertFrom-Json to convert it to a hash table. You will need PowerShell 7 to use the -AsHashtable parameter. What that provides you is a list of different things inside of that attributes table. So here you can see the different things you can dig into.
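For anyone who would rather script this outside PowerShell, a rough Ruby equivalent of the same flow is sketched below. The instance host, credentials, and query field are placeholders, and the exact field layout of the x_chef_automate_attributes table may differ from this sketch; the JSON-string-to-hash step at the end is the part that corresponds to `ConvertFrom-Json -AsHashtable`:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Query the ServiceNow Table API for a node's Chef attribute records.
# Hypothetical sketch: host, credentials, and the `name` query field are placeholders.
def fetch_attribute_records(host, node_name, user, password)
  uri = URI("https://#{host}/api/now/table/x_chef_automate_attributes")
  uri.query = URI.encode_www_form('sysparm_query' => "name=#{node_name}")
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(user, password)
  request['Accept'] = 'application/json'
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  JSON.parse(response.body).fetch('result')
end

# The attributes arrive as one big JSON string; parse it into a nested hash,
# the Ruby analogue of `ConvertFrom-Json -AsHashtable`.
def parse_attributes(record, field: 'attributes')
  JSON.parse(record.fetch(field))
end
```

From there you can walk the hash (kernel, memory, installed packages, and so on) just like the demo does with the PowerShell hash table.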
You have platform family, installation type, memory, AD domain. So you can dig through a lot of that. I'm going to go through each of these, and you can see: here's the CPU, which tells you 1 real CPU and then the number of cores. The kernel will show you the OS build version and the OS caption name. You can dig into some of these a little further as well. And then I'll just quickly show you memory. You can find uptime. And one of the cool things we found in there is all the installed packages, so you can see what you have installed, directly from ServiceNow. You can also combine a couple of different properties from the hash table and quickly get a full report from this type of data. Go ahead and go to the last slide, Tyler. The next steps that we have for the future: we're looking at adding more data to our CMDB, utilizing that attributes table. We feel like we've just scratched the tip of the iceberg, so we're going to be going through more of that. The other thing we've started reading up on is how to change the defaults for the data feed. The data feed controls how often ServiceNow gets the data populated into it.
Currently it's every four hours, and it does batches of 50. I showed the little batch file example that we're going to tweak: maybe longer intervals and maybe bigger batch sizes. We'll play with it, so wish us luck. It does seem like it's not hitting every one of our systems, because we do pull in all of our Windows and Linux servers, and we have quite a few. And then finally, I just want to thank everyone for the opportunity to share our journey with the Chef integration with ServiceNow, and I'll pass it off to Steve.
Hello, I'm Steve Perkins. I'm a DevOps engineer in the Windows environment at Bluestem Brands. To summarize, I'm going to cover our process for keeping our Chef clients up to date, and discuss an issue that we had to overcome with updates. To start with, we'll go over the reasons for wanting to stay up to date. Now, like Tyler mentioned before, we had Linux running on Chef prior to our implementation on our Windows servers.
The Linux team had run into some major roadblocks on major version upgrades. We wanted to avoid those kinds of issues by staying current, and of course take advantage of the latest and greatest Chef has to offer for our environment and our process. As far as the actual update process, it's very straightforward. We take our updated installer and stage it on a Chocolatey deployment server. We then change the version number and checksum in the default.rb for our baseline policy file, as shown below. The pipeline then deploys to non-production servers, which get updated, allowing us to test for any issues before we go to our main production environment. I know it sounds pretty straightforward.
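That version-and-checksum change is only a couple of lines of attributes. A sketch of what the edit in `default.rb` might look like; the attribute names, version, URL, and checksum here are illustrative placeholders rather than a real release pin:

```ruby
# attributes/default.rb -- illustrative placeholders, not a real pin
default['chef_client']['version']  = '17.9.26'  # target Chef Infra Client version
default['chef_client']['source']   = 'https://chocolatey.example.local/chef-client-17.9.26.msi'
default['chef_client']['checksum'] = '<sha256-of-installer>'  # verified before install
```

Because the pipeline picks this change up like any other cookbook commit, the non-prod fleet updates and bakes before the same pin reaches production.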
As we all know, there are bound to be issues. Next slide, please. The main issue we ran into recently was a roadblock where some of our older servers were running PowerShell 2.0, and the updater utilizes the ConvertTo-Json cmdlet, which is only available in PowerShell 3.0 and up. To solve this, we found a function called ConvertTo-STJson, created by Joakim Borger Svendsen; details and code can be found at the link shown onscreen. We take that function, add it to a PowerShell profile, and then create an alias for ConvertTo-Json inside the profile, which gives PowerShell 2.0 the same functionality. This allowed the same scripting that worked on all the other servers to work on the ones with the older versions of PowerShell. We had also scripted an update to some of the core .rb files for Chef Infra to allow profiles to load, and to change how the ConvertTo-Json cmdlet gets handled.
Next slide, please. With that, I'd like to conclude our presentation on the Windows Chef environment at Bluestem. Today we covered Bluestem's background and environment, our process of Chef adoption, a breakdown of our CI/CD pipeline, details about our ServiceNow integration, and how we keep our Chef clients up to date. I'd like to give a shoutout to our Chef success partners, John Roach and Skylar Lane. We wouldn't be where we are today without them. And I want to thank you for attending our presentation. [APPLAUSE] Great. Thank you, Steven and the rest of the Bluestem team. Absolutely great presentation. Tons to take in through all that.
We have received a few questions from the audience out there, if the team wants to come back on camera. Hello. And I'll try to parse these out by presenter as we receive them. One question. I guess, Jason, I'll start this one off with you. On the pipeline you showed, are you using InSpec files for testing? How does that play into the pipeline? We are using InSpec; we have a separate pipeline for that. This one was specifically our cookbooks and policy files for Chef Infra, but we do have InSpec running with our Chef Automate environment in a completely separate pipeline. Awesome. John, a question. You mentioned that you're using the Automate API for everything you can. Can you give examples of that, and what drove you to using the Automate API? Yeah, sure.
So we do have audits that take place, and we are using a lot of InSpec. So it's really convenient: when we get audit requests to provide evidence, we can just hit the API, generate reports, and pass them along. It makes it very simple, quick, and easy, and we don't have to spend hours logging onto servers and taking screenshots of configurations. Great. Looking through the questions here. OK, Dustin. [LAUGHS] ServiceNow: where do you view your ServiceNow data? Do you have a dashboarding tool, or how does the reporting roll up for that? Yeah, we have a couple of different ways. We can use the GUI itself, logging into ServiceNow. And we're also really heavily using PowerShell to create custom reports via the ServiceNow API, selecting just the fields we might need. One of the things I quickly showed in the demo is the installed software; if you ever want to find what's on your servers, it's a really quick way to populate that. Great. And there's a follow-on question there: how do you use that data to drive remediations? Have you built any automated processes, or how does that tie into your remediation process? So one of the biggest places we've actually seen the benefit is doing a Microsoft true-up: pulling the CPUs and knowing the different core counts for the audit itself.
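To give a concrete shape to "hit the API and generate reports", here is an illustrative Ruby sketch against Chef Automate's compliance reporting API. The endpoint path and `api-token` header follow Automate's reporting API, but treat the URL, token, and filter fields as assumptions; the `summarize` helper is the kind of post-processing an audit script would do with the response:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Ask Chef Automate's compliance reporting API for reports matching a profile.
# Illustrative sketch: the URL, token, and filter fields are placeholders.
def fetch_compliance_reports(automate_url, token, profile_name)
  uri = URI("#{automate_url}/api/v0/compliance/reporting/reports")
  request = Net::HTTP::Post.new(uri)
  request['api-token']    = token
  request['Content-Type'] = 'application/json'
  request.body = { filters: [{ type: 'profile_name', values: [profile_name] }] }.to_json
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
  JSON.parse(response.body)
end

# Boil the payload down to per-node pass/fail rows -- the evidence handed to auditors.
def summarize(report_payload)
  report_payload.fetch('reports', []).map do |report|
    { node: report['node_name'], status: report['status'] }
  end
end
```

A scheduled script built on something like this replaces the hours of manual screenshot collection John describes.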
So our Microsoft true-up was a good way to show us some of the detail we were able to populate with Chef Infra. Great. Throughout the presentation, you called out policy files. A lot of Chef users out there are still a little scared off of policy files. Is there any advice on how to get started with policy files, and what the first steps are? I can take it. So yeah, we were a little scared at first too, because the Linux team is not using policy files, and when we started down the journey of trying to adopt Chef and implement it on Windows, we weren't really knowledgeable about policy files. But with the engagement with Chef professional services, they really walked us through how to configure them and how they work. And after getting just that base knowledge, I think it's the way to go. Even the Linux team is coming over; they're thinking about switching to policy files instead.
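For anyone looking for that first step, a minimal Policyfile really is just a few lines. The cookbook and policy names below are made up for illustration:

```ruby
# Policyfile.rb -- minimal illustrative example
name 'windows_baseline'                 # the policy's name
default_source :supermarket             # where dependency cookbooks are resolved from
run_list 'my_baseline::default'         # exactly what chef-client will run
cookbook 'my_baseline', path: '../cookbooks/my_baseline'  # pin our cookbook to a local path
```

Running `chef install` resolves everything into a `Policyfile.lock.json`, and `chef push` uploads that locked set to a policy group, which maps neatly onto a non-prod-then-prod promotion flow like the one Jason described.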
Awesome. OK, I have one more question out there, for teams still using SCCM and pull-based systems and maybe not pipelines yet. You went on this incredible journey over the year. How do you get started? How do you start talking to your management about the benefits, and move forward with the strategy of pushing everything into a pipeline? Well, if you want infrastructure as code, SCCM is really not the tool. SCCM is a powerful tool and it does have its place, and I'm sure a lot of companies are happy using it. But I would just say start small, right? Have one or two things that you want to accomplish, and focus on those, and finish those before you start moving on to the next thing. You don't have to try to take everything out of SCCM and put it into Chef all in one go. We are still migrating the last bits off SCCM and into Chef. So it is a process. Great. OK, I lied.
There's one more question, on the upgrade to Chef Infra Client 17: what are some of the biggest benefits you've seen? Yeah, I can take that one. Sure. The biggest benefit we've found is getting away from the Supermarket audit cookbook; with Chef Infra 17, the audit functionality, the compliance phase, is built into the client. So we worked again with our Chef success team to remove our dependency on the audit cookbook and just use the built-in compliance phase, which was super easy and super simple, and now we don't need to hit Supermarket for it.
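For context, the built-in compliance phase is driven by the same `node['audit']` attribute interface the audit cookbook used, so the migration is mostly deleting the cookbook dependency and keeping the attributes. An illustrative sketch (the profile name and identifier are made up):

```ruby
# attributes -- illustrative; same node['audit'] interface the audit cookbook used
default['audit']['reporter'] = 'chef-automate'  # send results straight to Automate
default['audit']['profiles']['windows-baseline'] = {
  'compliance' => 'admin/windows-baseline'      # profile fetched from Automate, not Supermarket
}
```

With those attributes in place, the client runs InSpec at the end of each converge and reports results without any extra cookbook in the run list.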