Keynote

Story of how SAP Selected Chef InSpec for Cloud Security Posture Management

Know how SAP selected Chef InSpec for Cloud Security Posture Management

SAP has undergone rapid growth in public cloud over the last four years, doubling its cloud resources every year across AWS, Azure, GCP and Alibaba Cloud. As scale increased with no end in sight and existing SaaS solutions posed challenges, public cloud security gained visibility inside the company following the Capital One media storm of 2019, and a different approach needed to be found. SAP found that Chef InSpec fit its requirements particularly well and, through containerization, could scale to a previously unimaginable degree.

In this session, Jay Thoden van Velzen, Head of Security Operations, Multi Cloud, will describe their challenges, how Chef InSpec answered them, and how co-innovation and partnership efforts benefited SAP, Chef and its community.

Just to give a little bit of an overview of SAP, because I think we're the really big software company that most people have never heard of, especially if you don't work for a very large enterprise. But SAP is actually a top-three software company in the world. I think it's Microsoft and someone else, which I didn't look up, ahead of us.

We are typically known for our ERP software, but we also do analytics, supply chain management, human capital management, master data management, data integration, and experience management, with packaged solutions for 25 industries and 12 lines of business, and solutions on premise, in the cloud, and hybrid. That's just about the last marketing for the company that you'll hear from me during the session. What is important is that we're dealing with a very big size: we're 100,000-plus employees globally, of which about 24,000 are in research and development.

And one of the things that I always find interesting, because it gives a bit of perspective on the seriousness of the things that we do, is that 77% of the world's transaction revenue touches an SAP system somewhere along the way, which is quite staggering. And if we talk about business transaction revenue, that is 85%. So we deal with really big stuff on a day-to-day basis, and large corporations rely on SAP software to do their financials and to operate administratively. So we take that with a big sense of responsibility.

Increasingly, we provide our software as a service instead of, as was more traditional in the previous decades since the 70s, as on-premise software. And we're increasingly also using public cloud, so that is a big aspect of where I'm coming from. My team, the Multi Cloud team, really interfaces between SAP lines of business and the different cloud providers such as AWS, Azure and GCP, as well as public cloud in China, to provide the services on different types of platform. SAP has always had this motto of, "SAP doesn't pick winners, customers do," specifically when it comes to platforms.

ERP, traditionally, you could run on any database platform that you wanted, as long as it had a reasonable market size, and that was essentially the same idea with Multi Cloud. A lot of our customers have already chosen their own preferred supplier, and we're happy to provide solutions on whichever supplier they want. So it does pose unique cloud challenges, and when we talk to others and they ask us, "Multi Cloud, is that a good idea? We're thinking about it," I usually say, only if you have to. For us, this was an a priori business decision, and I know of a few others that we've spoken to where that was also the case, either for strategic reasons or to pick the best of breed for a particular workload.

But it is not something that you do lightly. Especially if you have to provide security, governance, billing, and so on around it, as the Multi Cloud team does, you end up doing everything multiple times. So that is not something that we necessarily recommend. But if you have to, then you really have no choice and that's what you do. Now, what has also happened is that SAP grew in cloud resources very rapidly to about 30 million to 40 million today.

The Multi Cloud team started in 2017 to create some organization around this. Initially the use of public cloud was very much related to development environments and sandboxes. Try something out; it's pretty much how everybody starts. Then it became clear that teams were just using company credit cards to start running actual workloads, and it was time to do something and intervene, and that's essentially when the Multi Cloud team was created. So we didn't have that much going on, but it allowed us to do things like set up accounts within the organization, centralized billing, and so on.

It is probably a good idea to briefly clarify what I mean by a cloud resource, because that's not necessarily obvious. To a certain extent it's a somewhat meaningless term, but you have to count something. And what we don't like is to just count VMs, because that misses a lot of the landscape. So a cloud resource is basically any virtual resource that can be created inside of a cloud account, a subscription, or a Google project. This includes storage, networking components, or essentially anything that you could express as infrastructure as code, or a managed service that you could create.

So in January 2019, we were at about 1.8 million. We already had the projection of doubling in size every year, and over the last two years that has indeed happened. We even exceeded it a little bit, and current growth rates are actually ahead of that this year. So scale is something that is on our mind the whole time. It's what I start my day with and what I go to bed with. Because if we don't do anything, if we don't prepare for this growth, and especially given that we typically plan two to three years ahead, I need to be prepared now for having security operations around potentially 30 million resources three years from now.
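As a rough illustration of the planning arithmetic described here, the sketch below compounds the January 2019 figure of about 1.8 million resources at the stated doubling rate. The function name and the four-year horizon are assumptions made for the example, not figures from the talk.

```python
def project_resources(start: float, years: int, factor: float = 2.0) -> list[float]:
    """Return projected resource counts for year 0..years, compounding by `factor`."""
    return [start * factor ** year for year in range(years + 1)]


if __name__ == "__main__":
    # 1.8 million resources in January 2019, doubling every year (figures from the talk).
    for offset, count in enumerate(project_resources(1_800_000, 4)):
        print(f"January {2019 + offset}: ~{count / 1e6:.1f} million resources")
```

Four doublings from 1.8 million land at 28.8 million, which is in line with the "potentially 30 million" planning figure mentioned above.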

So this is something that constantly dominates everything that we do. The team is entirely based around infrastructure as code, trying to bring those practices and evangelize them through the organization in order to make sure that other teams also realize that we can't scale in public cloud just on the basis of people power. We went through this initial growth wave before I joined the team. There was a program that was helping to accelerate deployments into public cloud. I will admit that the focus was to accelerate people into public cloud and not necessarily to provide a secure and compliant framework.

We did already have the policies and procedures defined by the SAP Global Security organization, though. This is an organization of about 800 people in SAP that's responsible for a wide variety of security functions, including setting policies and compliance frameworks, security incident response, threat detection, a whole bunch of policy, contractual and regulatory-related items, and how you drive a security culture through the company. We were already working with these guys up to a certain point, but I think we both agree in hindsight that the relationship wasn't necessarily super mature.

These specific policies and hardening procedures sit in the context of SAP's security framework. They try to extract common security requirements across the landscape. So if you think of requirements for ISO or SOC, or CIS benchmarks, or NIST, or any such frameworks, what we try to do is abstract those out a little bit so that a good security baseline exists. Then when you go through regulatory requirements or certification audits, you're very likely to pass those. So think of these as good security practices and policies that match much of what you are probably familiar with.

Like, ensure you have encryption at rest, encryption in transit, use MFA. It's a whole series of those types of policies and procedures. They also exist for each individual cloud provider, so we have individual ones for AWS, Azure, GCP and Ali Cloud. And one of the problems that we had was that these were maybe a little bit too well hidden. You had to know where they were. They did have the strength of corporate policy already, so you had to follow them, but we didn't have a whole lot of follow-up capabilities.

One of the reasons was the CSPM tool we deployed at the time, our first. I remember this was in 2018 or so when it was first deployed. The market was not particularly mature. All of the players providing cloud native security tooling were either very new or looking to port security tooling from the data center into cloud, where it didn't necessarily apply. We picked the first one; I'm not going to talk about who that is, because that's not particularly relevant. But the problem we had is that we were already deployed in GCP.

We knew that we had to provide coverage for Ali Cloud, but we didn't have resources deployed there yet. And this tool only provided coverage for AWS and Azure at the time. What we did do is implement the alerts according to the SGS policies and hardening procedures so that we could at least detect to what extent teams were complying with those policies, and we made that available to the individual lines of business and developer teams to consult, which, as we'll see, they didn't do very much. That was pretty much where I joined the team, and it was an interesting moment in my life.

I had been doing IoT and operational technology security for SAP, mostly looking at the manufacturing and industrial side of the house. An old buddy of mine who used to run the Co-Innovation Center had just been named Head of Multi Cloud operations, and he gave me a call and said, "Would you like to run network and security operations for me?" I had been in a variety of security roles, but those were mostly advisory, so I was always telling somebody what to do, or recommending. This was a chance to say, OK, let's go do something, this could be quite interesting. And he sketched for me a little bit of what the situation was.

But in hindsight, maybe that was only part of the story. The interview rounds were very interesting because it was a team, already then and even more so now, of a lot of fairly young cloud native people who were really brought into the organization because of their cloud native skills, whereas SAP as a whole is perhaps not quite there yet. So I was more familiar with the pace of the old organization. And I remember in one of my early weeks, we talked on a Tuesday about how maybe it would be good if we had a checkbook somewhere in the picture so that we got something very practical.

And on the Friday, somebody gave me a call and said, "OK, let me show you." And I thought, he's going to show me a design concept or something like that. And he's like, "What do you think?" And I'm like, "Oh, that's a real web page. OK. All right, when is this going to production?" "When you say yes. Right now." That was not quite the way of working that I was used to. This is a really amazing group of DevOps people who do things really fast and understand the landscape really well, and I'm privileged to work with them.

But there weren't that many of them, and they were doing heroic work. And we had leadership that was a mix of cloud native people as well as some people more used to SAP, veteran leaders. You can imagine that sometimes there's a bit of a clash where you get the really corporate types and then the very cloudy types. When I walked up to our floor the first time, I still had the culture of SAP. So I put on a shirt with a collar, and I got to the Multi Cloud floor and it was all hoodies, and even my global vice president was wearing a hoodie. So I was like, OK, I'm cool. I'm hopped on.

Finally, I can be relaxed at work, because that's what I'm comfortable with. And because of the cloud ops and security aspect, I thought maybe I should wear two hoodies. What also happened, obviously, given my responsibility for security operations, was that I started looking. OK, let's look at our CSPM tool, what's the lay of the land? And it wasn't that good. It wasn't that good. We were still in that growth phase where it was more important to move into cloud as fast as possible. There was an attempt to not do the extremely stupid things, but the situation definitely needed some attention, and we were very lucky.

I put a slide deck together to let my own organization know what was going on and that it needed some attention. An interesting part of it is that I asked the team if they had tried to communicate that before, and they had, but they had been asked to put the information into a fairly rigid corporate template for quarterly updates. And the way it had ultimately come out, maybe it hadn't quite grabbed the attention it should have. I have a background in analytics, so I'm a big fan of trying to visualize problems with charts, and I put that deck together and started socializing it a little bit.

And then we were very lucky, because in July 2019 we had new global security leadership in place. Tim McKnight, who is now our Chief Security Officer, had been brought on a couple of months before, and brought in as his number two was Richard Puckett. At one point, they actually both went up a level, so that Tim McKnight is now Chief Security Officer and Richard Puckett is the Chief Information Security Officer. When we met, they were doing the rounds. They are based in the Philadelphia area but came over to Palo Alto to meet the team, and I was there at a meeting with them.

They were still in the middle of the introduction when I passed over my deck, and Tim said to Richard, "I know that we had an agenda, but rather than talk about some overall strategy, let's talk about this slide deck over here," and passed it over. So at that moment, I knew that we had at least gotten a little bit of attention for the state of the landscape. And that was literally right on time, because three weeks later, as people in North America will maybe remember, in August of 2019, there was a lot of noise around the Capital One breach that exposed the personal information, including credit card data, of 100 million Americans.

This was all credit card application type data, so it had all the private information that you could think of, and it also covered about six million Canadians. And the funny thing is that when a lot of the people in Infosec first heard about this, it was like, "Oh, another one," and they went home that day. But it had some really interesting features. This was not the usual attack where someone simply tries to steal stuff; there were some interesting, almost political motivations behind it that probably caused the media storm to be a bit higher than it otherwise would have been, and it led to some questions.

I also want to make really clear that the fact that this became a big thing inside of SAP has nothing to do with the team at Capital One. We actually know some people there, and they did a really good job. At the time they were no doubt far more sophisticated than we were, given the landscape. And if you look at the details of the incident, which I encourage you to do, this is not simple stuff, so let's be very clear about that. What I'm talking about here is not the incident that happened at Capital One itself, but the resonance of that incident inside of SAP.

Because it did lead to questions from customers, and it basically broke out of the normal security media information cycle into the executive board. And we were asked, or rather the Chief Security Officer was asked, "Tell me you don't have this problem?" And he was like, "Well, I'm not sure if I can really tell you that we don't." So this led to a series of discussions. They reached out to the Multi Cloud team, myself and my colleagues, and asked, what do we do about this? What do we do to get this situation rectified as soon as possible? And that became something we called internally "Project Evolve," with the initial goal to get control over the situation within about two months and get rid of all high severity policy violations in the landscape.

Now, let's be clear, that was not a realistic goal to begin with. First of all, we're always going to have some new resources get deployed every day. And not everything is so tied down that teams are limited to very restricted landscapes. So we're always going to have some violations, and the time scale was perhaps the most ambitious part. I think both Tim and Richard were relatively new to the organization. They didn't quite understand the size and scale of it and how fast it moves.

Plus, this was work that came on top of the work that we were already doing, because we were also deploying a second CSPM tool. The first one, as I said, didn't provide GCP support. They also did not intend to provide Ali Cloud support, and it ultimately got replaced by a different tool that was onboarded mostly during Q3 of 2019, which at least gave us visibility into GCP. It was hard to remediate GCP landscapes if you didn't even know what kind of alerts you had. The monitoring of Ali Cloud then started in early 2021, if I remember correctly.

From the start, we had hoped for greater visibility. It was deployed as a separate instance because, at the time, we were the largest landscape that they had in view. We had some initial teething problems, but that's normal when you onboard that many accounts. I think we were at almost 6,000 or 7,000 accounts across AWS, Azure, and GCP globally at the time. And in 2019, I think we were at just under or around three million of these resources. But we weren't making very quick progress, and we felt powerless.

When we first put "Project Evolve" together, that was on the basis of a couple of short and medium term recommendations. The short term one was: OK, let's start driving down through the organization, find the account owners, call them to their responsibility, and make them fix their landscapes. That proved to be actually quite difficult. What we hadn't counted on, and this happens when you grow very quickly, is this: you create a record of who owns the cloud account, but if you don't follow up, then after a while, maybe through changes in the organization, people move to different teams and some of that account ownership gets a little bit diluted.

If you then call them up and say, "Hey, we have some issues with policy violations in your cloud account," and they tell you, "Oh, but I left that team," now you have a problem. Who do we go to now? So the problem got a little bit bigger because it got attention. That makes sense; this is fairly normal in security. If you don't monitor anything, then it's very easy to believe that you're good. By the time you get more visibility, you first find out that the problem is a lot bigger than you thought it was, and then you get ready to tackle it.

So as part of the initial recommendation in August, we had also added some medium term recommendations for how we could potentially provide more guardrails in the landscape: do things like central logging, use organizational policies, and improve our scanning capabilities. So we held a workshop in November 2019 for something that ultimately became Secure Cloud Delivery, where we brought in the public cloud partners. Sorry if I say hyperscaler; that's a very SAP-specific term for public cloud providers. But we brought in AWS, GCP and Azure.

We brought in our partners from SAP Global Security and a bunch of people from the Multi Cloud team. So this was about how we were going to plan and tackle the larger problem of not just saying, hey, we have alerts, but actually, how do we follow up. We had started a while before with office hours, a call every week where we could provide guidance and weekly follow-up, but there had to be something more. And one of the sobering moments came in the introductory round, the "here we are, and this is what we bring to the discussion" part.

One of our public cloud providers made this statement: "Our defaults are set for ease of onboarding, not security." That takes your breath away, but it's also 100% true, and it makes economic sense. It's not something that we can blame these guys for; their business is to provide infrastructure as a service. They want to make using it as easy as possible. If you first have to read five manuals and unlock 15 locks before you can run anything, then you'll probably run to the competition. This is reality, unfortunately.

So there is, of course, the shared responsibility model for public cloud, and unfortunately, not nearly enough people read that. I personally like the one from AWS because it's actually really clever and makes a very clear distinction of what are you responsible for and what are they responsible for. And the way I always express it internally to our colleagues is that everything that you create in the cloud account is your responsibility. So if you thought that they were going to do it for you, no, you actually have to configure that correctly. And how do we do that?

This is late 2019, and we're aware of what's going on in the DevOps community and the Infosec community, where it's clear that we've very much lost the battle by putting security at the tail end. We need to do this at the very beginning, as early as possible, in the hands of the developers, and not just at the quality gates at the end or as remediation after deployment. We need to make sure that this is ingrained, deeply ingrained, in everything that we do when we produce code. And as a factory producing software, this is critical for an organization like SAP.

As I think most of this audience will know, fixing things in cloud environments is an endless and exhausting job. Patching in the cloud seems silly if you can simply recreate your deployment correctly with infrastructure as code. What we really wanted to do is enable teams along this journey and provide tooling where possible to assist them in that. So how can we make sure that we give them compliance scans as early as possible, so that they find policy violations in their development environments? And we can even encourage a behavior where you create your infrastructure in public cloud.

You scan your landscape, you see if it's OK or not. If it's not, destroy it and then repeat the cycle. So use compliance as code for your landscapes. We have had some other pieces around that, which we'll get to as well. But when it comes to shifting left, that was really the key part that we wanted to stress. Also, any tooling that we want to put in place in public cloud, whether security tooling or otherwise, has to be scalable. So anything that we do needs to be able to accommodate potentially 3x annual growth and not break so that we have to change to something else. It also has to be deployable.
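The create, scan, destroy, repeat cycle described here can be sketched as a small loop. This is a minimal illustration only: `create`, `scan`, and `destroy` are hypothetical callables standing in for whatever infrastructure-as-code tooling and compliance scanner a team actually uses, and the retry limit is an assumption.

```python
from typing import Callable, List


def provision_until_compliant(
    create: Callable[[], str],
    scan: Callable[[str], List[str]],
    destroy: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Create an environment, scan it, and destroy it if the scan finds violations.

    Returns the identifier of the first environment that scans clean.
    In practice the infrastructure code is fixed between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        env_id = create()
        violations = scan(env_id)
        if not violations:
            return env_id
        print(f"attempt {attempt}: {len(violations)} violation(s), destroying {env_id}")
        destroy(env_id)
    raise RuntimeError(f"no compliant environment after {max_attempts} attempts")


if __name__ == "__main__":
    # Stub tooling: the first environment fails one check, the second is clean.
    created: List[str] = []
    findings = {"env-1": ["storage bucket not encrypted at rest"], "env-2": []}

    def create_stub() -> str:
        created.append(f"env-{len(created) + 1}")
        return created[-1]

    result = provision_until_compliant(create_stub, findings.__getitem__, lambda env: None)
    print(f"compliant environment: {result}")
```

The point of the loop is that a non-compliant environment is thrown away and rebuilt from corrected code rather than patched in place.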

One of the nice things about having organizational roles in a public cloud landscape is that you can deploy tooling that's cloud native without having to bother anybody; you just apply it to the landscape. We've done that with CSPM tools. We do that with Chef InSpec as well: through organizational roles, it is able to scan any cloud account that is inside of the SAP organization, on whichever cloud provider. And it has to be affordable. That sounds a bit silly, but we sometimes get from vendors, "You're big. You make a lot of money, so why don't you give me a lot of money?" If we do that with every vendor, then very soon we're not a very big company anymore.

It's like, how do the rich get rich? Mostly by not spending anything and watching very carefully where it goes. Realize also that this has only become sharper in the months since then through the COVID pandemic, where every dollar, or euro in our case, is flipped over five times to see if it's really necessary. So you have to make sure that you build something or deploy something that's sustainable and ideally doesn't grow in cost with the size of the landscape. That is one of the big problems in public cloud, which is natural: the more you use of it, the more you pay. But if you also buy tooling, ideally you don't want that tooling to become more costly with how successful you are, because then it becomes essentially just a percentage tax on everything that you do.

So, interestingly enough, during this workshop, one of the consultants from a public cloud provider, who had actually been at Chef before, said, "Have you considered Chef InSpec?" And the funny thing is that we actually had. In the summer, we had looked at it as part of a set of evaluations that we do constantly to see what's out in the landscape. We had looked at it and thought, OK, that's a lot of work, and we don't really have a lot of people on the team. But now it was suddenly brought up by somebody who knew the struggles that we were working through and at least understood the size and scale involved. If you work for a public cloud provider and provide us with consulting, I assume that you know what you're talking about when you talk about big scale; we're small compared to those providers themselves.

And it did make a lot of sense to us, though we didn't have a lot of people, so it would certainly require a people investment to get that going. But it would fit very well because it would be compliance as code, fitting entirely within the workflows of infrastructure as code, CI/CD pipelines, and so on. So the cloud native members of the team certainly were very excited about it. And we said, OK, what about scalability, how does this work at the size of our landscape? Tim Jones, who together with Joe McCrea on my team has a separate session talking about the technical stuff, very confidently said, "We'll containerize it, it should scale."

And I was like, "OK, you're the cloud native expert. I have never containerized anything myself yet, so are you sure?" "Yeah, pretty sure. It'll just scale. No worries." All right, good. Since I trust Tim blindly, I never even considered that he could be wrong. We then decided, as the outcome of that workshop, that we would come up with something that we called secure by default cloud accounts. To some extent, we are still on that journey, but the idea was to set up preventative controls. These would be baseline controls, set up as organizational policies on each landscape, that enforced MFA, enforced encryption at rest, and ensured that TLS is version 1.2 or higher.

They also ensured that logging is turned on at all levels of granularity, and so on: a whole series of those that we now have in place. Initially, I think we had a list of about 12 or so that we wanted to implement. And then for the detective controls, we thought that was where Chef InSpec would play a role, especially on the shift left developer enablement side, aligned with those preventative controls. So at this point, we definitely thought that this would be complementary to the CSPM tool that we were literally still deploying.
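To make the baseline concrete, here is a minimal sketch of a check over the four controls named above (MFA, encryption at rest, TLS 1.2 or higher, logging). The `ResourceConfig` fields and the function names are hypothetical and chosen for illustration; they do not reflect SAP's actual policy definitions or any cloud provider's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ResourceConfig:
    """Hypothetical, simplified view of one resource's security settings."""
    name: str
    mfa_enforced: bool
    encrypted_at_rest: bool
    min_tls_version: str  # for example "1.2"
    logging_enabled: bool


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.2' into (1, 2) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def baseline_violations(cfg: ResourceConfig) -> List[str]:
    """Return the baseline controls that this configuration violates."""
    violations = []
    if not cfg.mfa_enforced:
        violations.append("MFA not enforced")
    if not cfg.encrypted_at_rest:
        violations.append("encryption at rest disabled")
    if parse_version(cfg.min_tls_version) < (1, 2):
        violations.append("TLS below 1.2 allowed")
    if not cfg.logging_enabled:
        violations.append("logging disabled")
    return violations


if __name__ == "__main__":
    legacy = ResourceConfig("legacy-db", True, False, "1.0", True)
    print(legacy.name, baseline_violations(legacy))
```

A preventative control would reject such a configuration before it is created, while a detective control would report it after the fact; the check itself can be the same.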

We also wanted to build reference architecture guides for each platform that translated the policy into what you actually had to do on that platform. So the policy says, don't expose admin ports to the internet, or don't expose database ports to the internet, and the architecture guides actually explain, OK, this is what that means within a GCP, Azure, or Ali Cloud context. We also wanted infrastructure as code building blocks: known-to-be-secure Terraform modules that could be cobbled together through a vars file to create secure by default landscapes. All of that was in close alignment with the SAP Global Security policy team, who were fully part of this planning exercise.

We wanted to make sure that all of it was aligned. We've gotten to the point now where the labels of the policy are implemented in the alert, so that there's full traceability all the way down, including into the architecture guides and the preventative controls. So if you hit something, you know exactly which policy it falls under and which detective control is associated with it. You can fix it with that set of Terraform modules, or with the set of remediations that you could execute, preferably not through the admin console unless you really have to. That was really the idea.
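The traceability described here, from policy label through detective control to architecture guide and remediation module, can be pictured as a simple lookup. Every identifier in this sketch (policy labels, control IDs, guide sections, module paths) is invented for illustration and does not correspond to SAP's real naming.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PolicyTrace:
    """One policy requirement and everything linked to it (all names hypothetical)."""
    policy_label: str
    detective_control: str
    architecture_guide: str
    terraform_module: str


TRACES = [
    PolicyTrace(
        policy_label="POL-NET-01 no admin ports exposed to the internet",
        detective_control="net-01-admin-ports",
        architecture_guide="network guide, section on ingress rules",
        terraform_module="modules/secure-network",
    ),
    PolicyTrace(
        policy_label="POL-ENC-02 encryption at rest",
        detective_control="enc-02-storage-encryption",
        architecture_guide="storage guide, section on encryption",
        terraform_module="modules/secure-storage",
    ),
]

# Index by detective control so an alert can be traced back to its policy.
BY_CONTROL: Dict[str, PolicyTrace] = {t.detective_control: t for t in TRACES}


def trace_for_alert(control_id: str) -> Optional[PolicyTrace]:
    """Return the policy, guide, and remediation module behind an alert, if known."""
    return BY_CONTROL.get(control_id)


if __name__ == "__main__":
    hit = trace_for_alert("enc-02-storage-encryption")
    if hit is not None:
        print(f"{hit.policy_label} -> fix with {hit.terraform_module}")
```

Keeping one shared label across policy, alert, guide, and module is what lets a team go from a finding straight to the fix.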

So that led to our first meeting with the Chef InSpec team in Belfast in February 2020. By then, we had got some level of support for our proposal. It had actually been submitted before the end of the year, saying to Tim McKnight, this is what we think we should do, and it was rejected. And we were like, it's rejected, what are you talking about? They said, please resubmit at the beginning of the year; we're starting a bunch of security initiatives and we want you to actually be a part of that structure. So we became Secure Cloud Delivery, a board-sponsored program, one of five major security initiatives, and actually 17 in total, that were monitored on a weekly basis by the executive board. So now, the pressure was definitely on.

This was our last business trip before the pandemic. We have a team in Belfast, and the Chef team met us there at the fantastic Titanic Hotel, over tea and biscuits. So it was very civilized. Very good. There were Matt, Mark, Colin and Anna from Chef, most of whom have been supporting us along the way ever since. We talked about the scale that we had in the landscape. We mentioned the number of cloud accounts we had, and they asked us about targets.

We weren't particularly familiar with the terminology. Like, what do you mean by targets? What's a target? Well, anything that you might possibly scan. So it's like, OK, well, in that case, we're probably talking about three million, 3.5 million, at least four. Just to get started, we probably need about 4.5 million. To which the reaction was like, "How many targets!?" And our reply was like, "Yeah, we get this a lot."

It's not the first time that we speak to a vendor and tell them the size of the landscape. And with some of them, especially during the COVID time when you have a lot of this on video, you can actually see the reaction. One time, the guy tried to make a note, he tried to write, and he actually lost his pen. Another one literally flinched, and we told him, "Look, it's OK. We get this all the time." This is quite normal, but we found out later that at the time, the top end of the volume rate card ran out at 100,000.

So when you come in with 3.5 million, that's a little bit of a different conversation. And maybe this should have been an alarm bell for me that we were embarking on something really special. But Tim was so confident, and he never doubted at any point in time. It was like, hey, we're all good, this is going to be a good adventure for all involved. So we got our initial budget approval, and we actually managed to get a bunch of headcount into the team to help us build this all out. We set up the organizational support structures.

This is actually, when it comes to this security compliance aspect, I always stress this. It's not enough to have a tool, you need to have the entire support behind it. You need to be able to distribute the alerts in an appropriate way. We have centralized monitoring and it's bringing the organization. And you need to make it something that the organization cares about, not just the security compliance people care about. And the more you can make that visible and the more you can help teams with that, the better it is. We've been running our office hours now every week.

We run them in the morning Pacific time, so that we cover Europe and one side of Asia, and then again in the evening for the US West Coast and APJ, to be a central point every week where teams can ask questions about anything relating to security and compliance in the landscape. That has proven very valuable, and we have a distribution list of some 350 people and regular attendance of 50 to 80, which is remarkable for a weekly call that people don't have to join and don't get punished for if they don't.

We started to develop our first controls and designed what the Kubernetes Chef InSpec cluster would look like. And we deployed the first development landscape in GCP on GKE. And that brings us literally to July 2020. Even when I put the slides together, I went back and verified: was that really our first release? That happened four months after we first talked to Chef, but it was actually true. We had our first preventive controls rolled out, and those were specifically addressing things like encryption in transit as well as centralized logging.

That's, by the way, a really nice one: you can do it with organizational policies, it's very easy to do, and it's very effective. If you have centralized logging, you can basically set an organizational policy that says, log to this storage bucket. And every log from every cloud account goes into the same place, goes into storage that can go straight into our SIEM landscape. It's one of the easiest things you can do, and one of the most effective things you can do from a security perspective in public cloud when it comes to the physical-- sorry, sidetracked.

The detective controls were, of course, the big one for Chef InSpec. So we had our first development cluster, which is the way we run global scans. And then we published our first version of the consumer container. This one is deliberately designed to be run wherever the line of business wants to run it. So install it on your laptop, give it your credentials, and it will go off, scan that cloud account, and give you a report back in JSON, HTML, and another format that I forgot.
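To make the JSON report mentioned here a little more concrete, the following is a minimal Python sketch of how a team might summarize such a report in a pipeline step. It assumes the general shape of Chef InSpec's JSON reporter output (profiles containing controls, each with a list of results that carry a status); the control IDs, titles, and the `summarize_report` helper are hypothetical and are not taken from SAP's actual container.

```python
import json
from collections import Counter

# Hypothetical sample in the general shape of an InSpec JSON report.
SAMPLE_REPORT = json.loads("""
{
  "profiles": [
    {
      "name": "example-cloud-baseline",
      "controls": [
        {"id": "example-encryption-in-transit", "title": "TLS is enforced",
         "results": [{"status": "passed"}, {"status": "passed"}]},
        {"id": "example-central-logging", "title": "Logs go to the central bucket",
         "results": [{"status": "passed"}, {"status": "failed"}]}
      ]
    }
  ]
}
""")


def summarize_report(report: dict) -> Counter:
    """Count controls by outcome: a control fails if any of its results failed."""
    outcomes = Counter()
    for profile in report.get("profiles", []):
        for control in profile.get("controls", []):
            statuses = [r.get("status") for r in control.get("results", [])]
            if "failed" in statuses:
                outcomes["failed"] += 1
            elif statuses and all(s == "skipped" for s in statuses):
                outcomes["skipped"] += 1
            elif statuses:
                outcomes["passed"] += 1
            else:
                outcomes["no_results"] += 1
    return outcomes


if __name__ == "__main__":
    summary = summarize_report(SAMPLE_REPORT)
    print(dict(summary))
```

A pipeline could fail the build whenever the `failed` count is non-zero, which matches the idea of running the scan as often as needed before a formal security validation.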

Again, watch Tim's and Joe's session. Because it's just a Docker container that runs anywhere Docker runs. So you can plug it into your pipeline, you can plug it into any compliance scans. And we're now getting to a point where you actually have to submit a good Chef InSpec report as part of a security validation to make sure that your landscape passes. And of course, what we want is for you to run it as many times as you want, so that before you get to security validation and any other security or quality gates, you are already in good shape because you did this as part of your initial development.

We advise teams: when you start with this, don't build in one development environment. Get a cloud account too, just to trash and create infrastructure so that you don't bother your colleagues, and create, destroy, create, destroy as many times as you need to in order to come up with a compliant landscape that can now house your machine images. It's a much better approach. We also released the first version of the architecture guides and building blocks for each of those landscapes. And we got very close with the SGS policy team.

This is really cool, because in actually materializing the policy, you find little things that are easily missed when you write a paragraph here and then several blocks down you write something else. So the policy actually got better. And any misalignment in there got resolved because we were actually implementing it. We also had a couple of cases where you could really see a dialogue between the intent of what the security team was trying to achieve and how practical that actually was on each of the landscapes.

I don't have a case offhand, but it happened again and again and again. They'd say, OK, we'd like to achieve this goal, and they'd write a sentence. And we'd say, OK, that's nice, but this is really hard to do in this landscape. If we do this or the other, or consider this a valid configuration, we can very easily detect it or can actually enforce it on an organizational level. So the policy got better with the implementation, to the point where it became much more practical.

And even as a policy, it became easier to read what was actually intended with the policy itself. So that whole alignment between the policies and the controls themselves has been super helpful. It allowed us to find things that we otherwise wouldn't have found. When anomalies happened in the compliance scans, we could figure out exactly how that happened because of that. That's very valuable.

Meanwhile, I'm sad to say we had more CSPM tool challenges. We made progress on the basis of the alerts that we got. And to be frank, a lot of the challenges that we had, I think we would have run into with anyone else that was based on a SaaS service. I don't think this is unique to this particular vendor. One of the problems was, again, scalability. It was almost predictable, since we were going to be the largest instance. But things happen. We're talking as software developers: when stuff scales, things happen that you don't know about.

So I'm not necessarily surprised when we run a tool and suddenly it scales up to levels that they've not seen in the landscape, or that they could never even think of load testing before. You're going to run into surprises, and it took us a while before we could get data out of the system at all. So alerts were ingested and we needed to get them out. That eventually worked through an API call and later on also through the UI. But when you have to distribute alerts out and drive an organization, and you can't get data out of your tool, that is tough.

Another thing that we ran into: I'm totally accepting of the fact that tools have false positives. That happens. Security tooling, almost all security tooling, is not that good, and it always requires a little bit of follow-up, and you try to have as few of them as possible. But if for every false positive that gets reported you have to go back to a vendor, this becomes a multi-week exercise.

It's nobody's fault, but that is simply the reality. If meanwhile the entire executive board is sitting on top of the lines of business and saying, why didn't you fix that yet? Why didn't you make progress from last week? And you have a feeble answer. Like, yeah, we've been talking to the vendor for a week or two. They're still investigating. It's not a good story under that kind of pressure. It also leads to a lack of credibility within the organization when that happens. At some point, you've got a list of three or four that are under review or fix, and you have to go through the alerts and say, yeah, for this one there's a qualification here, a qualification there.

That's really difficult to explain to executives, especially when you're already talking about a fairly complex security topic that they have trouble fully grasping. And then another thing, and this is actually what we see with all security tooling now, is that context is everything. And the bigger the landscape grew, the more important that became. We are now at some 12,000 active cloud accounts.

If we can't distribute that out in a meaningful way, you're just shouting into the desert, asking some random group of people to do something. So everything has to be associated with who is ultimately responsible for it. And with essentially all security tooling, we export the data and map it to our multi-cloud database.

This is a database that has ownership information for each of the cloud accounts, but more importantly, it also includes a cost center that is verified daily against the core SAP system. That tells us the organizational hierarchy that the cost center belongs to. That goes eight levels deep. Not everybody makes use of that, but many teams do. So you can go down all the way into the team, create a pivot table, and actually navigate the whole thing down.

And the importance of this cannot be overstated. It is even more important to do that association in an organization that is in constant flux. If you have to maintain an organizational structure like SAP's in your security tool, then you have a mountain of work every time the organization changes. And that will happen at least twice a year. There are new acquisitions coming in, so you would have way too much work organizing that stuff within the tool.

So it's much easier to handle that through a database: associate the alerts for that particular cloud account with the metadata of that database. And the structure reorganizes itself on the basis of the back-end SAP system that translates the cost center information. It's maybe a little bit geeky, but this is a massive operational benefit when you're dealing with a size like this.
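The association described here, joining each alert to an account's owner and cost center and then to an organizational path, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the account IDs, owners, cost centers, field names, and the `enrich` helper are all hypothetical, and the real multi-cloud database and its schema are not described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountRecord:
    """Hypothetical ownership record for one cloud account."""
    owner: str
    cost_center: str


# Hypothetical lookup tables standing in for the multi-cloud database
# and the cost center hierarchy that is synced from the back-end system.
ACCOUNTS = {
    "acct-001": AccountRecord(owner="team-payments", cost_center="CC1001"),
    "acct-002": AccountRecord(owner="team-analytics", cost_center="CC2002"),
}
ORG_PATHS = {
    "CC1001": ("Board Area A", "Unit Finance", "Team Payments"),
    "CC2002": ("Board Area B", "Unit Data", "Team Analytics"),
}


def enrich(alert: dict, accounts: dict, org_paths: dict) -> dict:
    """Attach owner, cost center, and org path to an alert, if known."""
    record = accounts.get(alert["account_id"])
    if record is None:
        # Unknown accounts are surfaced explicitly rather than dropped.
        return {**alert, "owner": None, "cost_center": None, "org_path": ()}
    return {
        **alert,
        "owner": record.owner,
        "cost_center": record.cost_center,
        "org_path": org_paths.get(record.cost_center, ()),
    }


if __name__ == "__main__":
    alerts = [
        {"account_id": "acct-001", "control": "example-open-port"},
        {"account_id": "acct-999", "control": "example-no-mfa"},
    ]
    for enriched in (enrich(a, ACCOUNTS, ORG_PATHS) for a in alerts):
        print(enriched)
```

Because the organizational path is looked up at enrichment time rather than stored in the security tool, a reorganization only changes the lookup tables, which is the operational benefit the passage describes.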

So the point of why I was spending extra time on this is that what suits us best really is an alert engine. We don't need the interface of a tool. We are probably going to export it anyway. Get it into the SIEM, distribute it out, add additional organizational context to it. And I'm still not aware of any security tool, and we talked to a lot of vendors, that allows an easy integration into a custom system to associate additional context with the information.

Yeah, you can do it in SIEM systems, because you have tables in there. But you do a lot of the work by yourself. What suits us best is an alert engine, and that brought us to a very interesting question. As you recall, Chef was initially thought of more as complementary to this CSPM tool, and associated with the preventive controls. But it's now October 2020. We are sort of midway through our three-year contract with the previous vendor.

By now we had heard that if we didn't change the contract terms, and with the growth of our landscape, our renewal at the end would be at a number that frankly made it very easy, because with a number like that you can immediately say no. You don't have to think about it. So now we had a problem. What do we do, and can Chef in fact replace our existing CSPM solution?

Now there would be some consequences to that, but let me first run through a couple of benefits. First of all, really just economic: we're already licensed for it. So no additional cost. Keep in mind that we are now half a year into the COVID pandemic. SAP, like everybody else, is looking at whether every budget expense is really necessary.

So if we could do this with Chef InSpec, that would actually save the company money. Make use of something that we already acquired, maximize the investment, and so on. By now also, because of the growth that we had in the team, we went like, wait, we actually have the resources now to even consider this, whereas a year ago it was out of the question.

And we have time until the end of the contract for the current CSPM tool, which is Q1 2022. We have, essentially, a year to get it right, and a quarter of slippage if it doesn't work out. And it's like, OK, that sounds reasonable.

A very important thing-- and why we talked about the false positives before-- is that now we would own this ourselves. We would write the rules. If there's something wrong with the rules, either we find it in our own testing, or a line of business comes to us. We have the authority to scan their landscape. We can sit with them and say, what do you have in your landscape? Run that-- because, of course, our vendor doesn't have that same level of access, right?

So with a vendor it becomes: grab information, make screenshots of an admin screen, pass this along, and they need to build a new environment. We can just test directly on the team's canary and development landscape. It makes for massive speed in dealing with any such issues.

Also, what's very nice-- and we don't take this lightly-- is the benefits of-- let me mention this one first. Sorry. A reduction in the operational ticketing load. This sounds maybe weird, but we would onboard users into the CSPM tool. So a ticket would be created-- hey, I'd like to have access. OK, good. Accounts need to be placed in certain folders. Hey, can you do that?

If we do this entirely through our self-organizing multi-cloud database that already has ownership associated with it, we don't have that problem at all. You just get onboarded. You get your data handed to you. Off you go. That makes a massive difference in reducing the load on the team from just handling tickets.

And another really important thing-- and this ties back to our multi-cloud notion-- is that we are constantly at risk, I would say, for our executive board to decide that we're going to deploy another cloud platform. I'm not saying that that's coming. Don't take this as any hints. But it can happen at any point in time. And we have to be prepared for that.

Ali Cloud we were able to convince some vendors to support, of course also because they saw some benefit in it. But what if somebody picks something that is strategic or whatever, and the vendor says, we already have a contract with you, so you can't make it a condition of sale, and we have no intention to provide this because we don't see a market for it? Then we would have to find a solution.

Chef InSpec being open source, we could actually fix this problem. And we actually solved the Ali Cloud problem together. By working together on the first Ali Cloud provider that is now available and we continue to contribute to, Chef InSpec has coverage for compliance scanning in Ali Cloud. It's fantastic. We're happy that it exists, that we can do it. If another weird one comes that we have to support, we have the ability to do that again. Meanwhile, the community gets the benefit of being able to run compliance scans on other cloud landscapes.

And this extends beyond that, because we have made a couple of additional changes on the AWS one, I know explicitly, and on a couple of others. Because we ourselves were interested in providing the control, we were able to add to the API back end, implementing new additions that AWS had just added to its cloud API. So that's super good for us, super good for all of you. And we're happy to continue that as we build out more and more of our use cases.

At the same time, the reality is that I answer to a lot of people in this company. And nobody had done this before. The size was a bit insane. And if you look for CSPM solutions, I still don't think you see Chef InSpec in the top 10 there, when it probably should be by now.

What we also found, and that was just simply the reality and what we see with other vendors too, is that Chef did not have experience with this scale. Neither did we. So this was definitely a bit of a gamble.

And just to add to the pressure, I mean, we're talking about SAP's compliance requirements. This is a function that is required for certification: ISO, SOC 2, all kinds of regulatory things, [INAUDIBLE]. There's a whole bunch of them. If we get this wrong, we're in really big trouble, putting a whole lot of people into trouble. So there are some reputations on the line, most notably my own.

So it's definitely like, OK, what could get us to the point where we say we are confident enough that we can pull this one off? The first thing, of course, given the scalability question, is that we had to do scalability testing.

This was planned for December 2020. The team got ready, deployed on Kubernetes, all set. And it went spectacularly well, to the point of being uneventful. We had 7 million cloud resources scanned during the run, three nodes in GKE, three hours to scan the entire landscape across four cloud providers.

Just by pushing it, to see if we could, we pushed it to 280 running containers. But it's really more like 150 in normal operations. And that at a cloud cost that is projected here, but that proved to be less than that, which is very nice, if that's all it takes. We even found that we could keep a whole bunch of history within that price point. So that was very good.

I heard about the results of this after it happened. It was done on a weekend to minimize anything that could possibly happen to any of our cloud landscapes. They are 24/7, of course. But the reality is that most of these are business software, so they have a bit of a Monday-to-Friday thing. So let's see what happens. Push this up. And everything good. So I'm very happy with the results.

And then Tim tells me, you know that Chef had a team on stand-by. I was like, stand-by? Did they think that anything could go wrong? He was like, yeah, there are 20 people waiting for it. I was like, 20 people? Now I'm not so sure whether I wanted to know this in advance or after, because that sounds like something could happen.

He actually told me that the reason he found out was-- it was set up. But he got a call saying, hey, Tim, did you postpone the testing? He was like, oh, sorry. No, no. We did the testing. It was all good. Thank you very much. Oh, OK. Let the team go then. What? Yeah, we're all good. Went well. It was a really good result.

And on the leeway that we have: the reality is that we run scans on a daily basis for the global compliance, especially since the teams can run it any time they want. So they don't have to wait for us. So we have a three-hour window that we could extend, essentially, to 12. By then, we'd probably start to get a little worried. But we can also always throw more nodes at it at the same time. And because it is very much running on a schedule that creates workloads, once it's done, it dies down to almost nothing. So it's very busy for a couple of hours, and then it drops down.
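The headroom argument here is simple arithmetic, and a short Python sketch makes it explicit. It takes the test figures from the talk (7 million resources, three nodes, three hours) as a baseline and assumes, purely for illustration, that throughput scales linearly with node count; that linearity is an assumption and was not stated in the talk. The function names and the 14 million figure in the usage example are hypothetical.

```python
import math
from fractions import Fraction

# Baseline taken from the scalability test described in the talk.
BASELINE_RESOURCES = 7_000_000
BASELINE_NODES = 3
BASELINE_HOURS = 3

# Resources scanned per node per hour, kept exact with Fraction.
# ASSUMPTION: throughput scales linearly with the number of nodes.
RATE_PER_NODE_HOUR = Fraction(BASELINE_RESOURCES, BASELINE_NODES * BASELINE_HOURS)


def hours_needed(resources: int, nodes: int) -> Fraction:
    """Estimated scan duration in hours for a landscape of a given size."""
    return Fraction(resources) / (RATE_PER_NODE_HOUR * nodes)


def nodes_needed(resources: int, window_hours: int) -> int:
    """Smallest node count that finishes within the given window."""
    return math.ceil(Fraction(resources) / (RATE_PER_NODE_HOUR * window_hours))


if __name__ == "__main__":
    doubled = 2 * BASELINE_RESOURCES  # hypothetical: the landscape doubles
    print(float(hours_needed(doubled, BASELINE_NODES)))  # hours on three nodes
    print(nodes_needed(doubled, 3))  # nodes to stay inside a three-hour window
```

Under these assumptions a doubled landscape still fits comfortably inside a 12-hour window on the same three nodes, or inside the original three-hour window with more nodes, which is the trade-off the speaker outlines.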

So that led us, basically, to, OK, can Chef InSpec replace our existing CSPM solution? Well, at least on the aspect of scalability, it absolutely can. So that made us feel better. And it's still humming along nicely without any trouble whatsoever.

So as a result, 2021 really became a conversion year, to build out the entire rule set, which is about 150 or so controls. But ask Tim and Joe. They will keep you straight.

We've had two control releases since then, with a lot of this already working and in use by teams in the company. And we actually recently found the first false negative in the current CSPM tool, after something unfortunate happened in a production landscape that hadn't been detected in the canary landscape. We ran a Chef scan against it. And Chef did find the policy violation there, which is yet another milestone, really.

And that brings us, really, to today, ChefConf 2021, telling you all about this crazy journey that actually worked out. We're now scanning 40 million cloud resources per daily run across 12,000 cloud accounts. We're setting up deployments specifically for China now, because that ultimately seems to make more sense than running it from outside.

And I have to thank this fantastic team, both on the multi-cloud side and on the global security side, who were directly responsible for making this all happen. It was certainly not me. This was a good team of people.

And there are, of course, way more people in a supporting role that should have been listed. But then the fonts would get real small. That also goes for the Chef InSpec team that has been with us along the way, at least 20 or 25 people that we've touched in some way or other. So it's been a fantastic success.

We are due to release the full high and medium control sets by the end of this quarter. And then we should go live at full conversion at the end of 2021. If anything unexpected happens, then we have another quarter or so to get it right. But this quarter, we'll be really focused on how we get the analytics right and how we get the API access right. There's a bunch of use cases like that, which maybe Tim and Joe can also talk about.

And that brings me to the very end of the story. And I realize I've gone a little bit over time. But it's--

No, that's great. It's a great story, Jay. Thank you very much. There are some questions, though, that I'd like to ask, and maybe you can answer them. You mentioned compliance, and some of the requirements that you had.

So what sort of compliance audit requirements do you have? And how has this work changed your overall audit process, before and after implementing Chef InSpec?

So it kind of comes as a larger package. It's really that the policies drive the detective scans, the preventive controls, as well as the developer tooling that we provide. So the linkage of all of these in the package sort of multiplies the whole. Where I think the Chef piece really comes in is both in the efficiency of our own team, in not having to deal with the issues that you might have with a SaaS provider, and especially in the type of behaviors that you start encouraging.

It was not accidental that we wanted to provide a consumer container. That was really the focus from the start. And doing that in Docker means that, OK, you, developer, can run it from your laptop. But you can also run it inside of your infrastructure. You can run it inside of your pipeline. You can run it, essentially, wherever you want, because we have a wide variety of different pipelines, tooling, et cetera, that people use. So you want to be as flexible as possible.

And combining that with organizational follow-up has been very powerful. For instance, being able to say that in combination with the preventive controls that we have-- we saw this especially with ports exposed to the internet that shouldn't be, like 22, admin ports, RDP, as well as managed database services, like MongoDB or something like that. And being able to detect that and tell people that the preventive control is coming-- because it was released in Q1 this year. So we announced already in Q4 2020: these controls are coming, here is the Chef container, scan your environment for it, go fix it.

If you think you fixed it, run it again. Don't ask me or wait a week for a report to come out. Run it again, keep doing it, and if you have any questions, come ask us. That way you prepare an organization for these preventive controls, so that once they're implemented, first of all, people have dealt with the legacy that's already there, but they also have the Chef container in hand to prepare themselves when they build new landscapes, so that they avoid running into controls that also remediate certain things in their landscapes in a somewhat forceful way.

Nobody wants you to be auto-remediated. We want you to get it right from the first time around.
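The kind of check described in this answer, finding sensitive ports that are open to the whole internet, can be illustrated with a small Python sketch. This is not one of SAP's InSpec controls (those are written in InSpec's Ruby-based language and are not shown in the talk); the rule format, group names, and the `exposed_ports` helper are hypothetical. The port numbers are the standard defaults for SSH, RDP, and MongoDB.

```python
# Default ports for the services mentioned: SSH, RDP, and MongoDB.
SENSITIVE_PORTS = {22: "SSH", 3389: "RDP", 27017: "MongoDB"}

# CIDR ranges that mean "anyone on the internet" for IPv4 and IPv6.
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}

# Hypothetical, simplified firewall rules; real cloud APIs differ per provider.
SAMPLE_RULES = [
    {"group": "sg-web", "cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    {"group": "sg-admin", "cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
    {"group": "sg-db", "cidr": "10.0.0.0/8", "from_port": 27017, "to_port": 27017},
    {"group": "sg-wide", "cidr": "::/0", "from_port": 0, "to_port": 65535},
]


def exposed_ports(rules):
    """Return sorted (group, port, service) tuples for internet-exposed sensitive ports."""
    findings = []
    for rule in rules:
        if rule["cidr"] not in OPEN_CIDRS:
            continue  # restricted source range, not internet-facing
        for port, service in SENSITIVE_PORTS.items():
            if rule["from_port"] <= port <= rule["to_port"]:
                findings.append((rule["group"], port, service))
    return sorted(findings)


if __name__ == "__main__":
    for group, port, service in exposed_ports(SAMPLE_RULES):
        print(f"{group}: {service} port {port} is open to the internet")
```

A team could rerun a check like this after each fix, which mirrors the "run it again, keep doing it" workflow described above.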

Did you have any audit-- security audits that have been shortened or decreased with this new kind of approach that you're taking?

Yes. But realize that we're not entirely fully live yet. So one of the aspects that is definitely an important aspect is that, yes, we provide the developer tooling for everybody. But we also have the central compliance scan that cannot be tampered with.

So our global Chef cluster scans the entire landscape. Like, nobody touches that. It's essentially a scheduled--

Because it's a single source of truth.

Single source of truth, scans the entire landscape. Either you have an exception or you don't have an exception. If we don't know about it, it doesn't count. We collect all of that. Essentially, it goes to the SIEM landscape, goes into reporting, and generates analytics, dashboards, presentations, and executive updates throughout the organization.

And if somebody in the organization says, oh, yeah, but I really have only this, because those ones don't count, it's like, well, that's not what my record says. And in case of dispute, it definitely happens that our board-area COOs will come and check with us, like, Tim says they don't need to do these ones, what do you think? And sometimes it's, OK, yeah, that's actually a thing. But in other cases it's like, no, get out of here. [INAUDIBLE].

Another topic. You mentioned building the team from SAP, and, I guess, in advance of this. But the skill set that you had, the teams may have had to learn a lot about Chef and InSpec in particular. But how did your team kind of build their InSpec skills?

That was actually fairly simple. We kind of hire for talent rather than necessarily existing skills. So we look for people that hopefully have something cloud-native, hopefully know how to script, or develop something well, or know DevOps or ops very well. Hopefully, you have two of those. If you have three of those, fantastic. If you have four, then I probably can't afford you.

But you get that kind of sense, where we are looking for skills, talent, willingness to learn. And the more we were able to define what we were doing-- because we already had something in place. So our initial recruiting was probably a bit more of a gamble, because we had to explain we're trying to do something that is kind of difficult to articulate, and I don't have anything for you to read.

But more recently, in the last six to nine months after that initial release, you can say, you know, this is what we've done, and talk about it. And then people have a better idea of what it is that we're looking for. Because we're looking for someone interested in that intersection. And I want to make a distinction between SecDevOps and DevSecOps.

DevSecOps, which I absolutely believe in, is a slightly different thing: that's doing DevOps, or building applications, in a secure way. We do SecDevOps, which is SecOps with DevOps baked in. And of course, there is a lot of overlap. But it has to be understood that the role is still a SecOps role, and not like, let's build a cool new product together, because the cool new product that we build is SecOps-related compliance scanning for the entire company.

Did the team have to do anything special to accelerate the ramp-up time with Chef InSpec?

Honestly, I would say no. It's probably better to ask Tim and Joe, because they had to do it themselves.

OK, fair enough.

I think there was like-- we had an initial five-day consulting engagement that got most of the team onboarded. I don't think many had much Ruby experience, but the syntax is fairly simple to figure out. And it's also sort of "InSpec Ruby," because there's a sort of limited set of things that you're going to use. And they've taken to it very quickly.

I've never heard from somebody like, oh, I wish it was in, say, Python to make it easier. It works really well.

Yeah, one last question. So you talked quite extensively about SAP's journey: the impressive discovery of the issues that you didn't know you had, and how you explored Chef and worked together to overcome that and to find out the scalability level of everything. So with that journey in mind, what are some of the top recommendations you have for companies that are at the beginning of this journey or starting out? What is the good, the bad, and the watch it, the what--

Realize that the basics are hard. And it's still about the basics. I think there's a little bit too much of this, oh, let's go get this fancy tool that does AI/ML network monitoring and sends alerts to your cell phone or something like that. Just realize that asset management is hard, and that getting an organization to fully adopt MFA on the cloud accounts is hard and requires work.

A lot of this is organizational. Tools certainly help. But a lot of this stuff you just need to get done: get buy-in, get detection, get enforcement done. Policies without actually verifying that people are following them make no sense. Just reporting that somebody did something wrong is by itself also not enough. So you need to build structures within the organization that follow up and create accountability and responsibility.

We all love the idea of DevOps teams having more autonomy. Fine. But stick to the rules as well. And that's the balance that we have to figure out. Our hope is to make the secure way simpler and easier. So rather than teams figuring it out themselves and maybe getting it wrong, we'd like to give them tools and say, hey, would you like to save a hundred hours' worth of infrastructure work on a platform that you don't really know well, and that you're going to have to do again when you move to a different cloud provider?

Why don't you use our toolset and save yourself a whole bunch of time? And if our scans say that something's wrong, you either change it, or you can come to us and ask, what's wrong, because I used your stuff. That accelerates developers, of course, fantastically.
