Updated: Apr 22, 2022
The three ways of DevOps are:
Flow.
Feedback.
Experimentation & Learning.
To receive the promised value of DevOps, rather than just go through the motions without getting that value, you need to incorporate all three deeply into your organization’s culture, but the inertial forces and ossification of the status quo make this very difficult.
Let’s focus on Flow as an example. The concept of Flow is borrowed from the Lean/Kanban Development community, which is informed by The Theory of Constraints among other inspirations. There is more to it than this, but Gene Kim summed up a key aspect as, “Any improvements made anywhere besides the bottleneck are an illusion.”
It’s similar to the thought that “a chain is only as strong as the weakest link”. Think of the processes that deliver value for your organization, like the production of software, as a “value stream”. There are various steps in that value stream, but only one is the primary bottleneck. If you make efficiency or effectiveness improvements in a step other than that one, it won’t matter because work will get backed up or its effectiveness is limited by the bottleneck, just as strengthening one link in a chain won’t make the chain any stronger unless the link you strengthen was previously the weakest one. On the other hand, once you improve the bottleneck (or strengthen the weakest link), the bottleneck moves to somewhere else (or some other link becomes the weakest).
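To make the bottleneck idea concrete, here is a minimal sketch that models a value stream as a set of serial steps, each with a weekly capacity. The step names and numbers are made up for illustration. The only thing the sketch assumes is that a serial stream can deliver no more than its slowest step.

```python
def throughput(capacities):
    """Throughput of a serial value stream is capped by its slowest step."""
    return min(capacities.values())


def bottleneck(capacities):
    """Name of the step with the lowest capacity."""
    return min(capacities, key=capacities.get)


# Hypothetical steps and work items per week, for illustration only.
stream = {"code": 40, "review": 25, "security_fix": 5, "deploy": 30}

baseline = throughput(stream)

# Doubling review capacity changes nothing: security_fix still limits the stream.
faster_review = throughput({**stream, "review": 50})

# Improving the bottleneck raises throughput and moves the bottleneck elsewhere.
faster_fix = {**stream, "security_fix": 35}
improved = throughput(faster_fix)
new_bottleneck = bottleneck(faster_fix)
```

With these made-up numbers, improving review leaves throughput unchanged, while improving the security fix step raises it until review becomes the new limit.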
A Comcast Story
While at Comcast, I questioned the impact of a particular security leader’s work because none of his team’s efforts were on the obvious bottleneck (which I'll highlight as an example below). His response was, “We’re doing good things,” but the goodness of those improvements was, as Gene Kim would say, "an illusion".
When we drilled down a bit, it became clear that he and all but a few of his staff had no experience in the bottleneck area and felt that it was out of his team’s domain of responsibility. Another leader might have found a way to step out of that constraint, but you generally don’t get to a mid-level specialist manager position in an organization the size of Comcast by stepping out of your lane. We needed to give him and his team a path forward, which was the Transformation Blueprint concept that is the general subject of this blog.
The happy ending to this story is that the leader's team ended up providing the largest and perhaps the most effective group of federated coaches contributing to the Transformation Blueprint-like program (aka Greenhouse) at Comcast.
The Bottleneck is Resolving Findings, NOT Creating Them
Let’s explore the typical bottleneck for application security. In an admittedly unscientific survey I did on LinkedIn, respondents indicated that rapidly resolving vulnerabilities rather than finding more of them is the bottleneck by a factor of 2.5 to 1. Years of anecdotal evidence and various studies support this as well. According to Veracode’s State of Application Security report in 2021, it takes an average of 171 days to resolve a finding. Why, then, are we so enamored of new technology that increases our ability to create more “findings”? Why do we focus on spreading a footprint of such tools before giving much thought to how we rapidly resolve the findings from them? If you think that creating more findings is “doing a good thing”, you are dead wrong. Your duty requires that you focus on the bottleneck. It’s better to surface and rapidly resolve fewer findings, especially if they are the more critical ones. Never forget that you get net-negative value out of application security tool findings... until you rapidly resolve them.
How Do You Shift Focus to the Bottleneck of Resolving Tool Findings?
Assuming you are still with me, this fundamentally changes both the criteria for choosing application security tools as well as how to roll them out. The two primary criteria for tool and process are:
False positives
The rapidity of the feedback (which is the DevOps second way, but I digress)
1. False Positives
False positives waste time that could be spent resolving. As many as 80% of the findings from SAST tools are false positives. Cutting that to less than 10%, as is the case for the SAST tool made by my employer, Contrast Security, and no others that I know of, would give you almost an order of magnitude more time to spend on resolving true positives and thus an order of magnitude improvement at the bottleneck, and thus, your entire value stream. Cutting it down to near zero after that, as is the case for Contrast’s IAST solution, gives you roughly another order of magnitude improvement.
False positives kill you in another way, namely in the psychological effects that either build or destroy trust between security and development. In the mind of a developer, any time of theirs that you waste is magnified 2x, 5x, maybe even 10x from the perspective of trust. If 8 out of 10 findings from the tool you choose, or even 3 out of 10, are a waste of time, they will resist and might even look for ways to hide projects from you, turn it off for already onboarded projects, or reduce its use to only when you explicitly require it. At the very least, you’ve put security in a position where development wants to do the bare minimum for you to go away.
2. Rapidity of Feedback
Now, let’s talk about the second criterion, rapidity of feedback. This is affected by both:
Tool Speed. If the tool takes longer than 5 minutes to produce results, the team will likely resist putting it in their pipeline and letting it block the PR merge, which is key if the process is to achieve the DevOps first way of "Flow". I can't resist another shameless plug here for my employer, Contrast Security. Our SAST tool, Contrast Scan, is an order of magnitude faster than the competition, and our IAST solution, Contrast Assess, provides feedback even faster, essentially in real time with other testing. Now back to your regularly scheduled blog post...
Where in the SDLC the Feedback is Generated. After a bunch of experimentation and analysis on various places to plug security tools into the SDLC, I've settled on the belief that the best place, by a good margin, is the pull request (PR) from a single/few developer branch (usually called a "feature branch") to the next higher level branch, which could be the master/main branch but is often a staging or QA branch. The check should also be configured to block the PR merge if the findings are out of the team's current working agreements (aka "policy"). The rest of this blog post is essentially a discussion of why and how for this idea.
Triaging and SLAs Are Harmful
If your current process is to triage tool findings before putting them in the development team’s product backlog, you are killing the rapidity of feedback and DevOps “Flow”. You may have no choice because you chose a tool whose findings include more than say 10% false positives, but if you currently triage tool output before bothering the development team, follow along with me for a minute.
Let’s say it takes you 2 weeks to get around to triaging. So, +2 weeks here.
Then it goes into a product backlog. The product owner for the team then must decide to put it into a sprint. That only occurs on a two-week cadence, so that adds another 1 week on average in the best case but is typically much longer. If you provide the team with an SLA, that makes it worse, because they won’t schedule it until that SLA time is up and you’ve pinged them about it being over the limit. So, +1 week minimum and likely several more.
Add another +2 weeks for the length of the sprint before that vulnerability is fixed in production.
By triaging, you’ve added a ton of labor to your security team’s plate and delayed resolution by a minimum of 5 weeks. You are exposed for that entire time because the team released that code before you even triaged findings from it.
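The arithmetic behind that 5-week minimum can be written out as a small worked example. The figures are the ones from the walkthrough above; the function and variable names are just labels chosen for this sketch.

```python
# Delay in weeks added by each stage, using the figures from the walkthrough.
TRIAGE_WAIT = 2                    # time before security gets around to triaging
SPRINT_LENGTH = 2                  # two-week sprint cadence
BACKLOG_WAIT = SPRINT_LENGTH / 2   # best-case average wait for the next sprint


def minimum_delay_weeks(triage=TRIAGE_WAIT, backlog=BACKLOG_WAIT, sprint=SPRINT_LENGTH):
    """Best-case weeks between a finding being created and fixed in production."""
    return triage + backlog + sprint


total = minimum_delay_weeks()
print(f"{total:g} weeks, about {total * 7:g} days of exposure in the best case")
```

An SLA or a busy backlog only makes the middle term larger, so this total is a floor rather than an estimate.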
It gets even worse. You’ve also made it much harder on the development team. The code is long out of their head. The developer who pulls the card to fix the vulnerability might not even be the one who originally wrote it. This adds 2x-10x to the developer effort to resolve the finding. The lesson here is that if you can’t resolve a finding the same day that it’s found, you will never be able to resolve findings in a reasonable amount of time, because the cost of doing so increases steadily every day after the finding is found.
Resolve Findings the Same Day That They Are Found
Resolving findings the same day they are found!?!? That’s crazy talk. If we currently average 171 days, how are we ever going to do less than 1 day? Here’s how:
Use only tools with very low false positives like Contrast Assess and/or Contrast Scan.
Don’t triage findings. Feed them directly back to the development team.
With the excess labor the security team gained from stopping triaging, think of every false positive as a major problem that you need to fix and jump on it. A good first step is to turn off the rule that caused the false positive, but then figure out how to eliminate the false positive so you can turn the rule back on for true positives that it might find. Report it as a bug to the vendor. If you have a tool that lets you modify the rule yourself, like Checkmarx or CodeQL, do so. GitHub may even pay you for improvements to a CodeQL rule. Note, most false positives come from an unknown sanitizer. Even tools like Contrast that don’t allow you to modify the rules often allow you to add to the source, sanitizer, and sink lists.
Turn the policy dial down for each team until the team commits to fixing the few that are over that low policy dial in the very next sprint. I start teams out with one of their applications, with one tool category (usually SCA since I think those fixes have the best bang for the buck in terms of risk reduction) and with the policy dial set to only fail for critical SCA findings.
Help the team to install a policy check in the pull request as a branch protection status check but leave the status check as not “required” for now. When the sprint is over and the scan is clean at that policy setting, switch that status check to “required” meaning it will block the merge for any findings over the threshold after that point.
From this point forward, findings at this admittedly low policy dial, but also the most critical findings, have a median time to resolve (MTTR) of less than 1 day.
Now help the team install a second branch protection status check in not “required” mode for the next increment of the policy dial. Coach the team to keep incrementing the policy dial up like this until they’ve reached the point where the juice is no longer worth the squeeze, which will vary by circumstances.
Here is the policy progression that I usually recommend:
Criticals for vulns in the code you import (aka SCA, 3rd party, OSS) findings
Highs for SCA findings
Criticals for vulnerabilities in the code you wrote (aka IAST/SAST, 1st party)
Highs for IAST/SAST findings
Coach them to move on to another application that they own. You might even have them interleave policy dial increments across all actively developed applications.
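The policy dial and the pull request status check described in the steps above can be sketched in a few lines. This is a minimal illustration, not how any particular tool implements it: the `Finding` class, the category labels, and the function names are all hypothetical, and a real check would read findings from your scanner and report the result to your source control platform.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

# The progression recommended above: each dial step adds one more blocking rule.
POLICY_DIAL = [
    ("SCA", "critical"),
    ("SCA", "high"),
    ("SAST/IAST", "critical"),
    ("SAST/IAST", "high"),
]


@dataclass(frozen=True)
class Finding:
    category: str   # "SCA" or "SAST/IAST" (hypothetical labels)
    severity: str   # one of the SEVERITY_RANK keys
    title: str


def blocking_findings(findings, dial_level):
    """Findings that violate the policy at the given dial level (1-based)."""
    thresholds = {}
    for category, severity in POLICY_DIAL[:dial_level]:
        rank = SEVERITY_RANK[severity]
        # A later step lowers the threshold for its category (critical -> high).
        thresholds[category] = min(rank, thresholds.get(category, rank))
    return [
        f for f in findings
        if f.category in thresholds
        and SEVERITY_RANK[f.severity] >= thresholds[f.category]
    ]


def status_check(findings, dial_level):
    """'failure' blocks the merge once the check is marked required."""
    return "failure" if blocking_findings(findings, dial_level) else "success"


# Made-up findings for illustration.
findings = [
    Finding("SCA", "high", "outdated logging library"),
    Finding("SAST/IAST", "critical", "SQL injection in search handler"),
]

level_1 = status_check(findings, 1)   # only critical SCA findings block
level_2 = status_check(findings, 2)   # high SCA findings now block too
```

Running the next dial level as a second, not-yet-required check shows the team what would fail before the dial is actually turned up.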
What does the security group do now?
Now your security team is out of the vulnerability management business. Instead, they are “coaching” and "toolsmithing" these things:
The policy dial setting for any given team and application. What’s the current setting? If that setting is below the point where the juice is worth the squeeze, how long has the team stalled at that policy dial setting? Maybe you jump in to help them get to the next policy dial increment.
The MTTR for finding resolutions below that policy dial setting. Anything over 1 day is a red flag that you should investigate.
False positives. The security group's job is to tweak, change, upgrade, etc. the tool until it no longer generates false positives like this but still finds the intended true positives.
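Tracking the MTTR red flag takes very little code. The sketch below assumes each finding is recorded as a pair of timestamps (found, resolved), which is a hypothetical data shape rather than any tool's export format. It also ignores findings that are still open, a simplification you would want to revisit for long-open findings.

```python
from datetime import datetime, timedelta
from statistics import median

RED_FLAG_HOURS = 24  # the "anything over 1 day" threshold from the text


def mttr_hours(findings):
    """Median hours from found to resolved; open findings are ignored."""
    durations = [
        (resolved - found).total_seconds() / 3600
        for found, resolved in findings
        if resolved is not None
    ]
    return median(durations) if durations else None


def is_red_flag(hours):
    """True when the median exceeds one day and a coach should take a look."""
    return hours is not None and hours > RED_FLAG_HOURS


# Made-up data for one team, for illustration only.
t0 = datetime(2022, 4, 1, 9, 0)
team_findings = [
    (t0, t0 + timedelta(hours=3)),
    (t0, t0 + timedelta(hours=6)),
    (t0, t0 + timedelta(hours=30)),
    (t0, None),  # still open
]

team_mttr = mttr_hours(team_findings)
needs_attention = is_red_flag(team_mttr)
```

Using the median rather than the mean keeps one slow fix from hiding an otherwise healthy team, or one fast fix from hiding a struggling one.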
I created two new roles at Comcast to do the above. The first two activities were the purview of my "Coaching" staff. I hired former Scrum Masters with no security background into this role. The third activity above was fulfilled by my "Pipeline Engineering" staff. I hired developers, again often with no security background, into this role. They had to be more comfortable creating their own CI/CD plugins than configuring an existing plugin via the UI. Both roles had other responsibilities, but that will be the subject of other blog posts.
I've heard just about every objection imaginable to what I've written above. I'll try to pre-address the common ones here.
Objection 1: Going team by team and application by application doesn't scale
In fact, the opposite is true. It scales a lot better than what folks are attempting to do now. Here's the math.
The coaching role was measured by how effective they were at getting teams to mature and how many teams they did this for. Their primary approach to coaching was a workshop that the coach facilitated for each two-pizza development team every 90 days. The first of these workshops was scheduled for 90 minutes; the follow-on ones, 60 minutes each. Call it 1.25 hours on average. A typical coach working at a leisurely pace of 12.5 hours per week can do 10 of these workshops each week. There are 13 weeks per 90 days, but considering time off, off-sites, and the like, let’s say they are only doing this 10 weeks per quarter. That means a coach can be responsible for 10 x 10 = 100 teams, still leaving almost 30 hours per week to do ad-hoc coaching work. One other person and I served as coach and pipeline engineering for the first 40 or so teams we added to the program, and I only needed 4 dedicated coaches and about a dozen federated coaches (who covered more like 30 teams each while doing other work) at Comcast to cover all 600 development teams.
Pipeline engineering was similarly efficient because after the first 10 or so teams, they had a handful of recipes that worked for the remaining 80% of teams. At no point did we have more than a half dozen folks working as pipeline engineers, and only 50% of their time was client facing. The rest of the time, they were doing software development, creating the CI plugins and the management and reporting applications we used to manage the program.
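Here is the coaching math as a worked example. The inputs come from the paragraph above, with two assumptions of mine: "about a dozen" federated coaches is taken as exactly 12, and a 40-hour work week is used to compute the time left over.

```python
AVG_WORKSHOP_HOURS = 1.25        # mix of 90-minute first and 60-minute follow-on workshops
WORKSHOP_HOURS_PER_WEEK = 12.5   # the "leisurely pace"
WORKING_WEEKS_PER_QUARTER = 10   # 13 weeks minus time off, off-sites, and the like
WORK_WEEK_HOURS = 40             # assumption, not stated in the text

workshops_per_week = WORKSHOP_HOURS_PER_WEEK / AVG_WORKSHOP_HOURS
teams_per_coach = workshops_per_week * WORKING_WEEKS_PER_QUARTER
adhoc_hours_per_week = WORK_WEEK_HOURS - WORKSHOP_HOURS_PER_WEEK

DEDICATED_COACHES = 4
FEDERATED_COACHES = 12           # "about a dozen", taken as 12 for this sketch
TEAMS_PER_FEDERATED_COACH = 30

capacity = (DEDICATED_COACHES * teams_per_coach
            + FEDERATED_COACHES * TEAMS_PER_FEDERATED_COACH)

print(f"{teams_per_coach:g} teams per dedicated coach, "
      f"{adhoc_hours_per_week:g} hours per week left for ad-hoc coaching")
print(f"Total capacity: {capacity:g} teams for 600 development teams")
```

Under these assumptions the staffing covers the 600 teams with room to spare.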
This was a labor savings of 4x compared to the cost of the traditional app sec services at Comcast which was largely vulnerability management, gatekeeping, and cajoling oriented. Also, teams who had fully adopted at least a few of the "essential 11 practices" had 1/6th as many vulnerabilities and incidents in production as teams that were doing none of the essential 11. That's my definition of a win-win! 1/4th the cost for 6x the improvement!
Objection 2: Our teams are not mature enough in DevOps to do this
We hit this wall at Comcast after about 125 of the 600 total teams, but we knew a year or so before then that we were going to hit it, so we prepared. We started working more aggressively with the Chief Software Architect's office to help him establish his own DevOps transformation effort focused largely on non-security DevOps practices, including consolidation around fewer CI/CD tools.
My DevSecOps team also stepped out of our security swim lane but remember our Coaches were Scrum Masters or Agile Coaches and our Pipeline Engineers were Software Developers, so they knew how to coach teams to high levels of maturity in basic software engineering before they even learned about high levels of application security maturity. My pipeline engineering team produced 6 of the top 10 most used CI/CD plugins (yes, we wrote the plugins from scratch) used by the chosen CI/CD platform, Concourse, and 2 of those had nothing to do with security.
We also started requiring that teams meet certain basic DevOps prerequisites before allowing them into the program. For instance, they had to have at least one CI pipeline in Concourse.
We had created a wave of viral adoption where more teams wanted into our program than we had resources to support for the last 3 years of my 5-year journey at Comcast. They wanted in because their sister teams told them how much easier and more effective it was than working with the legacy application security services, and joining our program got them largely out of dealing with that legacy.
Objection 3: That might work in your kumbaya world, but in my world, we just can't trust developers with the security of the software they are developing
This is the toughest one to get security folks to believe, but the truth is that you are much better off helping them to mature to the point where they are worthy of that trust than you are trying to continue to bolt-on and after-the-fact inspect-in security.
This reminds me of my past leading agile transformations and giving talks where I declared that a throw-it-over-the-wall approach to QA was going to be replaced by moving that responsibility to the development team. I would get swarmed by the QA folks in the audience as soon as I got off stage, telling me that they could never trust their developers. If you fast forward ten years to now, there are very few siloed QA departments and no more Chief Quality Officers. It's almost all done by the development team today, which is often expanded to include some folks who primarily do QA automation engineering.
Similarly, the DevOps and cloud-native (derogatorily referred to as "shadow IT") movements disrupted traditional IT and Operations.
It's now Security's turn. It's not a matter of "if" but "when".
Objection 4: That sounds like a great future to strive for but we have some urgent fires and/or we need some quick wins before we get there
I'm reminded of the adage, "An ounce of prevention is worth a pound of cure." The most interesting effect of the program I've described here is that once a developer fixes a vulnerability that was found in the pull request that she just submitted to the pipeline, she never writes that kind of vulnerability again. Each vulnerability found becomes a learning opportunity, and that effect applies not just to that project or that team but to any future project or team she works on.
You must segregate both your thinking and your labor into urgent and important. It's incredibly hard to have the same people doing incident management also doing this sort of long term prevention work. The urgent always takes priority over the arguably more important but longer term work. At Comcast we addressed this by standing up my program as an experiment. Then, for the first few years, we didn't replace any of the traditional application services or approaches. We were simply an alternative that we were trying out for a small group of development teams. Only after we had the data from 150 teams who had been in the program to show how much more efficient and effective this approach is did we start to wind down those traditional services.
Also, when I first stood up my program, my security leader peers would invite me and my team to participate in emergency incident response work. I declined as politely as possible because I knew that getting sucked into that meant we couldn't focus on the longer-term prevention work.
Objection 5a: All of this automated testing and in particular, IAST, requires a completed "build" and a deploy to a QA environment which is after the pull request has been merged so how can failing checks like this be used to block the merge?
This is a surprisingly common misconception. It comes from the shifting concept of the word "build". Before the advent of CI systems, which occurred after most security leaders and even many engineering leaders stopped coding, the "build" was accomplished with a build script which also ran the unit tests. That's not how it's done today, so to clear this up, let's define "build" to be the minimum necessary to produce a runnable artifact.
You want the build to complete unless there are syntax errors that prevent it from producing a runnable artifact. Then you spin up several/many ephemeral CI worker instances (aka containers), each performing a different kind of check (style, unit test, test coverage, integration test, IAST, lint, quality, SAST, SCA, etc.). If a particular check requires a runnable artifact, the artifact is copied to that instance so the check can do its job. In reality, most modern CI configurations have multiple different build configs. You might have a different one to instrument the code for test coverage than for the build instrumented for IAST, and yet another one to test deployment, etc.
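The shape of that pipeline, build once and then fan out to independent checks before the merge, can be sketched as follows. Everything here is a stand-in: the `build` function and the check functions are hypothetical placeholders, and threads take the place of the ephemeral CI workers a real system would start.

```python
from concurrent.futures import ThreadPoolExecutor


def build(source):
    """Stand-in for the build: return a runnable artifact or fail on a syntax error."""
    compile(source, "<pr-branch>", "exec")   # raises SyntaxError if unbuildable
    return f"artifact-{len(source)}-bytes"


# Placeholder checks. A real check would run a tool in its own container.
def unit_tests(artifact):
    return True


def sca(artifact):
    return True


def iast(artifact):
    return bool(artifact)   # needs the runnable artifact to exercise


CHECKS = {"unit_tests": unit_tests, "sca": sca, "iast": iast}


def run_pipeline(source, checks=CHECKS):
    """Build once, run every check in parallel, and report whether the PR may merge."""
    artifact = build(source)
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(check, artifact) for name, check in checks.items()}
        results = {name: future.result() for name, future in futures.items()}
    return results, all(results.values())


results, mergeable = run_pipeline("print('hello')")

# A single failing check is enough to block the merge.
failing = {**CHECKS, "sast": lambda artifact: False}
_, mergeable_with_failure = run_pipeline("print('hello')", failing)
```

The point of the sketch is the ordering: the artifact exists before the merge, so checks that need it, IAST included, can gate the pull request.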
Objection 5b: But wait, it's very costly to create and maintain our QA environment. We have a sandboxed database with dummy data. We have mocked services. Etc. Who is going to do the engineering work so we can stand up a bunch of them instantly in a CI pipeline run?
In all but rare cases, most of this engineering work has already been done. Think about how a developer works. They typically write a few lines of code and then run a desktop check to see if it had the desired result. The development team has created an environment to allow them to do this and they've usually automated much, if not all, of the process of standing one up because we're constantly swapping people and developer computers in/out.
The key mindset shift is to push the developer desktop testing environment to the right into the pipeline rather than trying to pull the heavyweight QA environment left into the pipeline.