As modern military systems increasingly rely on software coding to achieve virtual effects, the question of how one knows whether these weapons work becomes more difficult to answer – at least when compared to the old physical testing that validated weapons systems.
Retired US Navy Rear Admiral Archer Macy talks to Peter Roberts about testing and evaluation, the pathological state of machines and our need for evidence.
Moderator: Professor Peter Roberts
Respondent: Rear Admiral (Retired) Archer Macy.
Moderator: Welcome to the Western Way of War. This is a weekly podcast that tries to understand the issues around how to fight and succeed against adversaries in the 2020s. I'm Peter Roberts, Director of Military Sciences at the Royal United Services Institute on Whitehall, and every week I talk to a guest about the Western Way of War. Has it been successful? Is it fit for task today? And how might it need to adapt in the future? The podcast is only possible because of the kind sponsorship of the good people at Raytheon UK, a subsidiary of Raytheon Technologies, a British company that creates jobs in England, Wales and Scotland, contributing over £700 million to the UK economy.
We continue to be bamboozled by senior political and military leaders telling us that another revolution in warfare is upon us. Having moved swiftly on from the apparent transformative impacts that information would have in dominating the battlefield, and with cyber losing its shine without ever having been able to deliver the much-promised superiority against adversaries, it is now Artificial Intelligence, Quantum Everything and Autonomous Whippets that will, we are told, make the next battlefield utterly different to anything we've ever known before. I've got to be honest with you, I'm a little bored by such promises.
When you join any military organisation, the soothsayers are already promising radical change. Whether the arrival of combustion engines, airpower, submarines, radar, sonar, telegraphy, cryptography, spoofing, chaff, missile technology, nuclear weapons or data, the modern age is replete with examples of over-hype and under-delivery. There is, of course, historical precedent for this. In his magnificent volume on The Evolution of Warfare from Ancient Times, John France narrates a similar path which was trodden not just by the Romans, but also by the Spartans, the Persians, the Mongols and indeed the Goths. From horse warriors to gunpowder, some arrivals do alter warfare, but some do so more than others. The question this episode poses is why that is. How do you know what you are building and what you're going to get? How do you know in what ways we're going to get what we want? How do you prove it? How will you know that everything will work when conflict breaks out and you use these tools for, sometimes, the first time? How do you build that trust between operator and system? What delivers something useful rather than a sinkhole for cash?
To tackle this from a, well, practical perspective rather than the usual theoretical lens, I need a guest who was inculcated in the technical and tactical aspects of warfighting. Someone grounded in reality, probably with some cynicism about the hype from experience of previous projects. We need someone who can look at the claims of the futurists with some practical, technical scepticism. So, roll in former US Navy Rear Admiral Arch Macy, now a Senior Associate at CSIS in Washington DC, the world's number one defence think tank, and Senior Fellow at the Center for Naval Analyses in Arlington, Virginia. Arch has a long association with RUSI, and specifically our Ballistic Missile Defence conferences, where he's been a regular guest and speaker. He's a consultant on national defence and homeland security issues, particularly in the areas of integrated air and missile defence strategy, systems and programmes. Arch's naval career was as a Surface Warfare Officer on multiple ships. He commanded a destroyer, and he's deployed to all the major operational theatres. Ashore, he's served on the Navy Staff as well as the Programme Offices for cruise missiles, the Aegis Combat System, the Naval Area TBMD Programme and for Integrated Warfare Systems, and he commanded the Naval Surface Warfare Centres. So there are few people more qualified to comment on the realities of new technologies and, importantly, why they fail to live up to their expectations. But before we get into that detail, we need to situate Arch into our broader conversation. So, Rear Admiral, retired, Arch Macy, US Navy, what does the Western Way of War mean to you?
Arch Macy: I've reviewed the stated purpose which you had brought up already about whether the Western Way of War is fit for task and how it might need to adapt in the future. Many participants, far more studied than I in the breadth and depth of these questions have commented on them, so I thought I might take a different approach to talk about a related area which I believe is not considered enough in discussions of the evolving Western Way of War, and particularly the technologies and opportunities that are expected to enable it. As you noted in the introduction, my background has included a number of positions related to the development and fielding of combat systems, sensors, weapons and related warfighting technologies. In the course of these duties, I've been involved with the testing and evaluation of various systems. It's pretty safe to say that the more complex or capable the warfighting system, the more difficult it is to demonstrate that it is fit for task and fit for purpose, and to do so in ways which are quantifiable and examine the boundaries of the system's performance capabilities and potential deficiencies. So, the title of my discussion, as you mentioned, would be, 'How do you know?'
The point here comes from the fact that there is a great deal of discussion of the future of war in the Western approach: the massive data collection, AI, automated command and control and the like. So, my question is, how will we know that what we build will do what we want, in the ways in which we want? How will the developers and the warfighters have mutual confidence in what will happen when they 'turn the key' and a conflict begins? In particular, I think that many underestimate the difficulty of demonstrating and affirming that what we've built in some future combat system is effective, and will also conform to the rules of behaviour that we desire. To use General David Petraeus' phrase, that the system will 'exhibit genuine governance under the laws of armed conflict.'
The 'how' part of technically supported or enabled automated warfare is often neglected in discussions. It's too often left to the developers and the engineers. But users, commanders and operators need to be involved. Often how a capability is implemented is as important as how well it is implemented, and whether it will do what we want, when we want, and most importantly how we want. And, if we can resolve that to our satisfaction, how will we know that we really did so? In other words, how do we test and demonstrate the system to see what it will really do, rather than what we thought it would do?
Most of my experience in system development and fielding has been in the area of air defence, so I will draw on that to extend the discussion into more detail. I think we all remember that in air and missile defence there are two historical examples where tragic engagements occurred involving combat systems that had some partially automated functions designed to improve their overall effectiveness, but which failed in the sense that they did in fact engage and destroy friendly aircraft. These were the 1988 Aegis shoot-down of Iran Air 655, and the Patriot engagements of friendly fighters in 2003. Both of these air defence systems were pretty much state-of-the-art technology at their time, and both had been through full test and evaluation programmes before their fielding. Yet these tragedies occurred anyway. While there were human errors involved, and I note that humans are part of the system, there are also technical characteristics of these systems which contributed to the overall result and which were not fully understood by the operators. As we go to more automation, we have to acknowledge the increasing complexity of the systems, mainly implemented in software, and the challenge of finding out if they have any bugs before another tragedy occurs. I think we can safely predict that the complexity of the envisioned combat systems of the future will exceed that of the 1988 Aegis system and the 2003 Patriot system by almost unimaginable degrees.
The question for the developer and the user remains how to demonstrate to the commanders that the system will perform as desired, while it has likely increased autonomy of decision and action, and that it will perform in accordance with the rules of engagement and the laws of warfare. As I said before, we also observed that a combat system is not just machinery, devices and computers, but also the humans who interact with it. Thus, the way in which the humans interact is just as vital a part of the design as is what the other elements do, and the characteristics of those interactions will have an effect on the outcome. Decisions for actions will depend to a significant degree on what data is collected and parsed, in what fashion, and then provided to a decision node, human or machine, for possible action. The characteristics, good and bad, of this man-machine interface mean that having humans in the loop for advanced systems will not necessarily obviate the risks of bad outcomes due to bad choices. How do we design and test to examine a complex warfare system not only for its combat capability, but also for its resilience to bad data and software errors, adherence to rules of engagement and the law of warfare, and predictability of acceptable behaviour? We can do this with humans when we watch them train and as they gain experience. But how do we do it with increasingly autonomous non-human or partially human systems?
I don't think it's unreasonable to expect that the fully enabled artificial intelligence air defence systems that many envision will have capability, complexity and autonomy that far exceed even today's most advanced Aegis or Patriot combat systems. So, the question becomes, in what way will we be able to examine and assess those systems prior to their being fielded and used in warfare, to ensure not only that they do what we need them to do, but also that they will not do what we don't wish them to do? Ultimately, the question, to extend remarks by Dr Paddy Walker in an earlier podcast, is how to manage the incremental introduction of unsupervised methods. I believe that we need to accelerate much more discussion on how to answer this challenge, in parallel with developing those systems and with deciding how we expect the future Western Way of War to be conducted. And with that, I leave it to Peter to pull the threads.
Moderator: That was phenomenal. I mean, there was so much in there. But I want to go to this big question first. You and I, in our Naval careers, you know, and when we're looking at the military as a whole, we've seen technology be introduced before, we've heard about the hype that's been associated with everything from precision-guided munitions through to net-centric warfare, and then there's cyber and information and data and so on and so forth. And we see all of these things coming through. Now, if you go back to something like, I don't know, Net-centric warfare, right? So, you know, we almost knew that it was never going to live up to the hype. You know, this perfect situational awareness, this removal of the fog of war that we've talked about for generations, and we knew that was never going to be delivered but we invested a huge amount of intellectual effort and money in order to make this happen. We sort of rejected the idea over a period of a decade. But today we've ended up with a massive improvement on what we had before, and in some ways, the investment in NCW, Net-centric Warfare, got us there, right? So, there's no doubt that technology has a role in getting us to advance how we conduct warfare, the problem is that the hype never lives up to the reality, right? Do we have programmes where we can look back and say, 'Yes, that delivered,' and that delivered in a way that actually exceeded expectations? Or do we have just a record of military development that never ever delivers what we hope it will?
Arch Macy: I think there are a couple of factors there. The first is that we learn as we go, so what we thought Net-centric warfare was going to be in 1990 was not what we thought it was going to be in 2000 or 2010. The reason being that we learned. We learned what we could do, we learned that we could do more than we thought, and we learned there were things that we couldn't do. I think that's true of most systems. I am a student of Rear Admiral Wayne Meyer, known as the 'Father of Aegis', the man who created, built and governed the Aegis program for years. Now, I have to admit, and probably I'm biased on this, but I would submit that the Aegis program is one of the few examples that really, in the end, did live up to its billing and exceeded it. Originally designed to take on Soviet Backfire bomber regiments almost single-handedly in the North Atlantic, it is now capable of intercepting intercontinental ballistic missiles. But that took time. I think the Admiral would have said that he didn't deliver as quickly as he'd hoped, but when he did deliver, it worked.
The reason I use that example is, one, as I said, I'm an air defender and I would submit that Aegis is about the most capable air defence system in the world, so it's a good example to study for how this happens. But the other reason is to bring up Admiral Meyer's dictum of how he did it, which was to build a little, test a little, and learn a lot. He always had goals for what the ultimate capability would be, but those were written in terms of warfighting capability and not miles per hour or furlongs per fortnight, because he said we need to learn how to do this. Now, the other thing that he and his team did was that, as they developed, they did do testing, and a part of that was learning how to test this system. So build a little, test a little, learn a lot is, I think, a guidance that any program should follow.
Now, when you go to Congress or you go to Parliament and you say you want a great deal of money to do something, you have to be able to give them the why. And so that is the prediction of the future, but it needs to be grounded in what it means to warfighting and what steps you're going to take to get there, in a way that people can believe you are making progress. Too often we do the 'trust me' thing, or we tend to go way too technical, and that is not necessarily useful to people who have to make decisions about money or decisions about employment. There are reasons that there are officers like me, with a heavy technical background, and there are officers who have different career paths, with political science backgrounds, with diplomacy backgrounds, with a study of strategy far more in-depth than mine. And communicating amongst those various communities is always a challenge. But I think that many programs are challenged by promising too much too soon, before they really know what they can do, rather than building their program in a way that says: I believe I can do this, these are the reasons why, and these are the steps or the touchstones along the way that will tell both you and me that we are getting there.
Moderator: It feels like spiral development is much more successful, in terms of you start small, you grow, you know, Meyer's concept of build a little, test a little, learn a lot gets you to a position where you're starting to establish an evidence base so you can make these decisions. You are providing decision-makers with an idea that this can deliver in the following ways, with very quantifiable metrics. And we've gone down that path in quite a proven and useful way, particularly when you look at missile development, maybe. I mean, if I look at how the Standard Missile has developed, it's been a sort of spiral path. We've done this. We've achieved this. We've got the next order. We can develop it here, and the fusing's changed, the seeker's changed, whether it's semi-active or active has changed, the booster has changed, the range has changed. We have done spiral development and at each stage we've provided evidence. Again, you know, it's Meyer's Aegis concept: build a little, test it, learn a lot, and gauge an improvement. But when you come to the stuff that people are talking about today, AI, automation, AI-enabled decision-making, these ideas that will somehow create from scratch something with transformative powers, that doesn't follow the spiral development path, right? It just promises a lot. And I worry that we're going to get to a situation where, no matter how much money we throw at it, these programmes cannot possibly deliver, nor can they provide the evidence that they will deliver. Do you think there's something in that?
Arch Macy: There is, and there are plenty of examples certainly. That is one of the challenges of doing all of that, and, to a certain degree, I would say that that is the task that has been handed to Rear Admiral Doug Small, who has been tasked by the CNO to come up with the Distributed Maritime Operations capability, the naval operational architecture, and the so-called Project Overmatch to pull all of that together. I had the great privilege of working with Doug in Iraq, and this is a highly capable, highly intelligent officer who does understand his challenge and I think is aware of what you're talking about.
But yes, I mean, a lot of promises, but they have not yet been quantified, and at some point in the near future I believe we're going to have to say, 'These are going to be the touchstones of measurement.' These days there's a lot of discussion of agile programming, where both requirements and solutions are developed side by side in collaborative efforts, and that's all fine. They use phrases like adaptive planning, evolutionary development, early delivery, and that's good and that's appropriate. That is spiral development. However, one of the risks is that if you don't do large-scale system specifications, but just continue to code to follow opportunities without boundaries, you won't be able to determine whether the system is operating outside of a specification, or whether it's even meeting your needs. So you can have deficient performance, or, with increasing automation, and back to the greater risk that General Petraeus pointed out, the system could be operating outside the boundaries of acceptable behaviour. It could enter what you could call a pathological state, and you won't know that. And if you don't have a general, and I don't mean loose, but a broad description of what the goal is and what the requirements are, then you won't know what to test for, or how to test to it to see if in fact it's going to behave in the ways that you want it to behave.
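The idea of a broad specification as something you can actually test against can be sketched in a few lines of code. Everything here is hypothetical: the track fields, thresholds and rules are invented purely for illustration and drawn from no real combat system. The point is that an automated decision function paired with an explicit, written envelope lets a test harness flag any decision that falls outside the specification, even when the logic ran exactly as coded.

```python
from dataclasses import dataclass

@dataclass
class Track:
    """A hypothetical air track; fields and units are illustrative only."""
    speed_mps: float    # target speed in metres per second
    range_km: float     # distance from own ship in kilometres
    iff_friendly: bool  # identification-friend-or-foe response

def decide_engage(track: Track) -> bool:
    """Stand-in for the automated decision logic under development."""
    return track.speed_mps > 300 and track.range_km < 40

def within_specification(track: Track, engage: bool) -> bool:
    """The broad, written boundary: regardless of what the logic decides,
    never engage a friendly and never engage beyond the stated envelope."""
    if engage and track.iff_friendly:
        return False
    if engage and track.range_km > 50:
        return False
    return True

# A test harness can sweep many tracks and flag any decision that
# violates the specification, even though the code 'worked as written'.
hostile = Track(speed_mps=600, range_km=20, iff_friendly=False)
friendly = Track(speed_mps=600, range_km=20, iff_friendly=True)
print(within_specification(hostile, decide_engage(hostile)))    # True
print(within_specification(friendly, decide_engage(friendly)))  # False
```

Without the separate `within_specification` boundary, agile iteration could change `decide_engage` freely and no test would ever report that the system had drifted into engaging things it must not.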
Moderator: And you raised the Vincennes one, which is really familiar to anyone in the air defence environment. But consider the idea, then, that we could hand this entirely to a machine, when we know that the coding contains the coders' own curiosities and biases. These systems carry a fingerprint of the coders who write them. So, even though we believe AI in a perfect world would eventually reach a sort of singularity, a US AI system will be different from a UK-developed system or a French-developed system. Indeed, the personality of the key coder that sits behind them is going to be really instrumental in how that AI system behaves. Even when it's perfected, even when it is doing its own learning, it will have some base assumptions, some curiosities, characteristics and biases that exist within it. So, in many ways we are coding our own human failings into AI systems, which means that future Vincennes-type incidents will happen. And they're just going to happen a hell of a lot faster than we're used to, because we won't have a human who's able to react at that speed. So, how could you, and how difficult is it, to test these systems in the future? How can we make them more reliable? Is there a way of understanding their boundaries of behaviour and performance?
Arch Macy: Well, you bring up a couple of points which I'll try to address. First of all, you talk about the human influence in the coding. That's true. There are two sets of human influences here. One is the influences or the biases of the people who are developing the system, particularly those who are working on the logic frameworks, the neural networks and so forth. Which way should they go? What is the right answer for various situations? The other is the operators, the people who are sitting on the consoles. And my bias, sitting the mid-watch, is going to be slightly different than your bias sitting the four to eight. How do those interact with the systems? Or how do we set up the systems in such a way that they minimise those differences to a point where training can enable us, you and I, to perform in a consistent fashion? So, that's one challenge there.
The other challenge is how the human and the system interact, the man-machine interface. Now, we don't have any completely autonomous systems out there right now. Well, there is one. Aegis has a mode known as Auto-Special, designed specifically to defend the ship against a very high-speed threat discovered late, close aboard, where the system goes into full automatic and engages whatever, within certain boundaries, it considers to be a threat. That has never been used in real life. It's been used in testing. Because of the fact that it is so capable, it almost operates on a system of: if it flies, it dies. And that's not acceptable in the law of warfare. It is such a critical mode that on Aegis ships only the commanding officer has the ability to engage it, and it takes a physical key to prevent accidental employment.
So, the question is, how autonomous does the system or the weapon system become? There is discussion about how you keep a man in the loop to ensure that you have ethical behaviour or legal behaviour. I wonder whether the man can keep up with the loop. And then yes, how do you test all of this? At a certain point, simulation becomes unrealistic, and so your challenge is to say, 'This is what the world might really look like.' And of course, the question is, does this compare to what the world will really look like in the event that this occurs?
Back to the human-machine interface, one more observation. In the case of the Vincennes tragedy, all of the information needed to not do what they did was present in CIC, what the British call the ops room. It just wasn't interpreted correctly by the right people in the right sequence. There were some system errors, but they were not, if you will, pathological or fatal. But in the end, the captain made the wrong decision, as history would prove. It's obviously what he thought was the right decision at the time. So, this question of how data is presented, how it is first of all parsed, what is important and what is not in an automatic system which has to look at a vast amount of data, how it is analysed, and how it is presented to the next decision-maker, be that human or silicon, is critical. And then you have to figure out how to test against that.
Moderator: So, one of the key points we've been talking about in previous conversations is that that's all very well for right now, and we can keep this human on the loop, in the loop, around the loop, however you want to call it. But when hypersonics start to play a real role in this, the speed of air warfare in particular in the future, and depending on how Russian submarine and torpedo technology goes, perhaps in the underwater space as well, means that you cannot have a human in the loop and live, effectively. You need to be able to hand this over to the final mode. Turn the key, let Aegis do its thing, let Aedels do its thing. Let your ship's command system, your air defence system, do its thing. Turn it over to auto, let it run, and you go back to relying on that coding, the preconceptions and the assumptions that were made by the people who programmed the system in the first place. That's going to happen, right? So, we've got to have a way of better testing the systems that we're provided with, if hypersonics mean that within five or ten years humans will no longer be in that loop. Do you think it's a realistic possibility that we're going to be able to test this effectively? Or do you think we're going to have to wait to survive contact with the first engagement?
Arch Macy: That depends on what you mean by test. If you mean to test all possible contingencies, the answer is absolutely not. Again, going back to my Aegis experience, this question applies to any modern, highly capable warfare system, but as you pointed out, when the demands of warfare make the system more and more capable and more and more automated, it becomes much more difficult. So, let me talk about Aegis and how Aegis does it, as an example, not, I'm afraid, to answer the question directly, but to say, 'This is the approach we're going to have to take.'
The Aegis combat systems fielded by the US Navy are certified when they go on the ship by a designated technical authority as to what their capabilities and limitations include. This comes as a result of testing, evaluation and analysis performed by the contractors, the technical engineering centres, the tactical trainers, all those who are responsible for surface ship warfighting. I had the distinct privilege of working for and with a man named Reuben Pitts, who is probably one of the best systems engineers and technical managers the Navy has had in the last 50 years at least. For a significant portion of that time, Reuben was responsible for overseeing the certification processes for the Aegis combat system. In discussions he would try to impart the degree of complexity this involved, and remember that this was twenty years ago. He used the following analogy. Since the Aegis combat system is computer-based, one can count the number of different data registers and memory locations included in the system. At any given moment, each of these locations contains data consisting notionally of ones and zeros. One clock tick later, some of these data values will have changed as the computation proceeds. Based on these results, the system will decide to take or not take certain actions. These actions can include choosing to engage or ignore an air track, or to recommend that a human do so. So, one question in determining if the system is fit for task is whether the system is making the right decision as it follows the programming. Another question is whether the system might, based on the patterns of ones and zeros distributed across the millions of data registers and memory locations, make a decision that is appropriate to the logic, but inappropriate to human desires and rules. In other words, that it might make a pathological decision.
So, the test and evaluation question becomes: how do you examine all the possible patterns of distributed ones and zeros to see if any will lead to a bad result? So, then, how many possible patterns are there? Reuben's analogy was this. By way of scale, the wire or plastic mesh insect screens that many of us have on the windows of our homes are usually woven in a pattern that results in a hundred openings per square inch. If each opening represented one snapshot in time of the digital state, the pattern of ones and zeros distributed across the Aegis system, then the question is, what size screen would it take to account for all of the possible states that the Aegis system could experience? The answer twenty years ago was that that fly screen would have to be larger than the Milky Way galaxy. Obviously, it is not possible to test every one of those digital states to see if it produces a pathological result, an action or decision that's unacceptable. So, the developers and testers are always looking for ways to bound the problem. These approaches include coding to block unacceptable decisions, training crews to understand what the system can and should do and what it should not do, and testing to see that the system performs properly given appropriate state data, and does not perform inappropriately given other data. But in the end, the technical community, the operational community and the senior decision-makers need to develop enough knowledge of the system that they can form and agree on what Mr Pitts calls a balanced assessment of risk: that everyone concerned has sufficient understanding of the system, operator interactions and capabilities, and acceptable confidence, within what they can test and within likely acceptable boundaries, that the system will remain predictable and adhere to tactical orders and rules of engagement.
But it's a balanced assessment of risk based on what you can do, what you understand of the system, and to what degree you can tolerate the fact that you are wrong.
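The scale in that fly-screen analogy can be sanity-checked with a short back-of-envelope calculation. This is a minimal sketch using my own round figures for mesh density and galaxy size, not numbers from the conversation: it shows that a couple of hundred bits of binary state are already enough to generate more possible patterns than a Milky-Way-sized fly screen has openings, while a real combat system holds billions of bits.

```python
import math

# Back-of-envelope sketch of the 'fly screen' scale analogy. The mesh
# density and galaxy dimensions below are illustrative round figures.

OPENINGS_PER_SQ_INCH = 100        # typical window-screen mesh
METRES_PER_LIGHT_YEAR = 9.461e15
MILKY_WAY_RADIUS_LY = 50_000      # ~100,000 light-year diameter disc
SQ_INCHES_PER_SQ_METRE = 1550.0

# Total openings in a screen covering the Milky Way's disc
disc_area_m2 = math.pi * (MILKY_WAY_RADIUS_LY * METRES_PER_LIGHT_YEAR) ** 2
openings = disc_area_m2 * SQ_INCHES_PER_SQ_METRE * OPENINGS_PER_SQ_INCH

# Smallest number of bits of binary state whose 2**n possible
# patterns already exceed that number of openings
bits = math.ceil(math.log2(openings))

print(f"openings in a Milky-Way-sized screen: ~{openings:.1e}")
print(f"bits of state needed to exceed it: {bits}")
```

Because the state count grows as two to the power of the bit count, exhaustive testing is ruled out almost immediately, which is why the discussion turns to bounding the problem, blocking unacceptable decisions and a balanced assessment of risk instead.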
Moderator: I am fascinated by this, Arch, and whilst you and I might talk very happily about this in relation to Aegis, because we're a couple of matelots and we like to dit on, it applies equally to lines of coding in the F-35, to the E-7, to TBMD systems, to a HIMARS fire control system or the software that runs the latest variants of Apache. I mean, the heart of the problem, it seems to me, is that in the future we're deluding ourselves if we think we're ever able to adequately test and evaluate every eventuality in every system. We are going to have to make some presumptions and assumptions about performance. And in this, we're going to have to make a trade-off between how predictable and bounded we want things to be, and the utility that we need them to have when it comes to warfighting and experiences that we don't quite understand.
And at the heart of this, it appears to me, there is a degree of trust that needs to exist between the operator, the commander, the human, and what we expect the system to be able to perform to. It's an issue that we at RUSI have been talking about for a while; we're going to run a big research project on this later this year, with a report, about what this idea of trust looks like when we can no longer quantify what we're trusting in. When that thing we're trusting, as you said, has ones and noughts that change every millisecond, let alone with every patch and update.
It is a genuinely fascinating question, but we have come to the end of our available time. Arch is going to be speaking at the RUSI BMD conference on the 13th and 14th of May. This year's virtual conference deals specifically with precision strike in 21st-century multi-domain operations, and tickets are available now on the RUSI website. Arch is also working on a paper co-authored with RUSI's own Sid Kaushal and Ali Stickings. Arch, it's been a real pleasure to have you with us, and I look forward to reading and hearing more from you over the next few weeks.
Arch Macy: Thank you very much, Peter, both for the conversation and the opportunity.
Moderator: You can find our show on all major podcasting platforms including iTunes and Spotify. Your downloads regularly place us in the top three per cent of nearly two million podcast shows globally. More and more of our listeners seem to go back and dig into previous episodes, both from this series and from series one, and why not? The guests are superb. You can always draw out a new line of thinking to delve into. Thanks for all your feedback, good and bad; we're currently shaping season three of the podcast as we record this one. Much of our approach is based on the suggestions that you've been sending us. I do need to remind you that the accompanying digital output on the way adversaries think about conflict is available in a series called Adversarial Studies. You can find that and our other digital outputs at RUSI.org/professionofarms. You might also consider becoming a member of RUSI. The institute was founded by the Duke of Wellington a few years after the Battle of Waterloo to counter the institutional and systemic bureaucracy he found in the War Office. His agenda of free thinking, stimulating intellectual curiosity and challenge remains at the heart of our work. We receive no core funding from the UK government, MOD or military, and we are a charity, so we can't make a profit. Our aim is to provide you with an opportunity to grow and improve as a member of the profession of arms. You can find details at RUSI.org/membership. This show is produced by Peppi Vaananen and Kieron Yates and is sponsored by Raytheon UK. Thanks for listening.
Western Way of War Podcast Series
A collection of discussions with those in the Profession of Arms that tries to understand the issues around how to fight, and succeed, against adversaries in the 2020s. We pose the questions of whether a single Western Way of Warfare (how Western militaries fight) has been successful, whether it remains fit for task today, and how it might need to adapt in the future. It is complemented by the 'Adversarial Studies' project that looks at how adversaries fight.
Professor Peter Roberts
Director, Military Sciences