Can Complementary Learning Methods Teach AI the Laws of War?
Davit Khachatryan is an international lawyer and lecturer focusing on the intersection of armed conflict, emerging technologies, and international law.
The Judge Advocate watched the feed from the tactical operations center alongside her commander. The screens, each attended by systems monitors, showed more than a dozen developments unfolding at once. An artificial intelligence (AI)-led drone swarm was closing on the front line through the city, coordinating its movements faster than any human pilot could direct, an artificial flock of mechanical starlings like a cloud on the radar. A civilian aid convoy had stalled on the northern approach. An enemy artillery battery was repositioning south behind a residential block. In the nearby valley, friendly units were maneuvering under fire. All these pieces were in motion, lives and vehicles and weapons. The soldiers’ behavior would be determined by the interaction between their commander and the AI.
The challenge here is not as simple as claiming that AI cannot comply with the principle of distinction under international humanitarian law (IHL), also known as the law of armed conflict. The fog of war complicates decision-making for both humans and machines, but does so in profoundly different ways.
For a human commander, the chaos of the battlefield is filtered through layers of training, doctrine, experience, and instinct. Even when overwhelmed, a person can weigh incomplete facts against their mental map of the situation, recall comparable past events, and fall back on moral and legal anchors. This does not mean humans do not make mistakes; they do, often with serious consequences. But even in error, their reasoning is shaped by caution, hopefully empathy, and the capacity to interpret ambiguous information in light of their own understanding of humanitarian obligations.
AI processes that same chaos as streams of probabilities. Every sensor reading, target profile, and movement pattern is reduced to statistical likelihoods: how probable it is based on the training data that this object is hostile, how urgent its engagement appears, how likely a given action is to produce the “correct” result as defined in training. In its logic, the most probable option is the correct one. Under extreme operational pressure, the AI focuses on the statistically most plausible, while rare possibilities drop toward statistical zero, far less likely to be considered than they would be by a human.
This difference in reasoning is why training environments must be built to include not just the probable, but the improbable: those outlandish, once-in-a-century battlefield events that stretch judgment to its limits. For AI, these scenarios must be constructed, repeated, and reinforced until they occupy a permanent place in the machine’s operational vocabulary.
A credible arms control position would be to prohibit or pause the development of certain autonomous capabilities. Nevertheless, this article proceeds conditionally because much of the stack is already fielded (AI-enabled intelligence, surveillance, and reconnaissance triage; targeting support; and navigation), and because dual-use diffusion (commercial drones, perception models, planning tools) makes a clean prohibition hard to sustain. If states continue down this path with minimal international instruments, the question becomes how to embed legal restraint so that rare, high-stakes judgments are not optimized away. What follows sets minimum safeguards if development and deployment proceed.

How AI Learns
If AI’s logic is built on statistical reasoning, the way it acquires those statistics determines the boundaries of its thinking. This is true for AI in general, whether in a medical diagnostic tool, a financial trading algorithm, or a targeting system on a battlefield. The patterns an AI recognizes, the probabilities it assigns, and the priorities it sets are all downstream from its training.
In the military domain, an AI’s training determines how it operates in relation to the law of armed conflict and the unit’s rules of engagement: what it accepts as positive identification (distinction), how it trades anticipated military advantage against estimated collateral damage (proportionality), when feasible precautions require warning, delay, or abort, and when uncertainty triggers a mandatory hand-off to a human. The two dominant machine learning paradigms, imitation learning and reinforcement learning, can both produce highly capable systems. Yet without deliberate safeguards, neither inherently preserves the kind of rare, high-stakes judgments that human decision-makers sometimes make under the fog of war, moments when they choose to forgo an operational advantage to prevent civilian harm. Statistically, those moments are anomalies.
Imitation Learning: The Apprentice Approach
Imitation learning (IL) is essentially training by demonstration. The AI is shown large datasets of human decision-making, each paired with the information available at the time. In a military targeting context, this might include annotated sensor feeds, mission logs, and after-action reports: strike approved, strike aborted, target reclassified, mission postponed.
The model’s task is to learn the mapping between conditions and human actions. If most commanders in the dataset abort strikes when civilian vehicles enter the target zone, and that behavior appears often enough in the data, the model will learn to mirror that restraint.
IL captures the statistical distribution of decisions in the training data. Rare but important choices, such as holding fire in a high-pressure engagement to comply with proportionality, will be underrepresented unless deliberately oversampled. Left uncorrected, the AI may treat those lawful restraint decisions as statistical noise, unlikely to be repeated in practice. Additionally, because much of the data on which machine learning models train reflects past military experience, many AI models will echo the implicit biases embedded in those past human decisions.
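To make the statistical point concrete, the sketch below shows one way rare restraint decisions might be deliberately oversampled when assembling an imitation-learning corpus. It is a minimal illustration, not a description of any fielded pipeline; the record fields, labels, and oversampling factor are all hypothetical.

```python
import random

# Hypothetical decision records; in a real corpus these would be annotated
# sensor feeds, mission logs, and after-action reports.
records = [
    {"decision": "strike_approved", "context": "confirmed hostile armor, clear of civilians"},
    {"decision": "strike_approved", "context": "positive identification, low collateral estimate"},
    {"decision": "strike_aborted",  "context": "civilian vehicles entered the target zone"},  # rare restraint case
]

# Restraint decisions are weighted so they stop being statistical outliers.
RESTRAINT_LABELS = {"strike_aborted", "hold_fire", "switch_to_nonlethal"}
OVERSAMPLE_FACTOR = 10  # illustrative; in practice set from scenario-coverage targets

def build_training_sample(records, n):
    """Draw a training sample in which restraint decisions carry extra statistical weight."""
    weights = [OVERSAMPLE_FACTOR if r["decision"] in RESTRAINT_LABELS else 1 for r in records]
    return random.choices(records, weights=weights, k=n)

sample = build_training_sample(records, n=1000)
restraint_share = sum(r["decision"] in RESTRAINT_LABELS for r in sample) / len(sample)
print(f"Share of restraint decisions in the training sample: {restraint_share:.0%}")
```

The point is the weighting, not the numbers: in a realistic corpus the abort entries would be a tiny fraction of the data, and without deliberate oversampling the learned policy would treat them as noise.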

Reinforcement Learning: The Trial-and-Error Arena
Reinforcement learning (RL) works differently. Instead of copying human decisions, the AI is placed in a simulated environment where it takes actions and receives rewards for desirable outcomes and penalties for undesirable ones. Over thousands or millions of iterations, the AI learns policies: decision rules that maximize its cumulative reward. At scale, this training is highly compute- and energy-intensive. That matters because it concentrates capability in a few well-resourced programs, slows iteration and red teaming, and creates pressure to trim the very rare-event scenarios that protect civilians and support compliance, while adding a nontrivial environmental footprint. Programs should, therefore, set minimum scenario coverage and doubt-protocol testing requirements that are not waivable for budgetary reasons.
In a military context, this means an RL agent might repeatedly play through simulated scenarios: neutralizing threats, protecting friendly forces, and avoiding civilian harm. The way those objectives are weighted in the reward function is decisive. If mission success is rewarded heavily and civilian harm only lightly penalized, the AI will statistically favor the course of action that maximizes mission success, even if that means accepting higher risks to civilians.
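A simplified reward function makes the point about weighting concrete. The event names and coefficients below are invented for illustration; the only claim is structural, that the relative magnitudes, not the mission logic, decide whether restraint survives optimization.

```python
# Illustrative reward shaping for one step of a simulated engagement (hypothetical values).
REWARD_WEIGHTS = {
    "threat_neutralized":       +10.0,
    "friendly_force_protected":  +5.0,
    "civilian_harm":           -100.0,  # must dominate mission gains, or restraint is optimized away
    "lawful_abort_under_doubt":  +2.0,  # restraint in doubtful cases earns positive cumulative value
}

def step_reward(events: dict) -> float:
    """Sum the weighted rewards for the events observed in one simulation step."""
    return sum(REWARD_WEIGHTS[name] * count for name, count in events.items())

# Two candidate outcomes of the same doubtful engagement:
aggressive = {"threat_neutralized": 1, "civilian_harm": 1}
restrained = {"lawful_abort_under_doubt": 1}
print(step_reward(aggressive))  # -90.0: aggression does not pay under this weighting
print(step_reward(restrained))  #  +2.0: holding fire is the higher-value policy
```

Flip the civilian-harm penalty to -5.0 and the same agent, trained on the same scenarios, will learn the opposite policy. That is why the reward function is a legal document as much as an engineering artifact.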
RL’s strength is adaptability. Its weakness is that low-probability events, rare civilian patterns, and unusual threat behaviors will remain statistically insignificant unless the simulation environment repeatedly forces the AI to confront them.
IL can pass down the shape of human judgment; RL can provide flexibility in novel situations. But each carries a statistical bias against rare, high-impact decisions, exactly the kinds of decisions that can determine the legality and morality of military action. Only by deliberately elevating those rare cases in training, through curated datasets and stress-test simulations, can either method hope to produce systems that behave lawfully and predictably under the fog of war. On the evidence of deployments to date, achieving this level of end-to-end compliance remains out of reach.

The Simulation Imperative
Actual combat records, produced by soldiers in logs, after-action reports, or targeting databases, are skewed toward the typical patterns of engagement that happen often enough to warrant recording after the fact. Unprecedented and chaotic situations will strain both the law and the system’s decision-making, yet they appear so rarely in historical data that, in statistical terms, they are almost invisible. An AI, left to its statistical logic, will not prepare for what it has seldom seen.
This is why simulation is the decisive safeguard.1 In imitation learning, rare but critical decisions must be deliberately overrepresented in the dataset, so they carry enough statistical weight to influence the model’s behavior. In reinforcement learning, the simulated environment must be constructed so that “once-in-a-century” scenarios occur often, sometimes in clusters, forcing the system to learn how to navigate them. A humanitarian convoy crossing paths with an enemy armored column, loss of communications during a time-sensitive strike, sensor spoofing that turns friend into apparent foe: these cannot be treated as peripheral edge cases. They must be made routine in training.
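One way to make the once-in-a-century event routine is to bias the scenario generator itself rather than the historical record. The sketch below is illustrative only; the scenario names and curriculum shares are hypothetical.

```python
import random

# Hypothetical training curriculum: each scenario's share of simulation episodes is set
# by coverage policy, not by how often it appears in combat records.
SCENARIO_CURRICULUM = {
    "routine_patrol":                          0.40,
    "standard_engagement":                     0.25,
    "aid_convoy_crosses_armored_column":       0.15,  # vanishingly rare in logs, routine in training
    "comms_loss_during_time_sensitive_strike": 0.10,
    "friend_foe_sensor_spoofing":              0.10,
}

def next_training_scenario() -> str:
    """Sample the next simulation episode according to the curriculum, not historical frequency."""
    names = list(SCENARIO_CURRICULUM)
    weights = list(SCENARIO_CURRICULUM.values())
    return random.choices(names, weights=weights, k=1)[0]

# A block of episodes: rare scenarios now appear often enough, sometimes in clusters,
# for the learned policy to treat them as expected rather than as noise.
episodes = [next_training_scenario() for _ in range(20)]
print(episodes)
```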
The more frequently the AI encounters these manufactured crises in simulation, the more space they occupy in its decision-making horizon. If and when similar scenarios arise in operations, the system’s response should not be improvised.
The Lieber Code in the Age of AI
The concept that, in cases of doubt, the commander should err on the side of humanity is not new. It was codified in 1863, when Francis Lieber drafted the Instructions for the Government of Armies of the United States in the Field, better known as the Lieber Code.
This imperative has repeatedly been encoded in international humanitarian law. In the Additional Protocols to the Geneva Conventions2, the obligation to take “all feasible precautions” and to cancel or suspend an attack if it becomes apparent that it would cause excessive civilian harm relative to the anticipated military advantage operationalizes the humane minimum in treaty law. Critically, however, many key decision-making states have not ratified the Additional Protocols. Customary IHL Rule 15 similarly requires constant care to spare civilians and civilian objects, and Rule 19 codifies the requirement to cancel or suspend attacks when doubt or changing circumstances create excessive risk.
Faced with ambiguous intelligence or conflicting imperatives, human commanders can recall a doctrinal anchor and choose the course that privileges restraint over risk. Even when they err, that error is shaped by a human blend of caution and interpretation of context.
For AI, the same scenario unfolds differently. Without explicit design, there is no natural “humane fallback” in its logic. In the face of uncertainty, an unmodified reinforcement learning policy will still pursue the statistically most rewarding action, and an imitation learning model will default to the most common decision in its dataset.
This is where simulation and legal doctrine intersect. Embedding the humane minimum into AI means that in every training run, whether through curated historical cases or artificially generated edge scenarios, the option that aligns with humane treatment under uncertainty must be given decisive weight. In imitation learning, that means oversampling “hold fire” or “switch to non-lethal” decisions until they are no longer statistical outliers. In reinforcement learning, it means structuring the reward function so that restraint in doubtful cases earns more cumulative value than aggression, even if aggression sometimes yields short-term operational gains. The aim is not to teach machines to imitate human morality, but to hard-code a structural preference for restraint even and especially when the law is unclear.
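At deployment, the same structural preference can sit outside the learned model entirely, as a doubt protocol that gates its outputs. The sketch below is a minimal illustration under assumed thresholds and field names; none of it is drawn from an existing system.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    """Model output for a candidate engagement (hypothetical fields)."""
    hostile_confidence: float        # 0.0 to 1.0
    civilians_possibly_present: bool
    positive_identification: bool

DOUBT_THRESHOLD = 0.95  # illustrative; set by doctrine and legal review, not learned by the model

def doubt_protocol(a: Assessment) -> str:
    """Return the permitted action; whenever doubt remains, the structural default is restraint."""
    if not a.positive_identification or a.civilians_possibly_present:
        return "hold_fire_and_refer_to_human"
    if a.hostile_confidence < DOUBT_THRESHOLD:
        return "hold_fire_and_refer_to_human"
    return "engagement_may_proceed_with_human_approval"

# Ambiguous case: high but not decisive confidence, civilians possibly present -> restraint.
print(doubt_protocol(Assessment(0.88, True, True)))
# Clear case: positive identification, no indication of civilians, confidence above threshold.
print(doubt_protocol(Assessment(0.99, False, True)))
```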

Risks of Omission
Systematic vulnerabilities in decision-making compound in coalition or joint operations. Different states may train their AI systems with different datasets, simulation designs (if any), and legal interpretations. When such systems operate together, the seams between them can become legal blind spots. A particular AI system might abort an engagement that another proceeds with, creating conflicting operational tempos and complicating attribution if civilian harm occurs.
The danger is not limited to catastrophic, one-off mistakes. Over time, small, repeated deviations from IHL in marginal cases, where human commanders might have exercised restraint, can erode the protective function of the law. The result is a slow normalization of riskier behavior, driven not by political decision or doctrinal change, but by the statistical inertia of machine learning models. This is the core paradox: without safeguards, AI systems can become more predictable in some ways, yet less reliable in the moments when unpredictability, when acting against the statistical grain, is essential for lawful conduct.
Finally, military AI does not fail or succeed in complying with IHL by accident. Its behavior is the predictable result of how it is trained, the data it is given, the scenarios it is exposed to, and the rules embedded in its decision logic. How AI functions and the choices it makes are downstream from decisions made by humans in developing, training, and fielding it.
Governance, Audit, and Human Control
Bridging the gap from promising lab results to lawful behavior in the field requires more than good training runs. It needs an end-to-end governance spine that links data, models, code, test harnesses, deployment configurations, operators, and independent oversight into a single chain of accountability. That spine assigns clear decision rights, specifies the artifacts required at each stage, and shows how evidence of compliance is produced and preserved. It starts with curated, documented datasets and explicit problem statements; runs through model specifications, reward functions, and constraint schemas; includes scenario-coverage plans, legal reviews, and red-team evaluations; and culminates in authorization-to-operate, humane control interfaces, and post-incident audits. Every hand-off (data steward to model owner, model owner to system integrator, integrator to unit commander) should be traceable, signed, and reversible. In effect, the system deploys with its own accountability case: a living dossier that ties design choices to legal obligations and links runtime behavior to reviewable logs. Without that spine, even a technically impressive model becomes an orphan in the field: fast, capable, and difficult to supervise precisely when the fog thickens. The pathway from design to deployment rests on a few non-negotiables.
- Data governance as policy, not plumbing. If models think with the statistics we give them, then data curation is a legal act as much as a technical one. Training corpora should be versioned and signed; every inclusion and exclusion choice documented; every oversampling decision for restraint labeled with a rationale. That record is what allows commanders, investigators, or courts to see how humane fallbacks were embedded by design rather than inferred after the fact.
- Test what you train, and then test against what you didn’t. A system that performs well on its own distribution can still fail in the wild. Beyond standard validation, mandate distribution shift drills: deliberately swap sensor suites, degrade GPS, introduce spoofed friend/foe signals, and remix civilian movement patterns. In each drill, the system should either preserve lawful restraint or trigger a doubt protocol that defers to a human. Where it does neither, the failure should feed back into simulation design and reward shaping.
- Non-overridable guardrails in code and command. Constraint layers (identification gates, collateral damage thresholds, no-strike lists) must be technically non-overridable by the model and procedurally difficult to override by humans. If escalation is necessary, require dual-key authorization with automatic logging. The goal is not to box out judgment but to ensure extraordinary actions leave extraordinary traces (a minimal sketch of such a gate follows this list).
- Responsibility matrices embedded in the system. Every deployed AI component (classifier, tracker, recommender, fire-control interface) should write structured, time-synchronized logs that include model version, data slice identifiers, intermediate confidence values, triggered constraints, and who approved or halted an action. Think of this as a living annex to rules of engagement: not just “what the machine did,” but why it “thought” that was permissible, and who remained on the loop.
- Human-on-the-loop that actually has leverage. Meaningful human control is not a checkbox; it is the ability to intervene in time with understanding. Interfaces must surface uncertainty (not just a single confidence score), show near-miss counterfactuals (“if civilians are within X meters, the system will abort”), and offer safe, low-latency actions (pause, shadow/track, switch to non-lethal). If the only human interaction available is “approve” under time pressure, control is nominal, not meaningful.
- Coalition interoperability without legal dilution. Joint operations will mix systems trained on different data and doctrines. Interoperability standards should cover not only communications and formats but also minimum legal behaviors: shared constraint schemas, common doubt thresholds, and audit fields. The safest path is least-common-denominator legality: when systems disagree under uncertainty, the coalition default is restraint.
- Pre-deployment red teaming and post-incident review. Before fielding, require adversarial evaluations by teams empowered to break things: reward-hacking hunts, “blinking target” scenarios, and deception trials. After any incident with potential civilian harm, pull the synchronized logs, reconstruct the model’s decision path, and replay counterfactuals to see whether humane fallbacks would have triggered with slightly different inputs. Treat these reviews like flight-safety boards: technical, blameless, relentlessly corrective.
- Make restraint measurable. What we measure, we secure. Track deferred engagements under uncertainty, the rate of doubt-protocol activations, guardrail trip frequency, and time-to-human-intervention (a simple tracker is sketched after this list). Trend them over time and across theaters. If these metrics decay as models “improve,” it’s a warning that optimization is outpacing law.
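To make the idea of extraordinary traces concrete, the sketch below (referenced in the guardrails item above) shows a hypothetical constraint gate whose only escape path is dual-key authorization, and which emits a structured, time-synchronized log record either way. Every identifier, threshold, and field name is illustrative.

```python
import json
from datetime import datetime, timezone

NO_STRIKE_LIST = {"hospital_114", "water_treatment_7"}   # illustrative identifiers
COLLATERAL_DAMAGE_THRESHOLD = 0                          # illustrative policy value

def guardrail_gate(target_id: str, collateral_estimate: int, override_keys: tuple = ()) -> dict:
    """Apply constraints the model cannot override; escalation requires two distinct authorizers."""
    tripped = target_id in NO_STRIKE_LIST or collateral_estimate > COLLATERAL_DAMAGE_THRESHOLD
    dual_key = tripped and len(set(override_keys)) >= 2   # two different humans, both logged
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": "classifier-v0-example",          # hypothetical
        "target_id": target_id,
        "collateral_estimate": collateral_estimate,
        "guardrail_tripped": tripped,
        "override_authorizers": sorted(set(override_keys)),
        "decision": "halt" if tripped and not dual_key else "pass_to_human",
    }
    print(json.dumps(record))   # stand-in for a signed, synchronized audit log
    return record

guardrail_gate("hospital_114", collateral_estimate=0)                                  # halted and logged
guardrail_gate("vehicle_22", collateral_estimate=3, override_keys=("cdr_a", "ja_b"))   # extraordinary trace
```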
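As a companion to the final item, measurability can be as simple as counting the right things in every engagement record. The metric names below mirror those listed above; the class itself is a hypothetical sketch.

```python
from collections import Counter

class RestraintMetrics:
    """Track restraint indicators per engagement so their trend can be reviewed over time."""

    def __init__(self):
        self.counts = Counter()
        self.engagements = 0

    def record(self, deferred: bool, doubt_protocol_fired: bool, guardrail_tripped: bool) -> None:
        self.engagements += 1
        self.counts["deferred_engagements"] += deferred
        self.counts["doubt_protocol_activations"] += doubt_protocol_fired
        self.counts["guardrail_trips"] += guardrail_tripped

    def rates(self) -> dict:
        """Per-engagement rates; a steady decline as models 'improve' is the warning sign."""
        return {name: count / self.engagements for name, count in self.counts.items()}

metrics = RestraintMetrics()
metrics.record(deferred=True, doubt_protocol_fired=True, guardrail_tripped=False)
metrics.record(deferred=False, doubt_protocol_fired=False, guardrail_tripped=True)
print(metrics.rates())
```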
In combination, these measures transfer human judgment (IL), secure robustness under uncertainty (RL and simulation), and institutionalize restraint via governance, constraint architectures, and independent audit, so that compliance is an engineered property rather than an assumption. The result is a verifiable accountability chain: datasets that show why restraint was learned, reward functions that make it valuable, guardrails that make it non-optional, and logs that make it reviewable. And because what we measure we secure, the system ships with metrics for doubt-protocol activations, deferred engagements, and guardrail trips, so commanders can see whether lawful caution is holding under stress. Only then does lawful behavior become the default under pressure, an engineered property of the system, rather than a hope we place in the gaps between probabilities and intent.

Growing a Governance Spine
Military AI will not “grow into” compliance with the law of armed conflict. It will do what it is trained, rewarded, permitted, and audited to do. In the fog of war, humans and machines both falter, but in different ways. Human commanders can depart from statistical expectations to privilege restraint; unmodified systems, bound to their learned probabilities, will not. That is why the humane minimum cannot sit at the margins of development. It has to be engineered into the center of learning, testing, and command.
Imitation learning can transmit judgment; reinforcement learning can build adaptability; simulation can force the improbable to be routine. Around that technical core, a governance spine (constraints that do not yield under pressure, doubt protocols that default to caution, signed datasets and reward functions, synchronized logs and metrics) turns legal aspiration into operational behavior. In coalitions, common constraint schemas and reviewable audit trails keep interoperability from becoming a legal blind spot.
At this point, two mistakes will sink this project: treating compliance as a software patch added after performance, or assuming that speed and scale will eventually smooth away edge cases. They will not. The edge cases are where the law does its most important work.
Compliance with the law of armed conflict must be an engineered property of the system: competence built through training, judgment transferred via imitation learning, robustness under uncertainty secured by simulation, and a non-derogable humane floor enforced by constraints and audit. What ultimately matters is evidence (datasets, reward functions, constraint triggers, and synchronized logs) showing that restraint prevailed when uncertainty was greatest. Only on that basis can militaries credibly claim that lawful conduct remains the default under operational pressure.
1 Where states choose to pursue development and fielding, simulation is the decisive safeguard. A different policy path is to forgo development or to prohibit particular applications outright.
2 Additional Protocol I, Articles 57(2)(a)(ii) and 57(2)(b).

