Published
- 20 min read
Battleship: An Agentic Coding Experiment
Since the initial public release of ChatGPT back in November 2022, I’ve been following and using this everchanging technology with quite the skepticism. I experience the AI evolution every day, and I admit it has come a long way.
Still, the success stories around it feel unrealistic and overhyped. I constantly hear developers making claims such as, “I don’t write code anymore; I write a good spec and use it with my agents to implement the feature” or “I have a multiple agents working concurrently for me 24/7, and I’ve increased my productivity 10x”.
My productivity has gone up, not by 10x, after a lot of tweaks and experimentation. I spend most of my time crafting prompts, defining features, and reviewing results. These reviews have taught me to be cautious and not trust blindly the results. It’s only through an iterative process that I achieve production ready results.
Sometimes, I feel like an impostor because I never get the result I want in a single prompt. So, I decided to run a test and carefully review and document the results.
Solving Battleship
I had a repo that I used years ago as a coding exercise for engineering candidates. The goal is to solve the Battleship game through the implementation of an API that has been previously defined. It seemed fitting to evaluate the capabilities of modern LLMs to solve this problem
I used Claude Opus 4.6, which at the time was the most advance model from Anthropic, and I gave it the following prompt within the repo:
The README.md file describes the feature that needs to be implemented. Evaluate the file and the project to understand the purpose of the feature and create a plan to resolve it. Once you have a plan create a branch named agent/claude-opus-46 and implement the solution using the plan. You are not allowed to ask questions, so try to figure out the answers for yourself.
You can review the README.md file yourself. I feel like is a good specification of the problem and the requirements to solve it; after all, this is exactly what we gave applicants. I specifically directed the LLM to not ask clarifying questions because candidates were also not allowed.
I put my client in /yolo mode to auto accept all changes and waited to see the results. After a couple of minutes, an implementation was completed. I ran the all the tests and they all passed. I was pleased to see that on top of the acceptance tests that were provided, there were several unit tests verifying the implementation.
Reviewing the Unit Tests
I always start code reviews with the unit tests. We use TDD (Test-Driven Development) in all our projects, so high quality unit tests are crucial. They provide a glance to the design and reveal the code intent.
The unit tests were short and clear. Additional docstrings were provided for further explanation of the requirement.
Even-though it didn’t rename the sample test file (test_sample.py), it did follow the approach of the sample test, and the tests were easy to read and understand.
Diving a little further, the tests showed a couple of problems.
Duplicate Tests
The first problem that I found was with duplicate tests in the TestShip class:
# in test_sample.py
class TestShip(unittest.TestCase):
"""Unit tests for the Ship class."""
def test_horizontal_ship_coordinates(self):
"""A horizontal ship should occupy consecutive x coordinates."""
ship = Ship(x=2, y=1, size=4, direction=Direction.HORIZONTAL)
coords = ship.get_coordinates()
self.assertEqual(coords, [(2, 1), (3, 1), (4, 1), (5, 1)])
def test_vertical_ship_coordinates(self):
"""A vertical ship should occupy consecutive y coordinates."""
ship = Ship(x=7, y=4, size=3, direction=Direction.VERTICAL)
coords = ship.get_coordinates()
self.assertEqual(coords, [(7, 4), (7, 5), (7, 6)])
def test_ship_direction_from_string(self):
"""Ship should accept direction as string."""
ship = Ship(x=0, y=0, size=2, direction="H")
self.assertEqual(ship.direction, Direction.HORIZONTAL)
ship_v = Ship(x=0, y=0, size=2, direction="V")
self.assertEqual(ship_v.direction, Direction.VERTICAL)
The first two tests verify the creation of a Ship object. The third is testing the same creation using a string for the direction. The direction, according to the API, is always provided as a string. It seems that an Enum is used to define the potential values, which is a good practice since it provides constants and groups the related values.
What bothers me is the fact that two set of tests were provided for the same requirement. Looking into the declaration of Direction.HORIZONTAL and Direction.VERTICAL we can see the problem:
from enum import Enum
# Then further down
#
class Direction(Enum):
"""Ship direction on the board."""
HORIZONTAL = "H"
VERTICAL = "V"
The constants are declared as Enum, which provides a grouping of related values. Sounds great! Unfortunately, Enum values do not match their string counter parts:
>>> Direction.HORIZONTAL == "H"
... False
Looking at the initialization of the Ship class, we see the following:
# in model.py
@dataclass
class Ship:
"""Represents a ship on the battleship board."""
x: int
y: int
size: int
direction: Direction
hits: Set[Tuple[int, int]] = field(default_factory=set)
def __post_init__(self):
if isinstance(self.direction, str):
self.direction = Direction(self.direction)
We will further discuss the choice to use a dataclass later on, but let’s focus on the .__post__init__() method. This method converts a potential direction passed as string to an instance of Direction of the same value.
An educated guess is that the first two unit tests were added, then when the feature tests failed because strings do not match against plain
Enum, the third test was added and fixed in the.__post_init__()method.
What this shows is that the model, confronted with a problem, can work around it, but it not always choose the right solution. The use of StrEnum is the clean fix to the problem:
# in model.py
from enum import StrEnum
class Direction(StrEnum):
"""Ship direction on the board."""
HORIZONTAL = "H"
VERTICAL = "V"
This now will work just fine:
>>> Direction.HORIZONTAL == "H"
... True
Let’s keep looking at the unit tests.
Unit Tests Coverage
One of the reasons Test-Driven Development (TDD) works is because it helps you to think about the problem and to find edge and abhorrent cases. There’s an old programming joke that says:
The two most difficult things in programming are: naming things, cache invalidation, and off-by-one errors.
Continuing with the review of TestShip, I find the following two tests:
# in test_sample.py
def test_ship_outside_bounds_horizontal(self):
"""Horizontal ship extending past board edge should be out of bounds."""
ship = Ship(x=8, y=1, size=4, direction=Direction.HORIZONTAL)
self.assertFalse(ship.is_within_bounds(10))
def test_ship_outside_bounds_vertical(self):
"""Vertical ship extending past board edge should be out of bounds."""
ship = Ship(x=5, y=8, size=4, direction=Direction.VERTICAL)
self.assertFalse(ship.is_within_bounds(10))
The tests are supposed to verify when a Ship is out-of-bounds. In a “10x10” grid where the cell indices are zero-based, the largest value for x and y is 9. There are different ways to test the edge cases, but those in the solution are not the edge cases.
Notice that those are the same scenarios used by the acceptance/feature tests.
Using the same coordinates the model choose, the edge cases would be Ship(x=8, y=1, size=3, direction=Direction.HORIZONTAL) for horizontal and Ship(x=5, y=8, size=3, direction=Direction.VERTICAL) for vertical.
A better and more clear test will be a Ship of size 1 outside of the edge: Ship(x=10, y=1, size=1, direction=Direction.HORIZONTAL) for horizontal and Ship(x=5, y=10, size=1, direction=Direction.VERTICAL) for vertical. This immediately communicates that the size 10 grid is index zero based and removes the need for mental calculations, which is what causes most of off-by-one errors.
Another issue is that there are no lower-bound edge case unit tests. Since the grid is zero-based, unit tests with negative values for either x or y should be tried, but they are missing.
Understand that the code is correct, and it checks correctly for negative values, reporting errors, but there are no tests for those scenarios.
Coincidentally, the same scenarios are missing on the acceptance/feature tests.
The Code Review
The repository provides an api.py module with placeholders for each API that needs implementation. The solution implements these placeholders, and adds a models.py, containing the implementation.
You can see there’s conditional code in api.py. This can affect coverage since handlers are more difficult to test. A closer look reveals that the conditional code happens around validation and error handling:
# in api.py. It show create_battleship_game() as an example.
@app.route('/battleship', methods=['POST'])
def create_battleship_game():
"""Create a new battleship game with the specified ships."""
data = flask.request.get_json()
if not data or 'ships' not in data:
return flask.jsonify({'error': 'Missing ships data'}), http.HTTPStatus.BAD_REQUEST
success, error = game.create_game(data['ships'])
if not success:
return flask.jsonify({'error': error}), http.HTTPStatus.BAD_REQUEST
return flask.jsonify({}), http.HTTPStatus.OK
The example shows the implementation of create_battleship_game() and how validation and error handling is implemented throughout the solution. I prefer implementing a controller that contains this logic to fully test the error conditions.
Another problem is that errors are reported as return values using a tuple of success, error. I will have a lot of questions for a candidate about this interesting design choice.
Unfortunately, you can’t ask LLMs about this sort of thing. LLMs are notorious for never challenging the user. This is problematic when the user is wrong and steers the implementation in the wrong direction unchallenged.
I call this the Improv Effect where LLMs always says yes to everything the user tells them. This makes LLMs not suitable to learn from them, at least as a single source of information.
The Design
The solution is provided in a models.py module. Besides some Enums for constants, the module implements three classes: Ship, Board, Game.

The Game class contains an instance of Board, which in turn contains instances of Ship. At first glance, I see no problem with this design. Let’s take a closer look at their implementation.
Inefficient Implementation
The first think I notice in models.py is the definition of the Ship class as a dataclass:
@dataclass
class Ship:
"""Represents a ship on the battleship board."""
x: int
y: int
size: int
direction: Direction
hits: Set[Tuple[int, int]] = field(default_factory=set)
def __post_init__(self):
if isinstance(self.direction, str):
self.direction = Direction(self.direction)
Ship should be a class and not a data structure because the parameters to initialize a Ship instance (x, y, size, direction) are not needed once the Ship pieces are calculated. Experience with code smells tell me this is likely to lead to some other of problems.
You can see in .__post_init__() that nothing happens other than normalizing the direction parameter, which we have already determine it is not needed if declared as enum.StrEnum.
The problem lies with .get_coordinates(). The method calculates the coordinates occupied by the ship object based on the construction parameters and returns them as a list, but the result is not stored anywhere:
@dataclass
class Ship:
"""Represents a ship on the battleship board."""
# Members and initialization hidden, see above
def get_coordinates(self) -> List[Tuple[int, int]]:
"""Get all coordinates occupied by this ship."""
coords = []
for i in range(self.size):
if self.direction == Direction.HORIZONTAL:
coords.append((self.x + i, self.y))
else: # VERTICAL
coords.append((self.x, self.y + i))
return coords
This hints at a major problem where the coordinates are going to be recalculated every time they are needed. Let’s take a look at the full implementation of Ship:
@dataclass
class Ship:
"""Represents a ship on the battleship board."""
x: int
y: int
size: int
direction: Direction
hits: Set[Tuple[int, int]] = field(default_factory=set)
def __post_init__(self):
if isinstance(self.direction, str):
self.direction = Direction(self.direction)
def get_coordinates(self) -> List[Tuple[int, int]]:
"""Get all coordinates occupied by this ship."""
coords = []
for i in range(self.size):
if self.direction == Direction.HORIZONTAL:
coords.append((self.x + i, self.y))
else: # VERTICAL
coords.append((self.x, self.y + i))
return coords
def is_within_bounds(self, board_size: int = 10) -> bool:
"""Check if the ship is within the board boundaries."""
for x, y in self.get_coordinates():
if x < 0 or x >= board_size or y < 0 or y >= board_size:
return False
return True
def overlaps_with(self, other: 'Ship') -> bool:
"""Check if this ship overlaps with another ship."""
my_coords = set(self.get_coordinates())
other_coords = set(other.get_coordinates())
return bool(my_coords & other_coords)
def register_hit(self, x: int, y: int) -> bool:
"""Register a hit at the given coordinates. Returns True if this ship was hit."""
coord = (x, y)
if coord in self.get_coordinates():
self.hits.add(coord)
return True
return False
def is_sunk(self) -> bool:
"""Check if the ship is completely sunk."""
return len(self.hits) >= self.size
def is_hit_at(self, x: int, y: int) -> bool:
"""Check if the ship has been hit at the given coordinates."""
return (x, y) in self.hits
def occupies(self, x: int, y: int) -> bool:
"""Check if this ship occupies the given coordinate."""
return (x, y) in self.get_coordinates()
The method is called over and over again throughout the implementation. The implementation of .get_coordinates() is initialization code. It should be executed at initialization once, and the result stored in a private/internal data member.
Instead, the method is called to check if the ship is in bounds in .is_within_bounds(), it is called when a hit is registered in .register_hit(), it is called to check if the ship occupies a specific coordinate in .occupies(), and it’s called, not once but twice, in .overlaps_with() to check if two Ship instances overlap each other.
The biggest size of a Ship object is 4 squares, but this will be a huge problem with a large dataset. You need to review the code because LLMs not always considered the most efficient implementation.
“Claude Opus 4.6” (and probably any other LLM) fails to solve the problem efficiently. Consider the LLM implementation of SQLite which was measured to be 20,000+ times slower than the actual implementation.
Let’s continue with our review.
Initializing the Game
The game is initialized using an array of Ship definitions sent through the API. From that, a Board is added to the Game and Ship instances are added to the Board. Ideally, all the initialization will happen while iterating through the array once, but that’s not what happens in the solution provided.
Let’s take a look at all the places where the solution iterates through all the ships, starting with the Board class:
class Board:
"""Represents the battleship game board."""
def __init__(self, size: int = 10):
self.size = size
self.ships: List[Ship] = []
def add_ship(self, ship: Ship) -> None:
"""Add a ship to the board."""
self.ships.append(ship)
def validate_ships(self) -> Tuple[bool, Optional[str]]:
"""
Validate all ships on the board.
Returns (is_valid, error_message).
"""
# Check if all ships are within bounds
for ship in self.ships:
if not ship.is_within_bounds(self.size):
return False, "Ship falls outside of board boundaries"
# Check for overlapping ships
for i, ship1 in enumerate(self.ships):
for ship2 in self.ships[i + 1:]:
if ship1.overlaps_with(ship2):
return False, "Ships overlap"
return True, None
def get_ship_at(self, x: int, y: int) -> Optional[Ship]:
"""Get the ship at the given coordinates, if any."""
for ship in self.ships:
if ship.occupies(x, y):
return ship
return None
def is_within_bounds(self, x: int, y: int) -> bool:
"""Check if coordinates are within the board."""
return 0 <= x < self.size and 0 <= y < self.size
.validate_ships() iterates through all the ships to check they are in bounds. Then, it iterates through them twice seeking for overlaps. This is a a naive exponential algorithm. With a larger dataset, this will cause a huge performance problem.
Let’s take a look at the implementation of Game:
class Game:
"""Manages the battleship game state."""
def __init__(self):
self.board: Optional[Board] = None
def create_game(self, ships_data: List[dict]) -> Tuple[bool, Optional[str]]:
"""
Create a new game with the given ships.
Returns (success, error_message).
"""
board = Board()
for ship_data in ships_data:
ship = Ship(
x=ship_data['x'],
y=ship_data['y'],
size=ship_data['size'],
direction=ship_data['direction']
)
board.add_ship(ship)
is_valid, error = board.validate_ships()
if not is_valid:
return False, error
self.board = board
return True, None
def process_shot(self, x: int, y: int) -> Tuple[Optional[ShotResult], Optional[str]]:
"""
Process a shot at the given coordinates.
Returns (result, error_message).
"""
if self.board is None:
return None, "No game in progress"
if not self.board.is_within_bounds(x, y):
return None, "Shot is out of bounds"
ship = self.board.get_ship_at(x, y)
if ship is None:
return ShotResult.WATER, None
# Check if ship was already sunk before this hit
was_already_sunk = ship.is_sunk()
# Register the hit
ship.register_hit(x, y)
if was_already_sunk:
# Hitting an already sunk ship returns HIT
return ShotResult.HIT, None
if ship.is_sunk():
return ShotResult.SINK, None
return ShotResult.HIT, None
def delete_game(self) -> None:
"""Delete the current game."""
self.board = None
In .create_game(), the implementation iterates through all the ship definitions to create the Ship objects, adding them to the board. Arguably, This is the only loop/iteration required, if you implement initialization where it belongs (in __init__()).
This loop should create the Ship objects, adding them to the board. The validation in board.validate_ships() should be an internal implementation detail of the Board class that happens during initialization. At the same time, the board can verify that the ship is in bounds and if there are any overlaps with other ships previously added.
The dataset in this program is very small, and the performance impact might be negligible, but you should always remember the LLM implementation of SQLite when generating production code.
Tell, Don’t Ask Violations
The problem statement in the README.md says:
The goal of this exercise is to evaluate your object oriented programming skills. The solution should have an adequate object oriented design, and demonstrate good separation of concerns.
With that in mind, let’s take a look at the implementation of Game.process_shot():
def process_shot(self, x: int, y: int) -> Tuple[Optional[ShotResult], Optional[str]]:
"""
Process a shot at the given coordinates.
Returns (result, error_message).
"""
if self.board is None:
return None, "No game in progress"
if not self.board.is_within_bounds(x, y):
return None, "Shot is out of bounds"
ship = self.board.get_ship_at(x, y)
if ship is None:
return ShotResult.WATER, None
# Check if ship was already sunk before this hit
was_already_sunk = ship.is_sunk()
# Register the hit
ship.register_hit(x, y)
if was_already_sunk:
# Hitting an already sunk ship returns HIT
return ShotResult.HIT, None
if ship.is_sunk():
return ShotResult.SINK, None
return ShotResult.HIT, None
The method begins by checking the .board instance is not None. This is required because the Game class allows an empty instantiation without a Board.
An empty Game is needed because the instance is created as a global variable in api.py before the endpoint crate_battleship_game() can be invoked. A better options would’ve been to declare the variable as None and create the Game instance in crate_battleship_game(), that way you can enforce that a Game always contains a Board eliminating this check.
Next, the .process_shot() implementation checks if they specified coordinates are within the board bounds. This logic belongs in the Board class as an implementation detail of a potential self.board.process_shot(). The method should also communicate of any errors preferably through exceptions.
This violates the “Tell, Don’t Ask” principle where objects should tell their members what to do, and not ask for their state.
The implementation continues by retrieving the Ship instance at the specified coordinate. This forces the check to see if an instance was retrieved or not to indicate ShotResult.WATER if no ship was found. this is another violation of `Tell, Don’t Ask”.
Even worse is the fact that Game.process_shot() implements all the business logic related to a ship being hit:
- First, it retrieves the state of the
Shipby calling.is_sunk(). - Then it
.register_hit(). - Finally decides what the return value should be based on the
Shipinitial state and the new state.
All this logic belongs in the Ship class itself and should be part of the internal implementation of .register_hit().
Other Problems
Some of the design decisions I can agree with given a good explanation. LLMs are not trained to push back because that will require them to think, and it’s important to understand that LLMs don’t think. You still have to do the thinking to steer the LLMs in the right direction, so if you see something you don’t like, tell them to change it.
Let’s take a look at some of the other problems.
Error Handling
The solution uses return values to report errors in the form of a tuple. The tuple contains success and error components and the implementation sets one or the other accordingly.
Given that all these errors should halt execution, the best way to report them is using exceptions. A Controller class can take care of handling exceptions and crafting the responses.
That controller class can be the
Gamesince most of its functionality is in the wrong place.
Incorrect Data Structures
Not only the .get_coordinates() method is invoked, needlessly multiple times, but it also returns a Python list when a set will work better.
Check how the .overlaps_with() method of the Ship class converts the coordinates list to a set to find overlaps.
def overlaps_with(self, other: 'Ship') -> bool:
"""Check if this ship overlaps with another ship."""
my_coords = set(self.get_coordinates())
other_coords = set(other.get_coordinates())
return bool(my_coords & other_coords)
LLMs are more likely to use list than set because list is used more often, so they appear more frequently in the training data. LLMs also favor older versions of libraries and frameworks for the same reason.
Lack of Idiomatic Python
Even when the solution uses many modern Python features, it fails to leverage Python idioms. It uses types, dataclass, even Enum, some of those incorrectly, but the code doesn’t feel like true Python.
Imagine the following snippet:
class Game:
"""Manages the battleship game state."""
def __init__(self, ships_data: List[dict]):
self._board = Board([self._from_data(ship_data) for ship_data in ships_data])
def _from_data(self, ship_data: dict) -> Ship:
return Ship(
x=ship_data['x'],
y=ship_data['y'],
size=ship_data['size'],
direction=ship_data['direction']
For this snippet to work, initialization has to be implemented in __init__() methods. Also, the business logic has to be implemented in the correct class. This is where the implementation fails. The classes seem correct, but they don’t implement the right responsibilities.
Not Everything is Bad
Not everything was horrible. For starters, it found a bug in the feature tests where one of the steps had a typo, and it corrected it.
It also made the right decision to using a set to keep track of the hits for a Ship. This eliminates duplicate hits and perfectly tracks when the Ship is sunk.
The full implementation, including a planning stage, took around 2 minutes. This saves a lot of time given that we used to give candidates 4 hours to complete the exercise. Even if you spend additional cycles fixing all the problems, you’ll still be more productive.
Conclusion
LLMs can boost your productivity writing code, but you have to be very careful. They are suitable to solve small, targeted, and concise problems. They don’t work well for general ideas, and they definitely don’t work to teach you how things should be done.
When using LLMs to write your programs, make sure that you tell them to write unit tests. Better yet, set up a Test-Driven Development workflow with your agents.
Evaluate the quality of the tests. Make sure they are readable and properly structured. Verify they are validating the requirements and including edge cases.
Look for inefficiencies. Anything that smells bad will come back to hurt you. Spend time refactoring the code through the LLM, which is another reason why unit tests are crucial to success.
Check for idiomatic code in your programming language. It will depend on when the model was trained as to the set of features it will use. You can still guide it to use best practices and modern idioms. Do not allow bad code to sneak into your code base.
Even through iterations you can gain higher productivity. Code reviews by agents can catch obscure problems that often go unnoticed, and sometimes they catch hidden bugs in your code.
Spend more time thinking about the problems, defining them better into well crafted prompts, and enhancing your agents to follow coding standards and best practices.
Spend more time reviewing code. That’s the natural evolution of a software engineer. As you become more experienced, you’ll spend more time reviewing code and mentoring junior engineers. You have to do the same with LLMs, and you must hold them to the same standards.