Battleship: An Agentic Coding Experiment

Since the initial public release of ChatGPT back in November 2022, I’ve been following and using this everchanging technology with quite the skepticism. I experience the AI evolution every day, and I admit it has come a long way.

Still, the success stories around it feel unrealistic and overhyped. I constantly hear developers making claims such as, “I don’t write code anymore; I write a good spec and use it with my agents to implement the feature” or “I have a multiple agents working concurrently for me 24/7, and I’ve increased my productivity 10x”.

My productivity has gone up, not by 10x, after a lot of tweaks and experimentation. I spend most of my time crafting prompts, defining features, and reviewing results. These reviews have taught me to be cautious and not trust blindly the results. It’s only through an iterative process that I achieve production ready results.

Sometimes, I feel like an impostor because I never get the result I want in a single prompt. So, I decided to run a test and carefully review and document the results.

Solving Battleship

I had a repo that I used years ago as a coding exercise for engineering candidates. The goal is to solve the Battleship game through the implementation of an API that has been previously defined. It seemed fitting to evaluate the capabilities of modern LLMs to solve this problem

I used Claude Opus 4.6, which at the time was the most advance model from Anthropic, and I gave it the following prompt within the repo:

The README.md file describes the feature that needs to be implemented. Evaluate the file and the project to understand the purpose of the feature and create a plan to resolve it. Once you have a plan create a branch named agent/claude-opus-46 and implement the solution using the plan. You are not allowed to ask questions, so try to figure out the answers for yourself.

You can review the README.md file yourself. I feel like is a good specification of the problem and the requirements to solve it; after all, this is exactly what we gave applicants. I specifically directed the LLM to not ask clarifying questions because candidates were also not allowed.

I put my client in /yolo mode to auto accept all changes and waited to see the results. After a couple of minutes, an implementation was completed. I ran the all the tests and they all passed. I was pleased to see that on top of the acceptance tests that were provided, there were several unit tests verifying the implementation.

Reviewing the Unit Tests

I always start code reviews with the unit tests. We use TDD (Test-Driven Development) in all our projects, so high quality unit tests are crucial. They provide a glance to the design and reveal the code intent.

The unit tests were short and clear. Additional docstrings were provided for further explanation of the requirement.

Even-though it didn’t rename the sample test file (test_sample.py), it did follow the approach of the sample test, and the tests were easy to read and understand.

Diving a little further, the tests showed a couple of problems.

Duplicate Tests

The first problem that I found was with duplicate tests in the TestShip class:

# in test_sample.py

class TestShip(unittest.TestCase):
    """Unit tests for the Ship class."""

    def test_horizontal_ship_coordinates(self):
        """A horizontal ship should occupy consecutive x coordinates."""
        ship = Ship(x=2, y=1, size=4, direction=Direction.HORIZONTAL)
        coords = ship.get_coordinates()
        self.assertEqual(coords, [(2, 1), (3, 1), (4, 1), (5, 1)])

    def test_vertical_ship_coordinates(self):
        """A vertical ship should occupy consecutive y coordinates."""
        ship = Ship(x=7, y=4, size=3, direction=Direction.VERTICAL)
        coords = ship.get_coordinates()
        self.assertEqual(coords, [(7, 4), (7, 5), (7, 6)])

    def test_ship_direction_from_string(self):
        """Ship should accept direction as string."""
        ship = Ship(x=0, y=0, size=2, direction="H")
        self.assertEqual(ship.direction, Direction.HORIZONTAL)

        ship_v = Ship(x=0, y=0, size=2, direction="V")
        self.assertEqual(ship_v.direction, Direction.VERTICAL)

The first two tests verify the creation of a Ship object. The third is testing the same creation using a string for the direction. The direction, according to the API, is always provided as a string. It seems that an Enum is used to define the potential values, which is a good practice since it provides constants and groups the related values.

What bothers me is the fact that two set of tests were provided for the same requirement. Looking into the declaration of Direction.HORIZONTAL and Direction.VERTICAL we can see the problem:

from enum import Enum

# Then further down
#
class Direction(Enum):
    """Ship direction on the board."""
    HORIZONTAL = "H"
    VERTICAL = "V"

The constants are declared as Enum, which provides a grouping of related values. Sounds great! Unfortunately, Enum values do not match their string counter parts:

>>> Direction.HORIZONTAL == "H"
... False

Looking at the initialization of the Ship class, we see the following:

# in model.py

@dataclass
class Ship:
    """Represents a ship on the battleship board."""
    x: int
    y: int
    size: int
    direction: Direction
    hits: Set[Tuple[int, int]] = field(default_factory=set)

    def __post_init__(self):
        if isinstance(self.direction, str):
          self.direction = Direction(self.direction)

We will further discuss the choice to use a dataclass later on, but let’s focus on the .__post__init__() method. This method converts a potential direction passed as string to an instance of Direction of the same value.

An educated guess is that the first two unit tests were added, then when the feature tests failed because strings do not match against plain Enum, the third test was added and fixed in the .__post_init__() method.

What this shows is that the model, confronted with a problem, can work around it, but it not always choose the right solution. The use of StrEnum is the clean fix to the problem:

# in model.py

from enum import StrEnum

class Direction(StrEnum):
    """Ship direction on the board."""
    HORIZONTAL = "H"
    VERTICAL = "V"

This now will work just fine:

>>> Direction.HORIZONTAL == "H"
... True

Let’s keep looking at the unit tests.

Unit Tests Coverage

One of the reasons Test-Driven Development (TDD) works is because it helps you to think about the problem and to find edge and abhorrent cases. There’s an old programming joke that says:

The two most difficult things in programming are: naming things, cache invalidation, and off-by-one errors.

Continuing with the review of TestShip, I find the following two tests:

# in test_sample.py

def test_ship_outside_bounds_horizontal(self):
    """Horizontal ship extending past board edge should be out of bounds."""
    ship = Ship(x=8, y=1, size=4, direction=Direction.HORIZONTAL)
    self.assertFalse(ship.is_within_bounds(10))

def test_ship_outside_bounds_vertical(self):
    """Vertical ship extending past board edge should be out of bounds."""
    ship = Ship(x=5, y=8, size=4, direction=Direction.VERTICAL)
    self.assertFalse(ship.is_within_bounds(10))

The tests are supposed to verify when a Ship is out-of-bounds. In a “10x10” grid where the cell indices are zero-based, the largest value for x and y is 9. There are different ways to test the edge cases, but those in the solution are not the edge cases.

Notice that those are the same scenarios used by the acceptance/feature tests.

Using the same coordinates the model choose, the edge cases would be Ship(x=8, y=1, size=3, direction=Direction.HORIZONTAL) for horizontal and Ship(x=5, y=8, size=3, direction=Direction.VERTICAL) for vertical.

A better and more clear test will be a Ship of size 1 outside of the edge: Ship(x=10, y=1, size=1, direction=Direction.HORIZONTAL) for horizontal and Ship(x=5, y=10, size=1, direction=Direction.VERTICAL) for vertical. This immediately communicates that the size 10 grid is index zero based and removes the need for mental calculations, which is what causes most of off-by-one errors.

Another issue is that there are no lower-bound edge case unit tests. Since the grid is zero-based, unit tests with negative values for either x or y should be tried, but they are missing.

Understand that the code is correct, and it checks correctly for negative values, reporting errors, but there are no tests for those scenarios.

Coincidentally, the same scenarios are missing on the acceptance/feature tests.

The Code Review

The repository provides an api.py module with placeholders for each API that needs implementation. The solution implements these placeholders, and adds a models.py, containing the implementation.

You can see there’s conditional code in api.py. This can affect coverage since handlers are more difficult to test. A closer look reveals that the conditional code happens around validation and error handling:

# in api.py. It show create_battleship_game() as an example.

@app.route('/battleship', methods=['POST'])
def create_battleship_game():
    """Create a new battleship game with the specified ships."""
    data = flask.request.get_json()

    if not data or 'ships' not in data:
        return flask.jsonify({'error': 'Missing ships data'}), http.HTTPStatus.BAD_REQUEST

    success, error = game.create_game(data['ships'])

    if not success:
        return flask.jsonify({'error': error}), http.HTTPStatus.BAD_REQUEST

    return flask.jsonify({}), http.HTTPStatus.OK

The example shows the implementation of create_battleship_game() and how validation and error handling is implemented throughout the solution. I prefer implementing a controller that contains this logic to fully test the error conditions.

Another problem is that errors are reported as return values using a tuple of success, error. I will have a lot of questions for a candidate about this interesting design choice.

Unfortunately, you can’t ask LLMs about this sort of thing. LLMs are notorious for never challenging the user. This is problematic when the user is wrong and steers the implementation in the wrong direction unchallenged.

I call this the Improv Effect where LLMs always says yes to everything the user tells them. This makes LLMs not suitable to learn from them, at least as a single source of information.

The Design

The solution is provided in a models.py module. Besides some Enums for constants, the module implements three classes: Ship, Board, Game.

Model Implementation Class Diagram

The Game class contains an instance of Board, which in turn contains instances of Ship. At first glance, I see no problem with this design. Let’s take a closer look at their implementation.

Inefficient Implementation

The first think I notice in models.py is the definition of the Ship class as a dataclass:

@dataclass
class Ship:
    """Represents a ship on the battleship board."""
    x: int
    y: int
    size: int
    direction: Direction
    hits: Set[Tuple[int, int]] = field(default_factory=set)

    def __post_init__(self):
        if isinstance(self.direction, str):
            self.direction = Direction(self.direction)

Ship should be a class and not a data structure because the parameters to initialize a Ship instance (x, y, size, direction) are not needed once the Ship pieces are calculated. Experience with code smells tell me this is likely to lead to some other of problems.

You can see in .__post_init__() that nothing happens other than normalizing the direction parameter, which we have already determine it is not needed if declared as enum.StrEnum.

The problem lies with .get_coordinates(). The method calculates the coordinates occupied by the ship object based on the construction parameters and returns them as a list, but the result is not stored anywhere:

@dataclass
class Ship:
    """Represents a ship on the battleship board."""

    # Members and initialization hidden, see above

    def get_coordinates(self) -> List[Tuple[int, int]]:
        """Get all coordinates occupied by this ship."""
        coords = []
        for i in range(self.size):
            if self.direction == Direction.HORIZONTAL:
                coords.append((self.x + i, self.y))
            else:  # VERTICAL
                coords.append((self.x, self.y + i))
        return coords

This hints at a major problem where the coordinates are going to be recalculated every time they are needed. Let’s take a look at the full implementation of Ship:

@dataclass
class Ship:
    """Represents a ship on the battleship board."""
    x: int
    y: int
    size: int
    direction: Direction
    hits: Set[Tuple[int, int]] = field(default_factory=set)

    def __post_init__(self):
        if isinstance(self.direction, str):
            self.direction = Direction(self.direction)

    def get_coordinates(self) -> List[Tuple[int, int]]:
        """Get all coordinates occupied by this ship."""
        coords = []
        for i in range(self.size):
            if self.direction == Direction.HORIZONTAL:
                coords.append((self.x + i, self.y))
            else:  # VERTICAL
                coords.append((self.x, self.y + i))
        return coords

    def is_within_bounds(self, board_size: int = 10) -> bool:
        """Check if the ship is within the board boundaries."""
        for x, y in self.get_coordinates():
            if x < 0 or x >= board_size or y < 0 or y >= board_size:
                return False
        return True

    def overlaps_with(self, other: 'Ship') -> bool:
        """Check if this ship overlaps with another ship."""
        my_coords = set(self.get_coordinates())
        other_coords = set(other.get_coordinates())
        return bool(my_coords & other_coords)

    def register_hit(self, x: int, y: int) -> bool:
        """Register a hit at the given coordinates. Returns True if this ship was hit."""
        coord = (x, y)
        if coord in self.get_coordinates():
            self.hits.add(coord)
            return True
        return False

    def is_sunk(self) -> bool:
        """Check if the ship is completely sunk."""
        return len(self.hits) >= self.size

    def is_hit_at(self, x: int, y: int) -> bool:
        """Check if the ship has been hit at the given coordinates."""
        return (x, y) in self.hits

    def occupies(self, x: int, y: int) -> bool:
        """Check if this ship occupies the given coordinate."""
        return (x, y) in self.get_coordinates()

The method is called over and over again throughout the implementation. The implementation of .get_coordinates() is initialization code. It should be executed at initialization once, and the result stored in a private/internal data member.

Instead, the method is called to check if the ship is in bounds in .is_within_bounds(), it is called when a hit is registered in .register_hit(), it is called to check if the ship occupies a specific coordinate in .occupies(), and it’s called, not once but twice, in .overlaps_with() to check if two Ship instances overlap each other.

The biggest size of a Ship object is 4 squares, but this will be a huge problem with a large dataset. You need to review the code because LLMs not always considered the most efficient implementation.

“Claude Opus 4.6” (and probably any other LLM) fails to solve the problem efficiently. Consider the LLM implementation of SQLite which was measured to be 20,000+ times slower than the actual implementation.

Let’s continue with our review.

Initializing the Game

The game is initialized using an array of Ship definitions sent through the API. From that, a Board is added to the Game and Ship instances are added to the Board. Ideally, all the initialization will happen while iterating through the array once, but that’s not what happens in the solution provided.

Let’s take a look at all the places where the solution iterates through all the ships, starting with the Board class:

class Board:
    """Represents the battleship game board."""

    def __init__(self, size: int = 10):
        self.size = size
        self.ships: List[Ship] = []

    def add_ship(self, ship: Ship) -> None:
        """Add a ship to the board."""
        self.ships.append(ship)

    def validate_ships(self) -> Tuple[bool, Optional[str]]:
        """
        Validate all ships on the board.
        Returns (is_valid, error_message).
        """
        # Check if all ships are within bounds
        for ship in self.ships:
            if not ship.is_within_bounds(self.size):
                return False, "Ship falls outside of board boundaries"

        # Check for overlapping ships
        for i, ship1 in enumerate(self.ships):
            for ship2 in self.ships[i + 1:]:
                if ship1.overlaps_with(ship2):
                    return False, "Ships overlap"

        return True, None

    def get_ship_at(self, x: int, y: int) -> Optional[Ship]:
        """Get the ship at the given coordinates, if any."""
        for ship in self.ships:
            if ship.occupies(x, y):
                return ship
        return None

    def is_within_bounds(self, x: int, y: int) -> bool:
        """Check if coordinates are within the board."""
        return 0 <= x < self.size and 0 <= y < self.size

.validate_ships() iterates through all the ships to check they are in bounds. Then, it iterates through them twice seeking for overlaps. This is a a naive exponential algorithm. With a larger dataset, this will cause a huge performance problem.

Let’s take a look at the implementation of Game:

class Game:
    """Manages the battleship game state."""

    def __init__(self):
        self.board: Optional[Board] = None

    def create_game(self, ships_data: List[dict]) -> Tuple[bool, Optional[str]]:
        """
        Create a new game with the given ships.
        Returns (success, error_message).
        """
        board = Board()

        for ship_data in ships_data:
            ship = Ship(
                x=ship_data['x'],
                y=ship_data['y'],
                size=ship_data['size'],
                direction=ship_data['direction']
            )
            board.add_ship(ship)

        is_valid, error = board.validate_ships()
        if not is_valid:
            return False, error

        self.board = board
        return True, None

    def process_shot(self, x: int, y: int) -> Tuple[Optional[ShotResult], Optional[str]]:
        """
        Process a shot at the given coordinates.
        Returns (result, error_message).
        """
        if self.board is None:
            return None, "No game in progress"

        if not self.board.is_within_bounds(x, y):
            return None, "Shot is out of bounds"

        ship = self.board.get_ship_at(x, y)

        if ship is None:
            return ShotResult.WATER, None

        # Check if ship was already sunk before this hit
        was_already_sunk = ship.is_sunk()

        # Register the hit
        ship.register_hit(x, y)

        if was_already_sunk:
            # Hitting an already sunk ship returns HIT
            return ShotResult.HIT, None

        if ship.is_sunk():
            return ShotResult.SINK, None

        return ShotResult.HIT, None

    def delete_game(self) -> None:
        """Delete the current game."""
        self.board = None

In .create_game(), the implementation iterates through all the ship definitions to create the Ship objects, adding them to the board. Arguably, This is the only loop/iteration required, if you implement initialization where it belongs (in __init__()).

This loop should create the Ship objects, adding them to the board. The validation in board.validate_ships() should be an internal implementation detail of the Board class that happens during initialization. At the same time, the board can verify that the ship is in bounds and if there are any overlaps with other ships previously added.

The dataset in this program is very small, and the performance impact might be negligible, but you should always remember the LLM implementation of SQLite when generating production code.

Tell, Don’t Ask Violations

The problem statement in the README.md says:

The goal of this exercise is to evaluate your object oriented programming skills. The solution should have an adequate object oriented design, and demonstrate good separation of concerns.

With that in mind, let’s take a look at the implementation of Game.process_shot():

def process_shot(self, x: int, y: int) -> Tuple[Optional[ShotResult], Optional[str]]:
    """
    Process a shot at the given coordinates.
    Returns (result, error_message).
    """
    if self.board is None:
        return None, "No game in progress"

    if not self.board.is_within_bounds(x, y):
        return None, "Shot is out of bounds"

    ship = self.board.get_ship_at(x, y)

    if ship is None:
        return ShotResult.WATER, None

    # Check if ship was already sunk before this hit
    was_already_sunk = ship.is_sunk()

    # Register the hit
    ship.register_hit(x, y)

    if was_already_sunk:
        # Hitting an already sunk ship returns HIT
        return ShotResult.HIT, None

    if ship.is_sunk():
        return ShotResult.SINK, None

    return ShotResult.HIT, None

The method begins by checking the .board instance is not None. This is required because the Game class allows an empty instantiation without a Board.

An empty Game is needed because the instance is created as a global variable in api.py before the endpoint crate_battleship_game() can be invoked. A better options would’ve been to declare the variable as None and create the Game instance in crate_battleship_game(), that way you can enforce that a Game always contains a Board eliminating this check.

Next, the .process_shot() implementation checks if they specified coordinates are within the board bounds. This logic belongs in the Board class as an implementation detail of a potential self.board.process_shot(). The method should also communicate of any errors preferably through exceptions.

This violates the “Tell, Don’t Ask” principle where objects should tell their members what to do, and not ask for their state.

The implementation continues by retrieving the Ship instance at the specified coordinate. This forces the check to see if an instance was retrieved or not to indicate ShotResult.WATER if no ship was found. this is another violation of `Tell, Don’t Ask”.

Even worse is the fact that Game.process_shot() implements all the business logic related to a ship being hit:

First, it retrieves the state of the Ship by calling .is_sunk().
Then it .register_hit().
Finally decides what the return value should be based on the Ship initial state and the new state.

All this logic belongs in the Ship class itself and should be part of the internal implementation of .register_hit().

Not Everything is Bad

Not everything was horrible. For starters, it found a bug in the feature tests where one of the steps had a typo, and it corrected it.

It also made the right decision to using a set to keep track of the hits for a Ship. This eliminates duplicate hits and perfectly tracks when the Ship is sunk.

The full implementation, including a planning stage, took around 2 minutes. This saves a lot of time given that we used to give candidates 4 hours to complete the exercise. Even if you spend additional cycles fixing all the problems, you’ll still be more productive.

Conclusion

LLMs can boost your productivity writing code, but you have to be very careful. They are suitable to solve small, targeted, and concise problems. They don’t work well for general ideas, and they definitely don’t work to teach you how things should be done.

When using LLMs to write your programs, make sure that you tell them to write unit tests. Better yet, set up a Test-Driven Development workflow with your agents.

Evaluate the quality of the tests. Make sure they are readable and properly structured. Verify they are validating the requirements and including edge cases.

Look for inefficiencies. Anything that smells bad will come back to hurt you. Spend time refactoring the code through the LLM, which is another reason why unit tests are crucial to success.

Check for idiomatic code in your programming language. It will depend on when the model was trained as to the set of features it will use. You can still guide it to use best practices and modern idioms. Do not allow bad code to sneak into your code base.

Even through iterations you can gain higher productivity. Code reviews by agents can catch obscure problems that often go unnoticed, and sometimes they catch hidden bugs in your code.

Spend more time thinking about the problems, defining them better into well crafted prompts, and enhancing your agents to follow coding standards and best practices.

Spend more time reviewing code. That’s the natural evolution of a software engineer. As you become more experienced, you’ll spend more time reviewing code and mentoring junior engineers. You have to do the same with LLMs, and you must hold them to the same standards.