The Road to 400 FPS

From 60 to 400 FPS

I’ve been working on a 2D isometric roguelike / tower defense hybrid for about a year with another developer. Recently we made some changes that improved our FPS from 60 to over 400 in builds and over 120 in the editor. I’d like to share the improvements we made in the hopes that they help others. At a high level, as with most performance optimizations, these changes boiled down to simply doing less work. If the work had to be done and was expensive, we moved it off the main thread.

At a high level:

  • Profile, profile, profile. Without Unity’s “Deep Profile” feature, I would have never been able to make these improvements. Profiling pointed out hotspots and areas with memory pressure.
  • Pathfinding was one of our biggest bottlenecks. Our initial implementation from almost a year ago used async Tasks, so it was already relatively performant. However, we realized that we were doing a bunch of heavy calculations on the main thread when setting up initial pathfinding state, like randomly avoiding walkable terrain to make our enemies take unpredictable paths. The tasks were converted to Unity’s Job System, and the walkable-terrain avoidance code was moved entirely into the new jobs.
  • LINQ is expensive and generates a ton of garbage. We removed giant swaths of LINQ code and started using “buffers”, or Lists/HashSets that are cleared and rehydrated during Update or similarly frequent method calls. In a lot of classes you’ll see something like the following snippet:
// Used so we don't allocate during Update()
private readonly List<AnxietyComponent> _allyAnxieties = new();
private readonly HashSet<float> _uniqueAnxieties = new();
  • Algorithms are important! If you’re able to reduce a function call from O(n) to O(log n) where n is big, that’s a huge win.
  • Avoid re-allocating heap-based data structures such as classes or collections.
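To make the algorithmic point concrete, here’s a minimal, hypothetical sketch (not from our codebase): if you repeatedly look up values in a large sorted list, List&lt;T&gt;.BinarySearch turns an O(n) scan into an O(log n) lookup.

```csharp
using System.Collections.Generic;

public static class ThresholdLookup
{
    // Sorted list of thresholds (hypothetical example data).
    private static readonly List<int> Thresholds = new() { 10, 25, 50, 100, 250 };

    // O(log n) via List<T>.BinarySearch instead of an O(n) scan.
    public static int IndexOfFirstAtLeast(int value)
    {
        int index = Thresholds.BinarySearch(value);
        // BinarySearch returns the bitwise complement of the next
        // larger element's index when the value isn't present.
        return index >= 0 ? index : ~index;
    }
}
```

The same idea applies to spatial queries - this is why our GridManager backs its positional lookups with a QuadTree rather than scanning every cell.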

Here’s a quick little demo of the shooting aspect of our game:

Project Kaleidoscope Simple


Profiling

This is relatively simple - launch your project with “Deep Profiling” enabled. At points where you see frame spikes, or simply at any arbitrary point, pull up the Hierarchy view and deep-dive into areas that you control.


Pathfinding

We use a relatively unmodified A* pathfinding system in conjunction with hex-based Tilemaps. Tilemap positions can either be free, blocked by the world (rocks / terrain), or blocked by entities (loot chests, temporary blocking objects). Entities can also transition between elevations (z axis) at certain locations on the map. To support this, we’ve wrapped all of this data in a singleton-style class called “GridManager”. Once the world is built, we bake all of the static elements, such as locations blocked by the world and elevation-change triggers, into read-only data structures stored in the GridManager. Dynamic data is stored in thread-safe containers (ConcurrentDictionaries) that are written to only by the main thread. This lets us perform read operations against the GridManager from any thread without the danger of accessing a Unity API off the main thread, which will throw.
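A minimal sketch of that split, assuming hypothetical names and plain C# in place of Unity types (value tuples standing in for Vector3Int):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Sketch of the GridManager's data split; names are assumptions.
public sealed class GridManager
{
    // Baked once after world generation; read-only thereafter,
    // so any thread may read it without locking.
    private readonly HashSet<(int x, int y, int z)> _worldBlocked;
    private readonly Dictionary<(int x, int y, int z), int> _elevationTriggers;

    // Dynamic data: written only by the main thread, but readable
    // from worker threads thanks to the thread-safe container.
    private readonly ConcurrentDictionary<(int x, int y, int z), bool> _entityBlocked = new();

    public GridManager(HashSet<(int x, int y, int z)> worldBlocked,
                       Dictionary<(int x, int y, int z), int> elevationTriggers)
    {
        _worldBlocked = worldBlocked;
        _elevationTriggers = elevationTriggers;
    }

    // Called from the main thread when entities spawn / despawn.
    public void SetEntityBlocked((int x, int y, int z) cell, bool blocked)
        => _entityBlocked[cell] = blocked;

    // Safe to call from a pathfinding job: touches no Unity APIs.
    public bool IsWalkable((int x, int y, int z) cell)
        => !_worldBlocked.Contains(cell)
           && !(_entityBlocked.TryGetValue(cell, out bool blocked) && blocked);

    public bool TryGetElevationChange((int x, int y, int z) cell, out int targetZ)
        => _elevationTriggers.TryGetValue(cell, out targetZ);
}
```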

Our initial implementation of pathfinding had several ThreadLocal caches of the open set, closed set, and world parameters. The open and closed sets held PathNodes that stored their position and f/g/h costs. All of this needed to be migrated to the job system.

We went from this:

private static readonly ThreadLocal<PathNode[,,]> ThreadLocalGridData = new ThreadLocal<PathNode[,,]>();

private static readonly ThreadLocal<BoundsInt> ThreadLocalWorldBounds = new ThreadLocal<BoundsInt>();

private static readonly ThreadLocal<List<PathNode>> ThreadLocalNeighborLists =
    new ThreadLocal<List<PathNode>>(() => new List<PathNode>(12));

private static readonly ThreadLocal<HashSet<PathNode>> ThreadLocalOpenSets =
    new ThreadLocal<HashSet<PathNode>>(() => new HashSet<PathNode>());

private static readonly ThreadLocal<HashSet<PathNode>> ThreadLocalClosedSets =
    new ThreadLocal<HashSet<PathNode>>(() => new HashSet<PathNode>());

private readonly GridManager _gridManager;

to this:

public struct PathfindingJob : IJob, IDisposable
{
    internal static GridManager GridManager;

    public NativeHashSet<Vector3Int> openSet;
    public NativeHashSet<Vector3Int> closedSet;
    public NativeHashMap<Vector3Int, Vector3Int> pathParents;
    public NativeHashMap<Vector3Int, int> gCost;
    public NativeHashMap<Vector3Int, int> hCost;
    public NativeList<Vector3Int> neighbors;
    public NativeHashSet<Vector3Int> offsets;
    public NativeList<Vector3Int> path;
}

The g and h costs were moved to their own dedicated data structures, and the f cost is now calculated on the fly from the members of the open set each iteration. PathNodes were removed entirely and replaced with their Vector3Int positions. Notice the static GridManager - we’ve structured our code in such a way that the lifetime of a GridManager and all pathfinding jobs for it will always be in lockstep. This lets us get around the job system’s restriction that you can’t pass managed references into an IJob.
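The on-the-fly f-cost selection looks roughly like this - a hedged sketch using plain collections in place of the Native ones, with hypothetical names:

```csharp
using System.Collections.Generic;

public static class AStarStep
{
    // With g and h in separate maps, f = g + h is computed on the fly
    // while scanning the open set for the cheapest node each iteration.
    public static (int x, int y, int z) LowestFCost(
        HashSet<(int x, int y, int z)> openSet,
        Dictionary<(int x, int y, int z), int> gCost,
        Dictionary<(int x, int y, int z), int> hCost)
    {
        (int x, int y, int z) best = default;
        int bestF = int.MaxValue;
        foreach ((int x, int y, int z) cell in openSet)
        {
            int f = gCost[cell] + hCost[cell];
            if (f < bestF)
            {
                bestF = f;
                best = cell;
            }
        }
        return best;
    }
}
```

This trades the O(1) pop of a priority queue for an O(n) scan, but it removes the PathNode class (and its allocations) entirely, which was the bigger win for us.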

The initial implementation of this job had these Native collections being created for every single pathfinding call, which was relatively cheap, but could be even cheaper! The majority of our pathfinding calculations were being issued from a Pathfollowing component that lives on each enemy. With that knowledge, we were able to initialize each of the Native collections required for the new PathfindingJob once and simply clear and re-use them for each subsequent call. This reduced allocations from one set per pathfinding call to a single allocation per collection over the component’s lifetime.
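The reuse pattern looks roughly like this - a sketch assuming Unity’s Unity.Collections package; the field names and capacities are illustrative, not our real code (it won’t compile outside a Unity project):

```csharp
using Unity.Collections;
using UnityEngine;

// Sketch: allocate once with Allocator.Persistent, clear per request.
public class Pathfollowing : MonoBehaviour
{
    private NativeHashSet<Vector3Int> _openSet;
    private NativeHashSet<Vector3Int> _closedSet;
    private NativeList<Vector3Int> _path;

    private void Awake()
    {
        _openSet = new NativeHashSet<Vector3Int>(64, Allocator.Persistent);
        _closedSet = new NativeHashSet<Vector3Int>(64, Allocator.Persistent);
        _path = new NativeList<Vector3Int>(64, Allocator.Persistent);
    }

    private void RequestPath()
    {
        // Clearing re-uses the existing backing memory - no new allocation.
        _openSet.Clear();
        _closedSet.Clear();
        _path.Clear();
        // ... populate the PathfindingJob with these collections and schedule it ...
    }

    private void OnDestroy()
    {
        // Persistent allocations must be disposed manually.
        if (_openSet.IsCreated) _openSet.Dispose();
        if (_closedSet.IsCreated) _closedSet.Dispose();
        if (_path.IsCreated) _path.Dispose();
    }
}
```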

I mentioned earlier that some pathfinding requests want to randomly avoid walkable positions. This is supported by a QuadTree that the GridManager provides access to. Initially, the QuadTree was being queried on the main thread and the results fed into the closed set of that particular pathfinding job. According to Unity’s profiler, this was taking up 15% of the frame time! That’s a lot. We moved this functionality into the PathfindingJob, like so:

private void SetupInitialClosedSetFromWalkablePositions()
{
    IRandom random = SystemRandom.Instance;
    foreach (Vector3Int cell in GridManager
                    .GetWalkableCellsInRange(GridManager.GridToWorld2D(startPosition),
                        rangeForRandomDisallowedWalkablePositions).Where(_ =>
                        random.NextFloat() < walkablePositionDisallowedChance))
    {
        _ = closedSet.Add(cell);
    }
}

One of the challenges with this was that the QuadTree needs a conversion function from its element type to a Vector2. In our case, we were using the Tilemap’s CellToWorld. However, this can’t be called off the main thread. The solution? Fully enumerate the tilemap’s bounds and store all of the Vector3Int -> Vector2 data in a Native collection, like so:

_gridToWorld = new NativeHashMap<Vector3Int, Vector3>(totalSize, _handle);
Grid grid = GetGrid();
foreach (Vector3Int cell in size.allPositionsWithin)
{
    _gridToWorld[cell] = grid.CellToWorld(cell);
}

And with that, all of the pathfinding logic was moved off of the main thread.


Avoid LINQ

This gets thrown around a lot on the internet, but I’ll repeat it - avoid LINQ if you can. Unity’s profiler identified many areas where we were using LINQ during Update, which generates a ton of garbage. Pretty much all LINQ can be replaced with for loops or, if you’re using OrderBy, a custom IComparer&lt;T&gt; with List&lt;T&gt;.Sort.
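For example, a hot targets.OrderBy(t => t.distance).First() - which allocates an enumerator and a closure and sorts the whole list - can become a single allocation-free scan. This is a hypothetical illustration, not our actual targeting code:

```csharp
using System.Collections.Generic;

public static class TargetSelection
{
    // Allocation-free replacement for OrderBy(...).First():
    // a single O(n) scan for the minimum. Assumes a non-empty list.
    public static (string name, float distance) Nearest(
        List<(string name, float distance)> targets)
    {
        (string name, float distance) best = targets[0];
        for (int i = 1; i < targets.Count; i++)
        {
            if (targets[i].distance < best.distance)
            {
                best = targets[i];
            }
        }
        return best;
    }
}
```

Pair this with the buffer pattern from earlier - a reused List that's cleared each Update - and a frame that used to allocate on every query becomes garbage-free.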


Algorithms

Work smarter, not harder.

The world that the player finds themselves in is procedurally generated. The generation is quite modular, allowing us to configure a multitude of passes that shape the produced world in different ways. One of our recent features is wiggly bounds:

Wiggly World Bounds

The initial implementation of this feature was relatively naive - it would try to “cut off” a portion of the map by placing a wiggly path along an edge. To ensure that all of the “interesting stuff the player can do” was still reachable, I would pathfind to each area of interest as if the new wiggly boundary existed. Only if all areas could be reached would the wiggly boundary be placed. On the happy path, this approach worked great. For larger maps, where the chance of cutting off a point of interest was higher, it added several seconds to generation time and generally ended in a failed wiggly boundary.

To remedy this, I inverted the order of these operations: pathfinding is run first, between the player’s spawn and each area of interest. These paths, coupled with the boundaries of the points of interest, are then fed into the wiggly boundary pathfinding, ensuring that the boundary can’t cut off a point of interest. This lets us place boundaries without any validation - if a wiggly boundary path is found, it’s guaranteed to be valid.


Heap Based Data Structures and Caches

This one is simple - if you’re using something like a List&lt;T&gt; or a HashSet&lt;T&gt;, see if you can new it only once, either at construction time or in Awake. These can be cleared and re-used for computations throughout your classes.

If you’re calculating the same value over and over again, consider caching it. We wrote a simple TimedCache to auto-invalidate cached values after some time frame, with optional jitter. Jitter helps to reduce “frame spikes” where all of your caches expire at the same tick and need to be rehydrated.
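The post doesn’t show the TimedCache itself, so here’s a hedged sketch of what such a class might look like - the name matches ours, but the API shown here is an assumption:

```csharp
using System;

// Sketch of a time-based cache with expiry jitter; API is assumed.
public sealed class TimedCache<T>
{
    private readonly Func<T> _factory;
    private readonly double _lifetimeSeconds;
    private readonly double _jitterSeconds;
    private readonly Random _random = new();

    private T _value;
    private double _expiresAt = double.NegativeInfinity;

    public TimedCache(Func<T> factory, double lifetimeSeconds, double jitterSeconds = 0)
    {
        _factory = factory;
        _lifetimeSeconds = lifetimeSeconds;
        _jitterSeconds = jitterSeconds;
    }

    // "now" would typically be Time.time in Unity.
    public T Get(double now)
    {
        if (now >= _expiresAt)
        {
            _value = _factory();
            // Jitter spreads expirations out so caches created on the
            // same frame don't all rehydrate on the same later frame.
            _expiresAt = now + _lifetimeSeconds + _random.NextDouble() * _jitterSeconds;
        }
        return _value;
    }
}
```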


That about wraps things up. With these improvements, we went from 60 FPS to 400+. We have some new features in the works that should see some even greater performance increases, but only time will tell. I hope you enjoyed the read!

