Continuous Integration for Startups: What makes a good testing plan?

The earlier you catch defects, the cheaper they are to fix - Dave Farley, Continuous Integration

The promise of continuous integration is great: you can save a lot of time in the long run by putting in a little bit of upfront effort to validate and test your code. It makes absolute sense on the face of it, but when you’ve got one-hundred things to do each day, putting it into practice at the beginning of a project is not so easily done. When you’re hyper-focused on growth and delivery, taking time to consolidate what you’ve already built can feel like an inefficiency.

Continuous integration (CI) is not a new concept, but it's far from familiar territory for most developers. Unless you've previously worked in a small team or on a project from conception, it's likely that you've interacted more often with the consequences of CI than with its setup. To most people, CI is a background side effect, not a process. Consequently, when you next find yourself working on a project in the early stages of development, questions about CI loom large.

In an attempt to answer some of these questions, this post covers aspects of starting a useful CI process for a startup or new project:

  • Getting started with CI
  • Types of testing + testing priorities
  • Enacting a CI plan
  • Expanding and consolidating as you grow

Note that we only cover Continuous Integration; Continuous Delivery is worthy of a dedicated piece of its own.

Setting the stage

Enacting and maintaining a good CI process has a few objectives:

  • Readability - can the code be read and understood?
  • Validity - does it work as expected?
  • Performance - are outputs repeatable, dependable, timely and unburdensome?
  • Robustness - does it work at scale and is it secure?

For more lax codebases (codebases not intended to be used by paying customers, for example) robustness and performance can be skipped with some degree of safety. Testing validity and readability is almost always necessary, even if you’re writing a hacky script you expect to never have to come back to - quick hacks have a habit of returning to us with a vengeance.

The CI journey should roughly address the objectives in order. Without readable code, a team cannot collaborate on it effectively and nobody, not even the author, can really understand it. Without validated code, you should consider the product not to work as required - maybe it does, but you're flying blind. Without performance checks, the system might work as expected in principle, but the experience for the end user may be degraded to a fatal point. Without robustness, a system can be prone to failure under stress, intentional or otherwise; this stress can have dire consequences for the users who depend on you to provide a service and protect their data.

When you start a new project, it's tempting to go straight into the guts of it - having hundreds of lines of code flow straight from your head to the screen is a great feeling, as it's probably the only time you get to emulate the keyboard-mashing, deep-in-the-zone cliché of a Hollywood programmer. But as implied by the opening quote from Dave Farley, myopic outlooks towards testing cause more pain in the long run. Take some time at the start to avoid having to spend a lot of time later.

Before doing anything, it’s worth evaluating how much needs to be done. How much testing and validation a new codebase needs depends on how serious the consequences would be if things went wrong. Is the code going into production? Is it user facing? Will other people use it? “Other people” may be end users, it may be your engineering team, or it may be you in 6 months’ time. In the latter case, don’t try to convince yourself that your code doesn’t need to be validated because you’ll always understand it. You won’t.

The needs of your project continually change under both internal and external pressure, which requires your CI strategy to be constantly revised. New projects can get away with a bit of unit testing, but if you have paying customers you need to ensure you're providing the service they're expecting to receive for their money.

Almost without exception, every project can benefit from some form of CI, and that process would ideally start before the first line of code is written. The work required to test and validate code which has no existing benchmark is time-consuming and mind-numbing. Don’t get into a CI hole, or you won’t ever want to take the time to get yourself out of it. If you are already in the hole, don’t panic.

The best time to start a CI workflow is at project start. The second best time is now. - Ancient Chinese proverb, probably

Types of Testing

Complex systems have a lot of scope for failure, so need to be validated from a number of different approaches. The core types of testing you will enact as part of a CI process are:

  • Static testing. Analysis of code without running it. Code style validation is the foundational static testing method which should be part of all testing programmes.
  • Unit testing. Atomic elements of a codebase are tested independently. Each test should run quickly.
  • Integration/E2E testing. Parts of the codebase run correctly when interacting as they would in a production-like environment. Can be end-to-end, or logically isolated subsections of your system.
  • Snapshot testing. Testing that the UI matches an expected view. Most applicable for stable UIs.
  • Performance testing. How the system runs. Comparing time, memory and CPU usage to benchmarks.
  • Security testing. Validating the security of data flowing through the system. Security can be statically analysed, but more serious tests analyse a live system.
  • Production Monitoring. Checking that your live service is working as expected.

Static tests, such as code linters, are easy to set up and run, so can easily be a part of your system from the beginning. Common checks are code formatting and typing (for dynamically typed languages like Python). You can expand checks to also validate parts of the code like docstrings and naming conventions, but these can become overly restrictive, particularly for fast-growing codebases, which slows down the pace of development.

Unit testing - testing a single, isolated aspect of your code - is the foundation of any testing regime. Good unit tests should validate not just the common, normal uses and edge cases, but negative cases too - that bad input is handled correctly. Never overestimate the sanity of a user: somebody will type 1/0 into your calculator app, so there had better be a unit test covering that behaviour. As you fix bugs, it's a good habit to cover each fix with a unit test so that it cannot be silently reintroduced.
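
As a minimal sketch of this - the divide function and test names below are hypothetical stand-ins for your own code - a pytest-style unit test file covering the normal case, a negative case and the bad-input case might look like this:

```python
# Hypothetical example: normal, negative and bad-input cases for one small unit.
import pytest


def divide(numerator: float, denominator: float) -> float:
    """Stand-in for real application code."""
    if denominator == 0:
        raise ValueError("Cannot divide by zero")
    return numerator / denominator


def test_divide_typical_case():
    assert divide(6, 3) == 2


def test_divide_negative_numbers():
    assert divide(-6, 3) == -2


def test_divide_by_zero_raises():
    # The "somebody will type 1/0" case: bad input must fail loudly and predictably.
    with pytest.raises(ValueError):
        divide(1, 0)
```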

Just having unit tests which pass is not sufficient, as it tells you nothing about the quality of the tests or where the gaps in your test cases are. Code coverage, the percentage of lines of code being tested, is a necessary but not sufficient metric for validating test quality: you can get 100% test coverage while not testing any of the common edge cases in your system. Although it's not a perfect metric, there is some correlation between code coverage and test utility, so it's still worth setting a high coverage requirement. Any decent test framework will have a feature to fail the tests if coverage does not pass a threshold.

Integration testing is testing multiple sections of your codebase together in a more accurate mimicry of the live production system. The challenge with integration testing is that it typically takes longer to run than unit tests and requires other resources to be spun up. For this reason, integration tests are sometimes run less frequently than unit tests - for example, just before code gets deployed into production.

Whereas unit tests will rely on mocking behaviour from other parts of the system, integration tests use the system as-is to validate the real flow of information. The benefit of integration testing is that you make no assumptions about how parts of the system act. For example, maybe your unit tests mock a bad response to a database call as returning a null value, but in actuality it raises an error.
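
To make the distinction concrete, here is a hedged sketch - the get_user_name function and the database client are hypothetical - of how a mocked assumption can drift from reality:

```python
# Hypothetical sketch of the assumption gap between unit and integration tests.
from unittest.mock import MagicMock


def get_user_name(db, user_id):
    record = db.fetch(user_id)
    return record["name"] if record else None


def test_get_user_name_missing_unit():
    # Unit test: we *assume* a missing row comes back as None...
    db = MagicMock()
    db.fetch.return_value = None
    assert get_user_name(db, 42) is None


# ...whereas the real database client might raise a "not found" error rather
# than returning None, invalidating the assumption baked into the mock above.
```

An integration test exercising the same path against a real or production-like database would catch that mismatch, because nothing about fetch() is mocked away.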

Any system which relies on external services, such as databases or APIs, should run integration tests; you can get away with unit tests, but you're running on borrowed time until some version update or new feature breaks all of your existing test assumptions in a spectacular way.

Starting a CI plan

When you start a project, you have one, small codebase being worked on by one person, which is fine to manage. You know what each bit of code does and you can read every line. Sure, you try to write unit tests as you go along, but you’re mostly focused on building and growing. Before long, that one, small codebase becomes ten, large codebases in different languages, being worked on by different developers at different times. Now you have a problem.

Before you get into this state, it's crucial to formalise the expectations for testing and CI for each codebase. You need an answer to which checks should be run on which types of codebase, how and how often they should be run, and how strongly their results should be enforced. The idea of formalising something can be anathema to startups, but it doesn't mean that you have to write a large tome containing detailed guides and testing plans; your team is full of smart people, they just need to see the boundaries of what is expected and they will work within them.

Readability

The static checks to be run on each codebase should be outlined, along with a whitelist of libraries to use for each language, so that codebases are held to a consistent standard. For example, for Python codebases we use black to format code, isort to format imports, mypy to validate type hinting and flake8 to validate coding standards. Those four simple, quick checks ensure the code is held to a high standard and catch a lot of small issues that manual review doesn't.
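
As an illustration of what these catch - the snippet below is hypothetical - mypy flags the following mistake before the code is ever run, while black and isort would quietly normalise the formatting and import order:

```python
# Hypothetical snippet: mypy reports "Incompatible return value type
# (got 'int', expected 'str')" here without executing anything.
def get_user_id(user: dict[str, int]) -> str:
    return user["id"]
```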

The specific coding style you validate your code against doesn't really matter, so long as you only have one style. It can be jarring to go from one codebase to another, or god forbid one file to another, and have to get used to an entirely different visual style and flow. When in doubt about what style to follow, just copy Google [1].

The gold standard is that there should never be failed CI builds for code formatting. Tools like pre-commit enforce style checks at every commit, but if you don’t want to force this onto each developer you can introduce a CI step to format code on a developer’s behalf, rather than failing and requiring the developer to intervene.

Readability checks are easy to carry out, so they should not slow down the cadence of delivery. If they do, it might be that there are disputes in the team about formatting - something which developers love to argue over - which necessitates a formalisation of coding standards. Maybe developers are taking a while to act on CI errors, in which case you would be best served by automating the formatting process as outlined above, taking the onus off the developer.

Over-validating code against standards can also introduce latency. For example, you could automatically verify that function names begin with a verb and that docstrings follow an expected format, but your delivery speeds will suffer. While you can always introduce more checks and standards, their practical value very rapidly attenuates to zero as your system becomes increasingly restricted.

Beware of process rigidity in this stage of code validation: readable code is the foundational element of a good codebase, but that means nothing if features and bug fixes are not getting delivered. Perhaps above all other steps in a CI process, readability is the most easily sacrificed to push through a hotfix or a large release. The consequences of doing so, however, should be made very clear. Poor code should not remain poor for long.

Validity

All codebases which make it past the PoC phase should be accompanied by unit tests, with coverage validation. If it isn’t tested you can consider it broken, and you can’t ship broken code. Although you should be aiming for ~100% coverage, this is often difficult to achieve in practice due to services which are difficult to mock or decouple from the rest of the codebase. A rule of thumb for an okay starting level is around 80%, which should incrementally build over time. It should again be emphasised that coverage is not the arbiter of good code - don’t let the metric become a target. 100% coverage does not mean that your tests are good, and 70% does not mean that they are bad.

Performance and Robustness

Security testing is all too easily forgotten, but it is the one you will be most glad to have in place when it matters.

At minimum, the code should be statically evaluated for known security vulnerabilities. There are a number of free, open-source utility tools available, such as semgrep [2], which can be easily introduced into a CI process. Additionally, it’s a good idea to monitor codebase dependencies for vulnerabilities: third party weaknesses are a major cause of security incidents because they present a large attack surface while being hard to evaluate individually. If you’re on GitHub, the easiest way is to make use of Dependabot [3]. It’s free, so you have no excuse to not use it.

The need to test performance is highly dependent on the context of your system. If you're developing a browser extension, for example, it's imperative to test CPU usage for typical and extreme uses, otherwise nobody will want to use your tool. If you're deploying some backend service in the cloud, testing execution time will help you to keep cloud costs low and results generated promptly for users.

At minimum, these tests should be thresholded against some magic number you've determined, a breach of which you never want to deploy. Ideally, though, you will compare the performance to previous versions - this allows you to identify when a small inefficiency has been introduced, making it easier to debug and fix. If you only test against the maximum value, trying to identify which specific changes caused the breach is a much more complicated and entangled task.
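
A minimal sketch combining both approaches is below; the process_batch function, the 2-second ceiling and the baseline file are all hypothetical and would be replaced by your own workload and recorded numbers:

```python
# Hypothetical performance check: a hard ceiling plus a comparison to the
# previously recorded baseline, so regressions are caught early and attributably.
import json
import time


def process_batch(items):
    return [item * 2 for item in items]  # stand-in for real work


def test_process_batch_performance():
    start = time.perf_counter()
    process_batch(list(range(100_000)))
    elapsed = time.perf_counter() - start

    # Absolute threshold: the "magic number" you never want to breach.
    assert elapsed < 2.0

    # Relative threshold: compare against the last recorded run, which makes it
    # far easier to pin down the change that introduced a small inefficiency.
    with open("perf_baseline.json") as f:
        baseline = json.load(f)["process_batch_seconds"]
    assert elapsed < baseline * 1.2  # allow ~20% noise between runs
```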

CI Execution

Your CI suite should be quick to fail, or you will be adding a great deal of latency to a process that should be agile. If a test suite takes 30 minutes to pass, your team will be less likely to comply with it and you may as well not have tests. CI jobs should run checks sequentially in order of execution speed: typically static checks, then unit tests, then end-to-end tests, then performance tests. As your unit test suite grows, it's a good idea to separate tests into quick and slow groups.
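
One lightweight way to do this in a Python codebase - sketched below with placeholder tests - is to mark the slow tests and run them in a separate, later CI stage:

```python
# Hypothetical split of a test suite into quick and slow groups using pytest markers.
import pytest


def test_parse_config_quick():
    assert True  # fast unit test: runs on every push


@pytest.mark.slow
def test_full_pipeline_slow():
    assert True  # heavier test: runs in a later, less frequent CI stage
```

With the slow marker registered in your pytest configuration, an early CI stage can run pytest -m "not slow" and a later stage pytest -m slow, so the quickest feedback arrives first.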

How frequently you test depends on your branching scheme (the pros and cons of each are a story for another time). If you always merge features directly into master then you only have one opportunity to validate the code - the pull request - and so must run every test at that time. If you make use of a staging or dev environment, then you can run the lightweight checks - static analysis and quick unit tests, for example - on pull requests into dev, and leave the heavier workload for pull requests from dev into master. This allows individual features to be developed, iterated and merged quickly, even if they carry some issues. The team lead can then work on validating the new release independently of the new features under development, using a slower-running, more involved CI check - slow unit tests, end-to-end tests, performance tests.

Monitoring your system in production is another vital requirement of any productionised codebase, but needs a post of its own for the topic to be fully explored. Briefly: focus on the core elements of usability. Is the system available? What's the latency between action and result? The system should be polled frequently throughout the day, with sensible alert thresholds and channels. For example, latency for a simple request averaging 50ms over a 10-minute period, when the lifetime average is 10ms, should trigger an alert. Where those alerts get sent is equally important. Emails to the lead developers allow you to keep a log, and a notification sent into a Slack channel gives widespread and rapid visibility, allowing a number of people to quickly tackle the problem.
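
As a rough sketch of such a probe - the endpoint, threshold and Slack webhook URL below are hypothetical and would come from your own monitoring setup - a scheduled job could do something like this:

```python
# Hypothetical latency probe: poll a health endpoint, average the latency,
# and raise a Slack alert if it breaches the agreed threshold.
import time

import requests

HEALTH_URL = "https://api.example.com/health"            # hypothetical endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"   # hypothetical webhook
LATENCY_ALERT_SECONDS = 0.05                             # e.g. alert above a 50ms average


def probe_once() -> float:
    start = time.perf_counter()
    response = requests.get(HEALTH_URL, timeout=5)
    response.raise_for_status()
    return time.perf_counter() - start


def check_latency(samples: int = 10) -> None:
    average = sum(probe_once() for _ in range(samples)) / samples
    if average > LATENCY_ALERT_SECONDS:
        requests.post(
            SLACK_WEBHOOK,
            json={"text": f"Latency alert: {average * 1000:.0f}ms average"},
            timeout=5,
        )
```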

Testing that a series of actions can be taken within your deployed system is more involved and may require third-party tooling to do effectively. It is a necessary investment in the early-ish stages of development, but simple performance metrics should be prioritised as they tell you most of what you need to know for far less setup cost.

Expanding and Consolidating

With that covered, you’ve implemented a good, initial CI process. You have solid static analysis, a nice suite of unit tests with decent coverage, and some end-to-end tests. It runs on every pull request, so you’re less likely to merge and ship broken code. You think everything’s going well.

The problem is that if you keep building and keep growing, you’re never really sure of what level of test coverage you truly have. What’s more, the people who wrote the code are now different to the people who wrote the tests, who are now different to the people building new features. The build times get longer, but are you getting more confident about your system’s reliability?

As with a house, it's a good idea to spring clean your codebase to keep it functioning. Every few sprints, put a little pause in developing new features to take stock of what you have. Take the time to refactor the codebase and in doing so remove those hundreds of unnecessary lines of code. Take time to improve coding standards. Most importantly, take time to actually look at your tests and manually evaluate the coverage of edge cases. When multiple people are working on a system, it's all too easy to test the same feature multiple times, which adds chaff to the testing framework and gives you undue confidence.

As stated before, test coverage is an imperfect measure, and can be disastrous when treated as a target rather than a metric. One method for evaluating your test utility is mutation testing. In mutation tests, small elements of your code are mutated (for example, starting a loop from index 1 instead of 0), which evaluates the robustness of tests. If tests still pass after many mutations, then they don’t actually test that the key functionality is running as it should.
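
In practice a tool such as mutmut automates this for Python, but a hand-rolled illustration makes the idea clear; the sum_positive function and its deliberately weak test below are hypothetical:

```python
# Hypothetical illustration of a mutation a tool might apply and a test that survives it.
def sum_positive(values):
    total = 0
    for value in values:      # a mutant might start this loop from index 1,
        if value > 0:         # or flip ">" to ">=", and so on
            total += value
    return total


def test_sum_positive_weak():
    # This test survives the "start from index 1" mutant, because the first
    # element is negative and contributes nothing anyway - a sign the test
    # isn't really pinning down the behaviour we care about.
    assert sum_positive([-1, 2, 3]) == 5
```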

How frequently spring cleaning sprints should be executed is not something that can be answered universally. At the early stages of development, you won't see as much use for it, as refactors will just get invalidated by the following week's spaghetti feature hack, and the problems of an inefficient testing regime have not compounded enough to become a significant drain. While regular review is not necessary for an early-stage product, take the opportunity offered in the immediate wake of a big release to focus on CI and testing: it's a natural release of pressure and urgency, and cleaning in preparation for the next iteration will make your lives easier. An added, if a little cynical, benefit is that after a couple of weeks of testing your team will be begging to build features again.

As things get larger - and hopefully more stable - efficiency becomes the name of the game, so CI and testing reviews should take on a more regular cadence. Once a quarter leaves little time for problems to grow into significant beasts, while not being so frequent as to turn your team into full-time testers.

As your products land on stable functionality, the needs of your users naturally shift from new features to reliability, and your CI processes should adapt to fit this. For example, UIs should be tested for consistent views on all major browsers and a range of screen types. Snapshot testing (as it’s known) is less relevant to startups trying to find product/market fit as the UI will change frequently and you’re more narrowly focused on a specific cohort of users (problems on Edge? Use Chrome then).
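
If you do reach the point where snapshot tests make sense, the idea itself is simple; the hand-rolled sketch below (with a hypothetical render_homepage function and snapshot path) shows the shape of it, though dedicated tooling handles updates and diffs far better:

```python
# Hypothetical snapshot test: render output once, store it, fail if it drifts.
from pathlib import Path


def render_homepage() -> str:
    return "<h1>Welcome</h1>"  # stand-in for real UI rendering


def test_homepage_matches_snapshot():
    snapshot = Path("snapshots/homepage.html")
    rendered = render_homepage()
    if not snapshot.exists():
        snapshot.parent.mkdir(parents=True, exist_ok=True)
        snapshot.write_text(rendered)  # first run records the snapshot
    assert rendered == snapshot.read_text()
```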

With growth and scale, you need to incrementally expand the range of tests you automate as your systems face new challenges. The system should begin to be tested under the extremes of usage. This isn't necessary at the start when nobody's using your products, but once you see the hockey stick, 50 users can very quickly become 5000. At scale, "unexpected behaviours" become a daily occurrence.
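
A hedged sketch of what that can look like in Python is below, using Locust for load testing; the endpoint, wait times and user behaviour are hypothetical placeholders for your own system:

```python
# Hypothetical load test: simulated users hammering one endpoint via Locust.
from locust import HttpUser, between, task


class TypicalUser(HttpUser):
    wait_time = between(1, 3)  # simulated think time between requests

    @task
    def search(self):
        self.client.get("/search?q=example")  # hypothetical endpoint
```

Run against a staging environment with steadily increasing user counts, this kind of test shows where the system starts to degrade long before real users find out.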

In brief

The recipe for a strong CI and testing framework is a short and easy one. Start early, focus on the foundations, build it incrementally, and take time to evaluate what you've got.

Resources

[1] https://google.github.io/styleguide/

[2] https://semgrep.dev/

[3] https://github.com/dependabot
