Factoring in Data Costs for Your AI Startup


By Ivy Nguyen, Zetta Venture Partners

Data gives AI startups a defensive moat: The more data the startup collects to train an AI model, the better that model will perform, making it difficult for a new entrant to catch up. That data does not come for free, however, and many AI startups see their margins eroded by this additional cost. You might hope to spend less on data as your models improve over time, but it’s unclear how to predict when that will happen and to what degree, making it difficult to model your future growth.

Software startups typically bury product development under research and development (R&D) costs in the P&L. AI startups, by contrast, should account for data costs as part of the cost of goods sold (COGS). Treating data as COGS rather than as R&D will help you identify opportunities to scale up and drive costs down, which increases your margins.
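To see why the classification matters, here is a minimal sketch of how gross margin changes when data costs move from R&D into COGS. All figures and names (`revenue`, `hosting`, `data_costs`) are hypothetical and chosen only for illustration; they do not come from any real company.

```python
def gross_margin(revenue: float, cogs: float) -> float:
    """Return gross margin as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cogs) / revenue


# Hypothetical monthly figures (assumptions, not from the article).
revenue = 100_000.0
hosting = 15_000.0      # serving costs that are already counted as COGS
data_costs = 25_000.0   # acquisition + storage + annotation

# Data costs hidden in R&D: only hosting counts against gross margin.
margin_data_as_rnd = gross_margin(revenue, hosting)

# Data costs treated as COGS: the margin reflects what each sale really costs.
margin_data_as_cogs = gross_margin(revenue, hosting + data_costs)

print(f"Data in R&D:  {margin_data_as_rnd:.0%}")
print(f"Data in COGS: {margin_data_as_cogs:.0%}")
```

Under these assumed numbers, the same business shows a noticeably lower gross margin once data is counted as COGS, which is the figure you would want to track and improve as you scale.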

The Data Value Chain flow chart below shows how most AI startups acquire and use data. First, you record snippets of ground truth as raw data. You store that raw data somewhere and then establish processes or pipelines to maintain and access it. Before you use it in an AI model, you need to annotate the data so the model knows what to do with each data point. The trained model then takes in the data and returns a recommendation, which you can then use to take an action that drives some kind of outcome for the end user. This process can be separated into three distinct steps: acquiring data, storing the data, and annotating the data to train the model. Each step incurs a cost.

Cost of data acquisition

In all data value chains, some kind of sensor (either a physical device or a human being) first needs to collect raw data by capturing observations of reality. The costs of data acquisition therefore come from creating, distributing, and operating the sensor. If that sensor is a piece of hardware, you must consider the cost of materials and manufacturing; if the sensor is a human, the costs come from recruiting them and providing them with the tools they need to make and record the observations. Depending on how broad your coverage needs to be, you may need to pay a significant amount to distribute the sensors. Some use cases may need data collected at a high frequency, which may also drive up the labor and maintenance costs. Audience measurement company Nielsen, for example, faces all of these costs because it both provides the boxes and pays participants to report what they watch on TV. Here, economies of scale drive down the per-unit data acquisition cost, and Nielsen’s data becomes more valuable as its coverage becomes more comprehensive.
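The economies-of-scale point can be illustrated with a rough sketch. It assumes a simple cost model that is not from the article: a one-time fixed cost to design the sensor, plus a per-sensor cost covering manufacturing, distribution, and operation. The function name and all numbers are hypothetical.

```python
def cost_per_data_point(
    fixed_cost: float,
    cost_per_sensor: float,
    n_sensors: int,
    points_per_sensor: int,
) -> float:
    """Average acquisition cost per data point under a simple assumed model.

    fixed_cost:        one-time cost to create the sensor (design, tooling)
    cost_per_sensor:   manufacturing + distribution + operation, per sensor
    points_per_sensor: data points each sensor records over the period
    """
    if n_sensors <= 0 or points_per_sensor <= 0:
        raise ValueError("n_sensors and points_per_sensor must be positive")
    total_cost = fixed_cost + cost_per_sensor * n_sensors
    return total_cost / (n_sensors * points_per_sensor)


# Hypothetical deployments of the same sensor at two different scales.
small = cost_per_data_point(50_000, 100, n_sensors=100, points_per_sensor=1_000)
large = cost_per_data_point(50_000, 100, n_sensors=10_000, points_per_sensor=1_000)

print(f"100 sensors:    ${small:.3f} per data point")
print(f"10,000 sensors: ${large:.3f} per data point")
```

In this model the fixed cost is spread over more data points as the deployment grows, so the per-point cost falls toward the per-sensor floor. Real deployments may behave differently, for example if distribution gets more expensive in harder-to-reach segments.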

In some use cases, you may be able to transfer the work and cost of data acquisition to the end user by offering them a tool to manage their workflow (an automatic email response generator, for example). You then store the data they capture in their work, or observe their interactions with the tool and record those as data. If you choose to distribute these tools for free, the cost of data acquisition will be the cost of your customer acquisition efforts. Alternatively, you might choose to charge for the workflow tool. Depending on how you price, the revenue can offset your data acquisition costs, but charging could also slow and limit customer adoption and, consequently, data acquisition.
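The free-versus-paid trade-off can be framed as a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that charging for the tool reduces adoption; the adoption figures, customer acquisition cost, and price are all hypothetical.

```python
def net_data_acquisition_cost(
    n_customers: int,
    acquisition_cost_per_customer: float,
    price_per_customer: float = 0.0,
) -> float:
    """Customer acquisition spend minus tool revenue (can be negative)."""
    spend = n_customers * acquisition_cost_per_customer
    revenue = n_customers * price_per_customer
    return spend - revenue


# Scenario A (assumed): free tool, broad adoption.
free_customers = 1_000
free_cost = net_data_acquisition_cost(free_customers, 50.0)

# Scenario B (assumed): paid tool, slower adoption, revenue offsets spend.
paid_customers = 400
paid_cost = net_data_acquisition_cost(paid_customers, 50.0, price_per_customer=30.0)

print(f"Free: {free_customers} data sources, net cost ${free_cost:,.0f}")
print(f"Paid: {paid_customers} data sources, net cost ${paid_cost:,.0f}")
```

Under these assumptions the paid tool costs far less in net terms but yields fewer users contributing data, so the right choice depends on how much each additional data source is worth to your model.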

One of my firm’s portfolio companies, InsideSales, for example, offers a platform for sales reps to dial their leads. As the sales reps use the platform, it records the time, mode, and other metadata about the interaction, as well as whether that lead progresses in the sales pipeline. The data is used to train an AI model to recommend the best time and mode of communication to contact similar leads. Here, network effects may increase the usefulness of the tool as more users come onto the platform, which may drive down user acquisition costs.

Read the source article in VentureBeat.