How we used AI to speed up content production by 5X

https://blog.duolingo.com/content-production/

Paul Cheong Aug 04, 2025 · 4 mins read

Most of the content in the Duolingo app is in the form of sentences in the language you're trying to learn. For example, in a Spanish course for English speakers, you may see an exercise that asks you to translate a Spanish sentence into English.

These sentences took a long time to develop manually: around 600 hours per course! To teach more courses and expand our existing ones, we needed to scale more efficiently. We had to rethink our content strategy.

With our new strategy, we wanted to achieve the following goals:

  • Reduce manual content editing and review: we realized early on that manual steps were limiting our scalability and impeding our progress, so we wanted to build a system that minimizes them.
  • Automation via AI: we wanted to embrace the emergence of AI and fully leverage it to automate content production end to end.
  • Modularity: we wanted different parts of the system to iterate and develop independently. We anticipated that AI would keep improving, and different parts of our system with it, and we wanted those improvements applied seamlessly, with minimal disruption to our overall process.

The system we built consists of three components: Generation, Evaluation, and Selection. At a high level, Generation takes pedagogical plans as input and outputs sentence candidates. Evaluation then tags each sentence with evaluation results (e.g. "is logically coherent"). Finally, Selection examines the tags to qualify the subset of generated sentences that satisfies the criteria for a given launch target.

As an example, suppose Generation has produced three sentences. Evaluation has tagged each with results from three evaluators: grammatical correctness, logical coherence, and pedagogical difficulty fit. It has tagged the first sentence as logically incoherent, since it's illogical for someone to run up a fridge, and the third as too difficult for the target pedagogical level, since the learner has not yet learned the word "participate". Selection has then chosen the second sentence, which meets all the requirements.
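
Here's a minimal sketch of that flow in Python. The sentences, tag names, and criteria are illustrative stand-ins for this example, not our actual schema.

```python
# A minimal sketch of Generation -> Evaluation -> Selection on this example.

# Generation: candidate sentences produced from a pedagogical plan.
candidates = [
    "Yo corro encima del refrigerador.",  # illogical: running on top of a fridge
    "Yo corro en el parque.",             # fine: running in the park
    "Yo participo en la carrera.",        # "participate" not yet taught
]

# Evaluation: each evaluator tags every sentence with a result.
tags = {
    candidates[0]: {"grammatical": True, "coherent": False, "difficulty_fit": True},
    candidates[1]: {"grammatical": True, "coherent": True,  "difficulty_fit": True},
    candidates[2]: {"grammatical": True, "coherent": True,  "difficulty_fit": False},
}

# Selection: qualify only the sentences whose tags satisfy every criterion
# for the launch target.
criteria = ("grammatical", "coherent", "difficulty_fit")
selected = [s for s in candidates if all(tags[s][c] for c in criteria)]

print(selected)  # ['Yo corro en el parque.']
```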

The components run asynchronously from one another, and each consists of workflows that drive the state machine forward for each sentence until it is deemed ready (or not) for launch. The only connection between the components is the content storage where sentences and their metadata reside. In other words, we built the orchestration between the components using data; no explicit state-tracking infrastructure was necessary.

Idempotency was key to achieving high throughput. Each workflow periodically wakes up and examines the data storage, which may or may not have been updated since its last execution, to see whether there is work to do. Because the goal is simply to arrive at the target state for each sentence eventually, it doesn't matter how many times we run a given workflow or how many instances run in parallel. This methodology allowed us to build a complex set of workflows without making their management complex.
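
To make that concrete, here's a sketch of what one such workflow could look like, with a toy in-memory stand-in for the content storage. The state names and storage interface are illustrative assumptions, not our actual implementation.

```python
import time
from dataclasses import dataclass, field

# Hypothetical sentence states; the real state machine is not described here.
GENERATED, EVALUATED = "generated", "evaluated"

@dataclass
class Sentence:
    id: int
    text: str
    state: str = GENERATED
    tags: dict = field(default_factory=dict)

class ContentStorage:
    """Toy in-memory stand-in for the shared content storage."""
    def __init__(self, sentences):
        self.rows = {s.id: s for s in sentences}

    def find(self, state):
        return [s for s in self.rows.values() if s.state == state]

    def update_if_state(self, sid, expected, new_state, tags):
        # Conditional update: only advance a sentence that is still in the
        # expected state, so parallel workflow instances can't clobber it.
        s = self.rows[sid]
        if s.state == expected:
            s.state, s.tags = new_state, tags

def evaluation_workflow(storage, evaluators, poll_seconds=60, once=False):
    """Idempotent workflow: wake up, look for work in storage, advance states.
    Running it any number of times, or in parallel, converges on the same
    result, because it only moves sentences toward their target state."""
    while True:
        for sentence in storage.find(state=GENERATED):
            results = {name: ev(sentence.text) for name, ev in evaluators.items()}
            storage.update_if_state(sentence.id, GENERATED, EVALUATED, results)
        if once:
            break
        time.sleep(poll_seconds)  # go back to sleep until the next check

storage = ContentStorage([Sentence(1, "Yo corro en el parque.")])
evaluation_workflow(storage, {"coherent": lambda t: True}, once=True)
print(storage.rows[1].state, storage.rows[1].tags)  # evaluated {'coherent': True}
```

The conditional update is what makes running many instances in parallel safe: a sentence only advances if it is still in the state the workflow expected to find.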

To achieve high quality, we developed numerous evaluators, whose responsibilities range from objective assessments like difficulty level to more subjective ones like logical coherence. These are mostly implemented as AI prompts, developed by our AI Research Engineers and Learning Designers. The evaluators are the most critical part of the system, as they ensure that we provide our learners with high-quality content.
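
As an illustration, a logical-coherence evaluator could be structured like the sketch below. The `call_llm` placeholder stands in for whatever model API is used, and the prompt wording is invented for this example.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (swap in your provider's client)."""
    raise NotImplementedError

def is_logically_coherent(sentence: str) -> bool:
    """One evaluator: a narrow yes/no judgment about a single quality
    dimension, which plays to what LLMs do well: checking output."""
    prompt = (
        "You are reviewing candidate sentences for a language course.\n"
        f"Sentence: {sentence!r}\n"
        "Does this sentence describe a logically coherent, plausible "
        "situation? Answer with exactly YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```

Keeping each evaluator to a single narrow question also makes its judgments easier to validate against human-labeled examples.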

Here are the lessons we learned on our AI journey, some of which reinforced hunches we already had:

  • People are a key component of a successful AI system, because LLMs need carefully curated data to be steered correctly. From lesson-planning data to golden sets for quality evaluation, relying on people for high-quality data made our content generation effective and sped up production (see the golden-set sketch after this list).
  • As many in the industry already know, AI is often better at checking its own output than at generating that output. For example, AI is better at determining whether a sentence follows a certain rule than at generating a sentence that follows that rule in the first place. This is the main reason evaluators are so important in our system.
  • Observability proved crucial for our system. Before we built our observability tooling, with everything automated and running all the time, it was tricky to manage and understand the system's behavior and outputs. For example, at one point the system created an entire French course without using the French words for "and" or "or"! 🤦 Identifying such misbehaviors and implementing proper fixes became a critical part of our automation effort.
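
As a sketch of that first lesson, here's one common way to use a golden set: score an evaluator against human labels and only trust it past a threshold. The data, evaluator, and threshold below are invented for illustration.

```python
# Golden-set check: measure how often an evaluator agrees with human labels
# before trusting it in the pipeline.
golden_set = [
    ("Yo corro en el parque.", True),              # humans: coherent
    ("Yo corro encima del refrigerador.", False),  # humans: incoherent
]

def accuracy(evaluator, golden):
    """Fraction of golden-set items where the evaluator matches the label."""
    return sum(evaluator(text) == label for text, label in golden) / len(golden)

# Gate: only let an evaluator into the pipeline once it agrees with human
# judgment often enough.
toy_evaluator = lambda text: "refrigerador" not in text
assert accuracy(toy_evaluator, golden_set) >= 0.95
```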

So how is the system performing now? So far we have successfully generated and launched fully automated sentences for many of our Spanish courses! We plan to generate content for other languages, including Italian and Chinese, and we look forward to launching more courses to reach more users quickly with the help of AI!

If you’re an engineer who wants to work on interesting problems using AI tools, we’re hiring!