Microsoft’s AI division has introduced its first homegrown models: MAI-Voice-1 and MAI-1-preview.
The flagship speech model, MAI-Voice-1, is built for speed and efficiency: it can generate a full minute of audio in under one second on a single GPU. Microsoft is already putting it to work in features like Copilot Daily, which delivers AI-narrated news briefings, and in podcast-style explainers that break down complex topics.
Users can experiment with MAI-Voice-1 through Copilot Labs, where they can input custom text and adjust the model’s voice and speaking style.
Alongside it, Microsoft debuted MAI-1-preview, a text-based model trained on roughly 15,000 Nvidia H100 GPUs. It’s aimed at handling everyday instructions and queries, offering a look at what Microsoft envisions as the future backbone of Copilot.
Unlike Microsoft’s enterprise-focused AI partnerships, these internal efforts are built squarely for consumers. As AI chief Mustafa Suleyman explained in a Decoder interview last year, “My logic is that we have to create something that works extremely well for the consumer and really optimize for our use case… My focus is on building models that really work for the consumer companion.”
MAI-1-preview is already being tested for text interactions inside Copilot—currently powered by OpenAI’s models—and is also undergoing evaluation on the LMArena benchmarking platform.
Microsoft hinted that this is only the beginning:
“We have big ambitions for where we go next. Not only will we pursue further advances here, but we believe orchestrating a range of specialized models serving different user intents and use cases will unlock immense value.”