On June 1, MiniMax published a model called M3 and described it as the first open-weight model to combine frontier-level coding, a one-million-token context window, and native multimodal capability in a single package. The benchmarks are striking: 59 percent on SWE-Bench Pro, ahead of GPT-5.5 and Gemini 3.1 Pro, and just behind Anthropic's Opus 4.7. It processes image and video input natively and can operate a desktop computer. Pricing lands at $0.60 per million input tokens — well below most frontier alternatives.
MiniMax said the weights and technical report would be published within ten days of launch, on Hugging Face and GitHub, for private cluster deployment and fine-tuning. The weights are real, the capability appears to be real, and the pricing is accessible.
What MiniMax has not released is the training code used to produce the model, or the inference operators used to run it.
This distinction matters more than the press coverage has reflected.
"Open-weight" means you receive the checkpoint: the numerical parameters that define the model's behavior. You can download it, run it, fine-tune it, and deploy it on your own infrastructure. What you cannot do, with an open-weight model whose training code has not been released, is reproduce the training run. You cannot audit what data went in. You cannot understand the choices made during post-training — the reinforcement learning from human feedback, the filtering passes, the alignment procedures — that shaped what the model does and doesn't do. You cannot verify the benchmarks independently using the same training pipeline, which matters when the benchmarks are self-reported by the releasing lab.
"Open-source" implies all of that is available: training code, data pipelines, post-training procedures, tooling. Meta's Llama models come closer to this standard, though even they do not release training data. The gap between what is legally licensed and what is operationally reproducible is significant, and it has real consequences for teams that are not large enough to run their own evaluation infrastructure.
For builders, this is not an abstract concern. If a team is making compliance decisions (healthcare data, government contracts, financial regulation) based on a model being described as open, what they actually need to know is whether they can audit training provenance. Open weights do not provide that. If a team is building an application that depends on fine-tuning the model to a specific domain, they need to understand what the base model was trained on, because a fine-tune inherits the base model's training distribution rather than replacing it. And if benchmarks are self-reported without reproducible training pipelines, the numbers carry less weight than they appear to.
None of this makes M3 a bad product. The pricing is real, the capability appears genuine, and for teams that need a capable model they can run on their own infrastructure rather than paying API rates, it is worth evaluating on its own terms. The Chinese AI lab ecosystem (MiniMax alongside Moonshot AI's Kimi-K2.6, Alibaba's Qwen 3.5, and the variants built on DeepSeek V3) is producing meaningful competition on benchmark performance and open availability. For teams building in contexts where API costs are denominated in a strong currency and their revenue is not, a locally run capable model is not an academic exercise. It is a cost control measure. We have written about inference economics and running your own weights as operational decisions; the Chinese open-weight ecosystem is now a real input to those decisions, with pricing and benchmark claims that are worth testing independently.
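To make the cost-control point concrete, here is a minimal sketch of the break-even arithmetic between paying the quoted API rate and carrying a fixed self-hosting bill. Only the $0.60 per million input tokens comes from MiniMax's announcement. The monthly token volume and the self-hosting cost are hypothetical placeholders, the function names are ours, and the sketch ignores output-token pricing, which is not quoted here.

```python
# Price per million input tokens quoted for M3 in the announcement.
API_PRICE_PER_MILLION_INPUT_USD = 0.60


def api_input_cost_usd(
    input_tokens: int,
    price_per_million: float = API_PRICE_PER_MILLION_INPUT_USD,
) -> float:
    """Cost of sending `input_tokens` through the hosted API (input side only)."""
    return input_tokens / 1_000_000 * price_per_million


def break_even_input_tokens(
    monthly_self_host_cost_usd: float,
    price_per_million: float = API_PRICE_PER_MILLION_INPUT_USD,
) -> float:
    """Monthly input-token volume at which API spend equals a fixed self-hosting bill."""
    if price_per_million <= 0:
        raise ValueError("price_per_million must be positive")
    return monthly_self_host_cost_usd / price_per_million * 1_000_000


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the arithmetic.
    monthly_tokens = 2_000_000_000   # assumed monthly input volume
    self_host_bill = 3_000.0         # assumed monthly hardware + ops cost, USD

    print(f"API input spend: ${api_input_cost_usd(monthly_tokens):,.2f}")
    print(
        "Break-even volume: "
        f"{break_even_input_tokens(self_host_bill):,.0f} input tokens/month"
    )
```

Under these assumed figures, two billion input tokens a month costs about $1,200 through the API, and a $3,000 monthly self-hosting bill only pays for itself above roughly five billion input tokens. A real comparison would also need output-token pricing, hardware utilization, and exchange-rate exposure, none of which this sketch models.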
The question is just whether to call it what it is.
The industry's language here has gotten loose in a way that serves the labs and confuses the builders.
The short of it.
MiniMax released M3 on June 1 — claiming frontier coding performance (59% on SWE-Bench Pro), 1M token context, and native multimodality at $0.60 per million input tokens, with weights arriving on Hugging Face within ten days. It is not open source: MiniMax has not released training code or inference operators, meaning the training run cannot be reproduced and the self-reported benchmarks cannot be independently verified via the same pipeline. For builders evaluating open models on compliance or fine-tuning grounds, the difference between open-weight and open-source determines what you actually know about what you're running.