Abstract
Alibaba’s Qwen team officially released the Qwen3.5 small model family on March 2, 2026. The lineup includes four lightweight models with parameter sizes of 0.8B, 2B, 4B, and 9B. According to the original source, Elon Musk praised the model on X for its impressive intelligence density, which helped draw wider attention to the release.
Qwen3.5 small models are built on a unified core architecture. They adopt native multimodal design, optimized network structures, and enhanced reinforcement learning mechanisms. The series delivers strong performance while keeping computing requirements relatively low. This makes it suitable for private deployment, edge devices, local development, offline office systems, and enterprise intranet environments.
This guide explains the core strengths, model specifications, benchmark results, hardware requirements, deployment framework selection, and practical deployment steps for Qwen3.5. It compares three mainstream inference frameworks: vLLM, LMDeploy, and Ollama. It also provides environment setup commands, model download examples, service startup methods, and OpenAI-compatible API calling samples.
The goal is to help individual developers, enterprise teams, and IoT practitioners deploy Qwen3.5 small models safely and efficiently in private environments.
1. Introduction
Lightweight large language models have become increasingly important in recent years. Enterprises want private AI services that can run inside internal networks. Developers want local models that do not depend on public cloud APIs. IoT and edge computing scenarios also require models that can run on limited hardware.
Large frontier models are powerful, but they usually require expensive GPUs, high memory capacity, and cloud-based infrastructure. This makes them difficult to use in offline, private, or cost-sensitive environments.
Small models solve part of this problem. A well-designed small model can run on consumer hardware while still providing useful reasoning, text generation, visual understanding, and agent capabilities. It may not replace ultra-large models in all scenarios, but it can offer a better balance between performance, cost, privacy, and deployment flexibility.
On March 2, 2026, Alibaba Cloud’s Qwen team released the Qwen3.5 small model lineup. The series includes four parameter sizes:
Qwen3.5-0.8B
Qwen3.5-2B
Qwen3.5-4B
Qwen3.5-9B
All models share the same underlying architecture. They support native multimodal capabilities, upgraded network structures, and improved reinforcement learning. The series is designed for different hardware levels, from mobile devices and IoT terminals to personal computers and enterprise workstations.
Qwen3.5 is especially attractive for private deployment. By deploying the model locally or inside a private network, enterprises can keep business data under internal control. They can also reduce dependency on external APIs, improve latency stability, and control long-term operating costs.
This guide focuses on practical deployment. It covers model features, framework selection, hardware requirements, environment setup, inference service startup, and API usage.
2. Core Capabilities and Technical Advantages of Qwen3.5 Small Models
2.1 Model Positioning and Basic Specifications
The Qwen3.5 small model family covers four sizes. Each model targets a different deployment scenario.
Qwen3.5-0.8B and Qwen3.5-2B are ultra-lightweight models. Both adopt a 24-layer network structure. They are designed for IoT devices, mobile phones, embedded hardware, and low-end edge terminals. These models are suitable for simple voice interaction, lightweight text processing, local assistants, and low-power inference tasks.
Qwen3.5-4B is a stronger lightweight model. It has 32 layers and a hidden dimension of 2560. It is positioned as a foundational multimodal agent model. It can run on consumer-grade PCs and workstations. It is suitable for local office automation, document processing, general AI assistants, and small-scale agent workflows.
Qwen3.5-9B is the highest-capability model in the small model lineup. It has 32 layers, a hidden dimension of 4096, and an FFN size of 12288. It can run smoothly on Mac devices and high-performance consumer hardware. According to the original source, its overall performance is close to some 120B-level open-source models, while its parameter size is only about one-thirteenth of such large models.
This makes Qwen3.5-9B a strong choice for users who need higher performance but still want manageable deployment costs.
2.2 Benchmark and Multimodal Performance
Qwen3.5 performs well across several benchmark tasks.
The 9B version reaches around 82.5 points on MMLU-Pro. It also scores above 70 on MMMU and above 78 on MathVision. These results show strong general reasoning, multimodal understanding, and mathematical vision capabilities.
For visual tasks, the 0.8B and 2B versions outperform traditional lightweight multimodal models on MathVista and OCRBench. This is important because many small models struggle with visual reasoning and OCR-related tasks.
A major technical difference is Qwen3.5’s early fusion multimodal architecture. Many lightweight multimodal models use a separate visual encoder connected to a text model. Qwen3.5 instead performs unified modeling for text, images, and videos at the underlying network level.
This design improves cross-modal understanding. It also makes the model more efficient when handling tasks that combine text and visual information.
The entire Qwen3.5 series supports a maximum context window of around 260,000 tokens. This is highly useful for long-document processing, codebase analysis, historical log inspection, and enterprise knowledge tasks.
2.3 Advanced Network Architecture
Qwen3.5’s performance comes from several key architectural improvements.
The first is a hybrid design that combines Gated DeltaNet with sparse Mixture-of-Experts (MoE) layers. According to the original source, Gated DeltaNet layers and sparse MoE layers are arranged at a ratio of 3:1. This structure improves long-sequence attention efficiency and helps reduce unnecessary computation.
The second is the on-demand activation mechanism. During inference, the model activates only sub-networks related to the current task. It does not run the full model for every request. This reduces computational load and improves response latency.
The third is the support for two working modes: thinking mode and quick response mode. Users can choose deeper reasoning for complex tasks or faster response for interactive scenarios. This gives developers more control over the trade-off between quality and speed.
Together, these designs make Qwen3.5 more practical for private deployment. The models are not only smaller. They are also optimized for real inference efficiency.
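The two working modes described above can be selected per request. The sketch below only builds the request arguments, so it runs without a server. It assumes that Qwen3.5 keeps the Qwen3-style enable_thinking chat-template switch and that the serving framework forwards chat_template_kwargs; neither is confirmed by the original source. The helper name build_chat_kwargs is hypothetical.

```python
from typing import Any

# Sampling defaults taken from the parameters recommended in this guide.
BASE_SAMPLING = {"temperature": 1.0, "top_p": 0.95}


def build_chat_kwargs(model: str, prompt: str, thinking: bool) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **BASE_SAMPLING,
        # ASSUMPTION: Qwen3.5 exposes the same `enable_thinking` switch as Qwen3.
        # `extra_body` is how the OpenAI Python client forwards non-standard fields.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


# Quick response mode for interactive traffic, thinking mode for harder tasks.
quick = build_chat_kwargs("Qwen/Qwen3.5-2B", "Summarize this log line.", thinking=False)
deep = build_chat_kwargs("Qwen/Qwen3.5-2B", "Prove that 17 is prime.", thinking=True)
print(quick["extra_body"], deep["extra_body"])
```

With a running server, the result would be passed as client.chat.completions.create(**quick). If the switch has a different name in Qwen3.5, only the extra_body line needs to change.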
2.4 Open Ecosystem
All Qwen3.5 variants are available on mainstream model platforms, including Hugging Face and ModelScope. The release covers base models, chat models, and multimodal models.
Popular inference frameworks such as Ollama, vLLM, and LMDeploy have also completed adaptation work. This gives developers multiple deployment options based on hardware, performance needs, and engineering complexity.
Some developers have already deployed quantized versions of Qwen3.5-2B on iPhones for real-time visual question answering. This shows the model family’s strong compatibility with end-side deployment.
3. Why Choose Private Deployment?
Private deployment is valuable for teams with data security, compliance, latency, and cost requirements.
First, private deployment keeps business data inside the internal network. Sensitive user data, internal documents, financial records, logs, and business instructions do not need to be sent to public cloud APIs. This is important for finance, government, healthcare, legal, and enterprise internal systems.
Second, private deployment gives teams full control over service behavior. They can manage concurrency, latency, model versions, access permissions, and logging policies. They are not limited by external platform rate limits or temporary service fluctuations.
Third, private deployment can reduce long-term costs. After the initial hardware investment, teams do not need to pay per-token fees for every request. For high-frequency internal usage, this can produce major cost savings.
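The cost argument can be made concrete with a break-even estimate. The following is a minimal sketch, not a pricing model: every number in the example (hardware price, token volume, API price, running cost) is a made-up placeholder, and the function name break_even_months is hypothetical.

```python
import math
from typing import Optional


def break_even_months(
    hardware_cost: float,
    monthly_tokens: float,
    api_price_per_million: float,
    monthly_running_cost: float,
) -> Optional[int]:
    """Months until a one-time hardware purchase beats per-token API fees.

    Returns None when self-hosting never pays off, i.e. when the monthly
    running cost (power, hosting, maintenance) is at least the API bill.
    """
    monthly_api_bill = monthly_tokens / 1_000_000 * api_price_per_million
    monthly_saving = monthly_api_bill - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)


# Placeholder inputs: 3000 hardware, 2 billion tokens per month,
# 0.5 per million tokens via an API, 250 per month to run the server.
print(break_even_months(3000, 2_000_000_000, 0.5, 250))
```

The same function returns None for low-volume workloads, which is the case where a public API remains the cheaper option.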
This deployment model is especially suitable for:
enterprise internal assistants
offline office systems
private knowledge bases
financial document analysis
government intranet services
IoT and edge terminals
local coding assistants
log and report analysis systems
4. Comparison of vLLM, LMDeploy, and Ollama
For Qwen3.5 private deployment, three mainstream inference frameworks are commonly used:
vLLM
LMDeploy
Ollama
Each has different strengths.
| Framework | Core Scenario | Performance Feature | Hardware Requirement | Deployment Difficulty | Rating |
|---|---|---|---|---|---|
| vLLM | High-concurrency online services | PagedAttention; up to 24x the throughput of Hugging Face Transformers, per vLLM's published benchmarks | Multi-GPU recommended | Medium | ★★★★★ |
| LMDeploy | Edge devices and domestic hardware | W4A16 quantization; VRAM reduction of over 90% reported by the original source | Supports Ascend NPU and low-end GPUs | Simple | ★★★★ |
| Ollama | Local development and privacy use | One-click deployment and easy model switching | CPU and low-end GPU supported | Very simple | ★★★ |
4.1 vLLM
vLLM is the preferred framework for high-concurrency production services.
Its core advantage is PagedAttention, which improves memory management and inference throughput. It supports dynamic batching and multi-GPU parallel inference. It is also compatible with OpenAI API standards, making it easy to connect with existing client code.
vLLM is suitable for enterprise services, internal platforms, and online AI applications with many concurrent requests.
Its limitations are also clear. It is not the most memory-efficient option, and its support for some quantization strategies is not as flexible as LMDeploy's.
Choose vLLM when throughput and concurrency are the top priorities.
4.2 LMDeploy
LMDeploy is strong in memory optimization and hardware compatibility.
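Its W4A16 quantization stores weights in 4 bits while keeping activations in 16 bits, so weight memory drops to roughly a quarter of the FP16 size. The original source cites overall VRAM savings of more than 90%; actual savings depend on model size, KV cache settings, and context length. LMDeploy also supports domestic Ascend NPU chips, making it suitable for teams using localized hardware stacks.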
LMDeploy is friendly to low-end GPUs and edge deployment. It is also easier to configure than vLLM in many scenarios.
Its inference speed is usually slightly lower than vLLM's, but its memory efficiency is better.
Choose LMDeploy when VRAM is limited, quantization is required, or domestic hardware support is important.
4.3 Ollama
Ollama is the easiest framework to use.
It supports Windows, macOS, and Linux. It allows users to run models locally with simple commands. It is also convenient for switching between multiple models on one device.
Ollama is ideal for personal development, local testing, privacy-sensitive usage, and rapid function verification.
However, its throughput is lower than that of vLLM and LMDeploy. It is not suitable for high-concurrency production services.
Choose Ollama when simplicity matters more than throughput.
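The selection rules in this section can be written down as a small decision function. This is only a sketch of the guidance above. The order in which the conditions are checked is an assumption of this example, and the function name choose_framework is hypothetical.

```python
def choose_framework(
    high_concurrency: bool,
    limited_vram: bool,
    ascend_npu: bool,
    local_dev_only: bool,
) -> str:
    """Map deployment constraints to one of the three frameworks.

    ASSUMPTION: hardware constraints are checked before throughput needs,
    because a framework that cannot run on the host is not an option.
    """
    if ascend_npu:
        return "LMDeploy"  # this guide recommends LMDeploy for Ascend NPU hosts
    if local_dev_only:
        return "Ollama"    # simplicity matters more than throughput
    if limited_vram:
        return "LMDeploy"  # quantization-friendly, better memory efficiency
    if high_concurrency:
        return "vLLM"      # throughput and concurrency are the top priorities
    return "Ollama"        # nothing special required: start with the simplest tool


print(choose_framework(high_concurrency=True, limited_vram=False,
                       ascend_npu=False, local_dev_only=False))
```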
5. Hardware and Software Environment Preparation
5.1 Hardware Requirements
For production deployment, a single GPU with 24GB or more VRAM is recommended. Multi-GPU cluster deployment is preferred for higher concurrency.
System memory should be at least 64GB. A 1TB NVMe SSD is recommended for storing multiple model versions, cache files, logs, and runtime data.
For GPU compatibility, vLLM requires compute capability 7.0 or above. Supported GPUs include V100, T4, A100, and similar models. LMDeploy can run on lower-end GPUs and Ascend NPU through quantization.
The test environment in the original tutorial uses:
GPU: RTX 4090 24GB
CPU: 16-core Intel Xeon Platinum 8352V
Memory: 120GB system RAM
This configuration can stably run all Qwen3.5 small model variants.
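The recommendations in this section can be turned into a quick pre-flight check. The sketch below compares a host description against the thresholds stated above (24GB VRAM, 64GB RAM, compute capability 7.0 for vLLM). The Host class and preflight function are hypothetical names, and the values are entered by hand rather than detected from the machine.

```python
from dataclasses import dataclass


@dataclass
class Host:
    vram_gb: float
    ram_gb: float
    compute_capability: tuple[int, int]  # (major, minor), e.g. (7, 5) for a T4


def preflight(host: Host) -> list[str]:
    """Return warnings for every recommendation the host does not meet."""
    warnings = []
    if host.vram_gb < 24:
        warnings.append("VRAM is below the 24GB recommended for production")
    if host.ram_gb < 64:
        warnings.append("system RAM is below the recommended 64GB")
    if host.compute_capability < (7, 0):
        warnings.append("compute capability is below 7.0, which vLLM requires")
    return warnings


# The tutorial's test machine: RTX 4090 (compute capability 8.9), 24GB VRAM, 120GB RAM.
test_host = Host(vram_gb=24, ram_gb=120, compute_capability=(8, 9))
print(preflight(test_host))
```

A host that fails only the VRAM check may still be usable with a quantized model, which is the case LMDeploy targets.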
5.2 Software Environment
Recommended operating systems include:
Ubuntu 20.04 or above
CentOS 7 or above
Python version requirements:
Python 3.9 to 3.12
vLLM: optimized for Python 3.12
LMDeploy: more stable with Python 3.11
CUDA requirements:
CUDA 11.8 or above
CUDA 12.4 recommended for vLLM compatibility
Conda is recommended for environment isolation and dependency management.
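Before installing a framework, it can help to confirm that the active interpreter falls inside the range listed above. This is a minimal sketch using only the standard library; the version table simply restates this section, and the function name check_python is hypothetical.

```python
import sys

# Version bounds and per-framework preferences as stated in this section.
SUPPORTED = ((3, 9), (3, 12))
PREFERRED = {"vllm": (3, 12), "lmdeploy": (3, 11)}


def check_python(framework: str, version: tuple[int, int] = sys.version_info[:2]) -> str:
    """Classify a (major, minor) Python version for the given framework."""
    low, high = SUPPORTED
    if not (low <= version <= high):
        return "unsupported"
    if version == PREFERRED.get(framework):
        return "preferred"
    return "supported"


print(f"Current interpreter for vLLM: {check_python('vllm')}")
```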
5.3 Model Download
Qwen3.5 models can be downloaded from Hugging Face or ModelScope.
Model file size depends on parameter scale and precision format, such as:
BF16
FP16
INT8
INT4
Small models require far less storage than ultra-large models. This reduces the threshold for private deployment.
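The relationship between parameter count, precision, and file size can be estimated with simple arithmetic. The sketch below assumes the nominal parameter counts from the model names and counts weights only. Real checkpoints will differ somewhat, since actual parameter counts are not round numbers and quantized formats store extra scaling data.

```python
# Bytes needed to store one parameter in each precision format.
BYTES_PER_PARAM = {"BF16": 2.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

# ASSUMPTION: nominal sizes taken from the model names, in billions of parameters.
MODELS = {"Qwen3.5-0.8B": 0.8, "Qwen3.5-2B": 2.0, "Qwen3.5-4B": 4.0, "Qwen3.5-9B": 9.0}


def weight_size_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for name, size in MODELS.items():
    row = ", ".join(f"{p}: {weight_size_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name} -> {row}")
```

Runtime memory is higher than these figures because the KV cache and activations also occupy VRAM, and the KV cache grows with context length and concurrency.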
6. Practical Deployment Tutorial
6.1 Deploy with vLLM
Step 1: Install the Environment
Create a dedicated Conda environment and install dependencies:
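conda create -n vllm python=3.12 -y
conda activate vllm
pip install vllm
# Optional, for CUDA 12.4: vLLM already installs the PyTorch build it depends on,
# so pin torch manually only if that build does not match your CUDA setup.
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu124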
Step 2: Download the Model
Use ModelScope to download the model locally:
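from modelscope import snapshot_download

model_dir = snapshot_download(
    'Qwen/Qwen3.5-2B',
    cache_dir="/root/autodl-tmp/models"
)
# Use the printed path in the serve commands. Depending on the ModelScope
# version, the directory name on disk may differ from the model ID.
print(f"Model path: {model_dir}")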
Step 3: Start the Service
Single-GPU deployment:
vllm serve /root/autodl-tmp/models/Qwen/Qwen3.5-2B --port 8000
Multi-GPU deployment:
vllm serve /root/autodl-tmp/models/Qwen/Qwen3.5-2B --port 8000 --tensor-parallel-size 4
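When the service is started from a script or a process supervisor, the same commands can be assembled in Python. The sketch below only builds the argument list; the helper name vllm_serve_command is hypothetical, while the flags it emits (--port, --tensor-parallel-size, --served-model-name) are existing vLLM options.

```python
import shlex
from typing import Optional


def vllm_serve_command(
    model_path: str,
    port: int = 8000,
    tensor_parallel_size: int = 1,
    served_model_name: Optional[str] = None,
) -> list[str]:
    """Build the argument list for `vllm serve`."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    cmd = ["vllm", "serve", model_path, "--port", str(port)]
    if tensor_parallel_size > 1:
        # Shard the model across this many GPUs on the same host.
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if served_model_name:
        # Optional alias; clients must then pass this name as `model`.
        cmd += ["--served-model-name", served_model_name]
    return cmd


cmd = vllm_serve_command("/root/autodl-tmp/models/Qwen/Qwen3.5-2B", tensor_parallel_size=4)
print(shlex.join(cmd))
# To launch from Python: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when model paths contain spaces.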
Step 4: Call the API
vLLM is compatible with OpenAI-style APIs.
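from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1/",
    api_key="EMPTY"
)
response = client.chat.completions.create(
    # vLLM registers the model under the path passed to `vllm serve`
    # unless --served-model-name is set.
    model="/root/autodl-tmp/models/Qwen/Qwen3.5-2B",
    messages=[
        {"role": "user", "content": "Introduce yourself"}
    ],
    temperature=1.0,
    top_p=0.95,
    # top_k is not an OpenAI parameter, so the client sends it via extra_body.
    extra_body={"top_k": 40}
)
print(response.choices[0].message.content)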
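Because the endpoint speaks plain HTTP and JSON, it can also be called without the openai package, which is useful on locked-down intranet hosts. The sketch below builds the request with the standard library only. It assumes the server was started without a custom served model name, so the model path doubles as the model name, and it sends top_k as a top-level field, which vLLM accepts as an extension to the OpenAI schema. The helper name build_chat_request is hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 2048) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # vLLM extension, not part of the OpenAI schema
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1/",
    "/root/autodl-tmp/models/Qwen/Qwen3.5-2B",
    "Introduce yourself",
)
print(req.full_url)
# With the server running:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```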
6.2 Deploy with LMDeploy
Step 1: Install the Environment
conda create -n lmdeploy python=3.11 -y
conda activate lmdeploy
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "lmdeploy[all]"
# Additional dependencies, needed only on Ascend NPU hosts
pip install dlinfer-ascend
Step 2: Start the Service
lmdeploy serve api_server /root/autodl-tmp/models/Qwen/Qwen3.5-2B --server-port 23333
Step 3: Call the API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:23333/v1/",
    api_key="EMPTY"
)
response = client.chat.completions.create(
    model="/root/autodl-tmp/models/Qwen/Qwen3.5-2B",
    messages=[
        {"role": "user", "content": "Introduce yourself"}
    ],
    temperature=1.0,
    top_p=0.95
)
print(response.choices[0].message.content)
Step 4: Quantization Optimization
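For devices with limited VRAM, use LMDeploy's AWQ tool to quantize the weights to INT4 (W4A16), then serve the quantized directory:
lmdeploy lite auto_awq /root/autodl-tmp/models/Qwen/Qwen3.5-2B --work-dir /data/models/qwen3.5-2b-int4
lmdeploy serve api_server /data/models/qwen3.5-2b-int4 --model-format awq --server-port 23333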
This can significantly reduce memory usage and make deployment easier on constrained hardware.
6.3 Recommended Inference Parameters
A balanced parameter configuration for Qwen3.5 is:
{
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "max_tokens": 2048
}
temperature controls randomness. A value between 0.7 and 1.0 works well for most dialogue and creation tasks.
top_p and top_k control candidate token filtering. They help balance diversity and stability in model outputs.
max_tokens limits the maximum generated length. Adjust it based on your task type and service latency requirements.
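To make the effect of top_k and top_p concrete, the toy function below filters a small hand-written probability table. It is an illustration of the general idea only: real inference engines work on logits over the full vocabulary and differ in implementation details. The token probabilities are invented for the example.

```python
def filter_candidates(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the shortest prefix of them
    whose cumulative probability reaches top_p, and renormalize the rest."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the surviving candidates sum to 1 before sampling.
    return {token: p / mass for token, p in kept}


next_token_probs = {"the": 0.5, "a": 0.25, "an": 0.125, "this": 0.125}
print(filter_candidates(next_token_probs, top_k=3, top_p=0.7))
```

Lower values of either parameter shrink the candidate set and make outputs more predictable; higher values keep more of the tail and increase diversity.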
7. Operations and Maintenance Suggestions
After private deployment, teams need to monitor service health and manage model usage.
Recommended monitoring items include:
GPU memory usage
request latency
token throughput
error rate
concurrent request volume
disk usage
model loading time
API response consistency
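Several of the monitoring items above can be derived from per-request records that the gateway or client already has. The sketch below computes p95 latency, error rate, and token throughput from an in-memory list. The RequestRecord fields and the summarize function are hypothetical names, and the sample data is synthetic.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float
    completion_tokens: int
    ok: bool


def summarize(records: list[RequestRecord], window_s: float) -> dict[str, float]:
    """Aggregate request records collected over a window of window_s seconds."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = sorted(r.latency_s for r in records)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "p95_latency_s": latencies[rank - 1],
        "error_rate": sum(not r.ok for r in records) / len(records),
        "tokens_per_s": sum(r.completion_tokens for r in records) / window_s,
    }


# Synthetic sample: 20 requests over 50 seconds, one of which failed.
records = [RequestRecord(latency_s=float(i), completion_tokens=100, ok=(i != 7))
           for i in range(1, 21)]
summary = summarize(records, window_s=50.0)
print(summary)
```

GPU memory, disk usage, and model loading time need host-level tooling and are not covered by this sketch.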
For multi-model environments, teams should standardize API formats and access rules. Treerouter can be used as a supplementary API gateway for unified multi-model access, with lower-cost options than some direct official services and a simpler way to connect different model endpoints.
Teams should also define service fallback rules. For example, if Qwen3.5-9B is overloaded, some lightweight requests can be routed to Qwen3.5-2B or Qwen3.5-4B. This helps improve overall service stability.
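A fallback rule of this kind can be expressed in a few lines. The sketch below routes lightweight requests down a chain of smaller models when the primary model is at capacity. The concurrency limits, the notion of a lightweight request, and the function name pick_model are assumptions of this example, not part of any framework.

```python
# Order in which lightweight requests fall back when a model is saturated.
FALLBACK_CHAIN = ["Qwen3.5-9B", "Qwen3.5-4B", "Qwen3.5-2B"]


def pick_model(in_flight: dict[str, int], limits: dict[str, int], lightweight: bool) -> str:
    """Choose a model for a new request.

    Heavy requests always stay on the primary model. Lightweight requests go
    to the first model in the chain that still has spare capacity.
    """
    primary = FALLBACK_CHAIN[0]
    if not lightweight:
        return primary
    for name in FALLBACK_CHAIN:
        if in_flight.get(name, 0) < limits[name]:
            return name
    return primary  # everything is saturated, so queue on the primary model


# Hypothetical per-model concurrency limits and current load.
limits = {"Qwen3.5-9B": 8, "Qwen3.5-4B": 16, "Qwen3.5-2B": 32}
load = {"Qwen3.5-9B": 8, "Qwen3.5-4B": 3}
print(pick_model(load, limits, lightweight=True))
```

In practice the in-flight counts would come from the monitoring data, and the limits would be tuned from observed latency under load.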
8. Summary
The Qwen3.5 small model family offers a strong balance between intelligence density, multimodal capability, and deployment flexibility. The four model sizes cover different scenarios:
0.8B and 2B for mobile, IoT, and edge terminals
4B for local agents and office automation
9B for higher-performance private AI services
The series supports native multimodal understanding, long context processing, hybrid network architecture, and multiple inference modes. Its maximum context window of around 260,000 tokens makes it useful for long documents, code repositories, and log analysis.
For deployment frameworks, each option has a clear role:
vLLM: best for high-concurrency production services
LMDeploy: best for quantization, edge devices, and domestic hardware
Ollama: best for personal local testing and quick model switching
By following the standard workflow of environment setup, model download, service startup, and API invocation, developers can quickly build a private Qwen3.5 inference service.
As lightweight multimodal models continue to evolve, Qwen3.5 is likely to be used in more IoT terminals, enterprise intranet systems, offline office tools, and local AI workflows. For teams that care about data security, controllable latency, and long-term cost, mastering private deployment of small models is becoming an important engineering capability.




