Qwen VL: WebUI Bug Prevents Image Recognition
Introduction
This article examines an issue encountered while running the Qwen VL model behind a WebUI: the interface reports that image support is unavailable, even though Qwen VL is a vision-language model designed to process visual inputs. We describe the problem, the steps to reproduce it, and the build and system configuration in which it arises, then discuss possible causes and solutions for both users and developers.
Problem Description & Steps to Reproduce
The central issue is that the WebUI fails to recognize the Qwen VL model as a vision model capable of processing images. To reproduce the issue:
- Load the Qwen VL model into the WebUI. The model used in this case is Qwen3-VL-235B-A22B-Instruct-IQ4_NL-00001-of-00003.gguf.
- Attempt to attach an image as input.
- Observe that, despite Qwen VL being a vision-language model, the WebUI incorrectly reports that image support is unavailable.
A screenshot in the original report shows the WebUI failing to offer image input for the Qwen VL model.
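For background: in llama.cpp's server, image input is only enabled when a multimodal projector file is loaded alongside the language model via the --mmproj flag; without it, the WebUI has nothing to encode images with. Below is a minimal launch sketch. The model path matches the log output later in this article, while the mmproj filename is a hypothetical placeholder, since the report does not show one being loaded:

```bash
# Hypothetical launch command. The model path is taken from the log;
# the mmproj filename is illustrative, not from the original report.
llama-server \
  --model  /home/alerant/models/IQ4_NL/Qwen3-VL-235B-A22B-Instruct-IQ4_NL-00001-of-00003.gguf \
  --mmproj /home/alerant/models/IQ4_NL/mmproj-Qwen3-VL-235B-A22B-Instruct-F16.gguf \
  --host 0.0.0.0 --port 8082
```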
System Information
To provide a comprehensive understanding of the environment in which this issue occurs, here are the key system details:
- Build: 7036 (017eceed6)
- Compiler: cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
- Operating System: Linux
- GGML Backend: CPU
- Hardware: Azure Cobalt
- Model: Qwen3-VL-235B-A22B-Instruct-IQ4_NL-00001-of-00003.gguf
These details matter when checking for compatibility issues: the aarch64 build, the Arm-based Azure Cobalt hardware, and the CPU-only GGML backend all shape how the server, and therefore the WebUI, interacts with the Qwen VL model.
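To confirm which build is actually serving requests, llama-server can print its version string; a quick check, assuming the binary is on the PATH:

```bash
# Print version/build info for the binary actually being run; the output
# should match the "build: 7036 (017eceed6)" line reported above.
llama-server --version
```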
Relevant Log Output
The log output provides valuable insights into the model loading process and the system's configuration. Here's a snippet of the relevant log data:
build: 7036 (017eceed6) with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
system info: n_threads = 64, n_threads_batch = 64, total_threads = 64
system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8082, http threads: 63
main: loading model
srv load_model: loading model '/home/alerant/models/IQ4_NL/Qwen3-VL-235B-A22B-Instruct-IQ4_NL-00001-of-00003.gguf'
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from /home/alerant/models/IQ4_NL/Qwen3-VL-235B-A22B-Instruct-IQ4_NL-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3vlmoe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-Vl-235B-A22B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen3-Vl-235B-A22B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 235B-A22B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 VL 235B A22B Instruct
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-VL-...
llama_model_loader: - kv 13: general.tags arr[str,1] = ["unsloth"]
llama_model_loader: - kv 14: qwen3vlmoe.block_count u32 = 94
llama_model_loader: - kv 15: qwen3vlmoe.context_length u32 = 262144
llama_model_loader: - kv 16: qwen3vlmoe.embedding_length u32 = 4096
llama_model_loader: - kv 17: qwen3vlmoe.feed_forward_length u32 = 12288
llama_model_loader: - kv 18: qwen3vlmoe.attention.head_count u32 = 64
llama_model_loader: - kv 19: qwen3vlmoe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 20: qwen3vlmoe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 21: qwen3vlmoe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: qwen3vlmoe.expert_used_count u32 = 8
llama_model_loader: - kv 23: qwen3vlmoe.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3vlmoe.attention.value_length u32 = 128
llama_model_loader: - kv 25: qwen3vlmoe.expert_count u32 = 128
llama_model_loader: - kv 26: qwen3vlmoe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 27: qwen3vlmoe.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
llama_model_loader: - kv 28: qwen3vlmoe.n_deepstack_layers u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 35: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 36: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 25
llama_model_loader: - kv 41: quantize.imatrix.file str = Qwen3-VL-235B-A22B-Instruct-GGUF/imat...
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-VL-235B-A22...
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 752
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 154
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1131
llama_model_loader: - kv 47: split.count u16 = 3
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type q5_K: 94 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 564 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_NL - 4.5 bpw
print_info: file size = 123.49 GiB (4.51 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3vlmoe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_embd_inp = 16384
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: mrope sections = [24, 20, 20, 0]
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-Vl-235B-A22B-Instruct
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: CPU_Mapped model buffer size = 47637.86 MiB
load_tensors: CPU_Mapped model buffer size = 46780.64 MiB
load_tensors: CPU_Mapped model buffer size = 30307.12 MiB
load_tensors: CPU_REPACK model buffer size = 125313.75 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_ctx is not divisible by n_seq_max - rounding down to 262656
llama_context: n_seq_max = 3
llama_context: n_ctx = 262656
llama_context: n_ctx_seq = 87552
llama_context: n_batch = 1014
llama_context: n_ubatch = 1014
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (87552) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 1.74 MiB
llama_kv_cache: CPU KV buffer size = 25617.94 MiB
llama_kv_cache: size = 25617.94 MiB ( 87552 cells, 94 layers, 3/3 seqs), K (q8_0): 12808.97 MiB, V (q8_0): 12808.97 MiB
llama_context: CPU compute buffer size = 889.73 MiB
llama_context: graph nodes = 6117
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 262656
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 3
slot init: id 0 | task -1 | new slot, n_ctx = 87552
slot init: id 1 | task -1 | new slot, n_ctx = 87552
slot init: id 2 | task -1 | new slot, n_ctx = 87552
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
main: model loaded
This log excerpt details the model loading process: metadata parsing, tensor loading, and context initialization. Key parameters such as n_ctx_train, n_embd, and n_layer give a snapshot of the model's architecture and configuration. One detail stands out: only the language-model GGUF is loaded, and there is no mention of a multimodal projector (mmproj) or of any vision components being initialized, which is consistent with the WebUI concluding that image input is unavailable. The log also shows special tokens such as <|endoftext|>, <|im_end|>, and <|vision_pad|>, the last of which suggests the model family is vision-aware.
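Recent llama.cpp server builds expose the modalities they support via the /props HTTP endpoint, and the bundled WebUI bases its image-input UI on what the server reports. Below is a quick way to see what the running server claims, using the port from the log (8082); the exact JSON shape is from memory and may differ between builds:

```bash
# Ask the running server which modalities it reports; jq is used only
# for readability. The key layout shown is illustrative.
curl -s http://localhost:8082/props | jq '.modalities'
# Expected when vision is active (illustrative):
# {
#   "vision": true,
#   "audio": false
# }
```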
Possible Causes
Several factors might be contributing to this issue:
- WebUI Compatibility: The WebUI might not be fully compatible with the Qwen VL model, particularly in handling vision-language models. This could stem from incorrect configuration or a lack of necessary updates in the WebUI's codebase.
- Model Loading Errors: Although the log output doesn't explicitly show errors, subtle issues during the model loading process might prevent the WebUI from correctly recognizing the model's vision capabilities. Tensor loading and context initialization are critical steps, and any missteps could lead to incomplete functionality.
- Configuration Mismatch: There could be a mismatch between the expected configuration of the Qwen VL model and the actual settings within the WebUI. This could involve incorrect parameters related to image processing or attention mechanisms.
- Dependency Issues: Missing or outdated dependencies in the system environment might affect the WebUI's ability to interact with the model. Libraries related to image processing or tensor manipulation could be causing conflicts.
- Incorrect Model Identification: The WebUI might not be identifying the model as a vision model during initialization. In llama.cpp's setup, the WebUI typically learns the supported modalities from the server rather than from the GGUF metadata directly, so a missing vision component on the server side would propagate to the UI (see the metadata check after this list).
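To narrow down the incorrect-identification hypothesis, the GGUF metadata can be inspected directly to confirm what the file on disk declares. Below is a sketch using the gguf-dump utility from the Python gguf package (one option among several GGUF metadata viewers); the expected output is paraphrased from the log, not captured verbatim:

```bash
# Install the GGUF tooling and dump only the metadata (no tensor data).
pip install gguf
gguf-dump --no-tensors \
  /home/alerant/models/IQ4_NL/Qwen3-VL-235B-A22B-Instruct-IQ4_NL-00001-of-00003.gguf \
  | grep -i 'general.architecture'
# The log shows kv 0 as: general.architecture = qwen3vlmoe
# A companion mmproj file, if present, would declare a separate
# vision/projector architecture in its own GGUF.
```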
Potential Solutions
Addressing this issue requires a systematic approach. Here are some potential solutions to explore:
- Update WebUI: Ensure the WebUI and server are updated to the latest version; newer builds often include bug fixes and improved compatibility with vision-language models like Qwen VL (a rebuild sketch follows this list).
- Verify Model Loading: Double-check the model loading process to ensure all tensors and metadata are loaded correctly. Compare the log output with expected values to identify any discrepancies.
- Adjust Configuration: Review the WebUI's configuration settings related to model processing and ensure they align with the Qwen VL model's requirements. Pay close attention to parameters affecting image input and processing.
- Check Dependencies: Verify that all necessary dependencies are installed and up to date. This includes libraries for image processing, tensor manipulation, and any other relevant components.
- Manual Model Identification: If the WebUI allows, manually specify the model type as a vision model during initialization. In llama.cpp's server, image support additionally depends on the multimodal projector being supplied at launch via --mmproj, so confirm that flag was passed.
- Code Review: Review the WebUI's codebase to identify any potential issues in how it handles vision models. Look for sections related to model identification, input processing, and feature support.
- Community Support: Seek assistance from the community forums or the developers of the WebUI and the Qwen VL model. They may have encountered similar issues and can offer valuable insights and solutions.
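As noted in the first item above, rebuilding from the latest source is often the quickest way to pick up server and WebUI fixes. A standard CPU build sketch, assuming git and CMake are available:

```bash
# Fetch and build the current llama.cpp sources (CPU backend).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Verify the new build string before relaunching the server.
./build/bin/llama-server --version
```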
Conclusion
The WebUI incorrectly reporting that images are unsupported is a significant impediment to using the Qwen VL model's full capabilities. Understanding the problem, its likely causes, and the potential solutions outlined above should help users and developers resolve the bug and restore image input. Continued investigation and collaboration between users and the WebUI's developers will help ensure that vision-language models are correctly recognized. Additional resources, including the model files, are available on Hugging Face.