Rspamd: Fixing Incorrect BAYES/FUZZY Statistics Display

Alex Johnson

Introduction

This article covers a bug in Rspamd 3.14.0 in which the WebUI statistics page shows the same total number of learns and hashes on every row of the BAYES and Fuzzy tables, instead of the individual count for each bayes category or fuzzy rule. As a result, the page cannot be used to monitor individual bayes categories and fuzzy rules.

Understanding the Issue

The problem is in the figures the WebUI presents. Instead of the specific count for each bayes category (e.g., phishing, scam, malware) and each fuzzy rule (e.g., local, HTML), every row shows the total number of learns or hashes stored in the Redis server used for these statistics. This is misleading because it hides how much each category or rule contributes to the total.

Prerequisites for Bug Reporting

Before diving into the specifics, it's important to ensure that the necessary steps for reporting bugs have been followed. These include:

  • Reading about bug reporting in general.
  • Enabling relevant debugging logs to gather detailed information.
  • Checking the FAQs about Core files in case of fatal crashes.
  • Trying the ASAN package and obtaining the ASAN report (if possible).
  • Checking that the issue isn't already filed.
  • Confirming that the experimental package or master branch does not already address the issue.

By following these steps, you can ensure that your bug report is comprehensive and helpful for developers to diagnose and fix the problem.

Detailed Bug Description

Symptoms

The primary symptom of this bug is that the Learns and Hashes columns in the BAYES and Fuzzy tables on the statistics page display incorrect counters. Instead of showing the number of learns/hashes specific to each bayes category or fuzzy rule, the WebUI shows the total number of learns/hashes stored in the Redis server.

Steps to Reproduce

To reproduce this bug, follow these steps:

  1. Configure bayes categories and add fuzzy HTML rules in your Rspamd setup.
  2. Open the statistics page in the Rspamd WebUI.
  3. Observe the Learns and Hashes columns in the BAYES and Fuzzy tables. You will notice that the counters display the total values rather than the individual counts for each category/rule.
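Step 3 can also be checked outside the browser. The following is a minimal sketch, assuming the controller's statistics output has been saved as JSON and that it exposes per-statfile counters under `statfiles[].revision` and per-rule counters under `fuzzy_hashes`. These field names are assumptions, not taken from the report, so adjust them to what your instance actually returns. The sample payload is made up for illustration. Identical counters on every row are only a hint that totals are being repeated, not proof.

```python
import json
import sys


def all_rows_identical(counts):
    """Return True when two or more rows all carry the same non-zero counter."""
    values = list(counts.values())
    return len(values) > 1 and values[0] != 0 and len(set(values)) == 1


def check_stat_payload(payload):
    """Report whether the BAYES and fuzzy rows look like one repeated total.

    Assumed (unverified) payload shape:
      payload["statfiles"]    -> list of {"symbol": str, "revision": int, ...}
      payload["fuzzy_hashes"] -> {rule_name: hash_count}
    """
    bayes = {s["symbol"]: s["revision"] for s in payload.get("statfiles", [])}
    fuzzy = dict(payload.get("fuzzy_hashes", {}))
    return {
        "bayes_suspicious": all_rows_identical(bayes),
        "fuzzy_suspicious": all_rows_identical(fuzzy),
    }


# Made-up numbers for illustration only; not taken from a real server.
SAMPLE = {
    "statfiles": [
        {"symbol": "BAYES_PHISHING", "revision": 1200},
        {"symbol": "BAYES_SCAM", "revision": 1200},
        {"symbol": "BAYES_MALWARE", "revision": 1200},
    ],
    "fuzzy_hashes": {"local": 5000, "HTML": 5000},
}

if __name__ == "__main__":
    # Optionally pass a path to a saved JSON file; otherwise use the sample.
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as fh:
            data = json.load(fh)
    else:
        data = SAMPLE
    print(check_stat_payload(data))
```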

Expected Behavior

The Learns and Hashes counters should show the number of learns/hashes for each individual bayes category and fuzzy rule. This would give a precise view of how much each category or rule contributes to the overall statistics, which helps with performance monitoring and tuning.

Environment Information

  • Rspamd daemon version: 3.14.0
  • CPU architecture: x86_64; features: avx2, avx, sse2, sse3, ssse3, sse4.1, sse4.2, rdrand
  • Hyperscan enabled: TRUE
  • Jemalloc enabled: TRUE
  • LuaJIT enabled: TRUE (LuaJIT version: LuaJIT 2.1.1762617240)
  • ASAN enabled: FALSE
  • BLAS enabled: TRUE
  • Fasttext enabled: TRUE
  • Operating System: Rocky Linux 10.0

Configuration Details

fuzzy_check.conf

The fuzzy_check.conf file is configured as follows:

rule "local" {
 servers = "localhost:11335";
 ...
}
rule "HTML" {
 servers = "localhost:11335";
 ...
}
...

This configuration defines two fuzzy check rules, local and HTML. Both point to the same server, localhost:11335.

statistics.conf

The statistics.conf file is configured as follows:

classifier "bayes" {
 name = "bayes_binary";
 tokenizer { name = "osb"; }
 backend = "redis";
 servers = "127.0.0.1:16379";
 ...
}
classifier "bayes" {
 name = "bayes_multi";
 tokenizer { name = "osb"; }
 backend = "redis";
 servers = "127.0.0.1:16379";
 ...
 statfile { symbol = "BAYES_PHISHING"; class = "phishing"; }
 statfile { symbol = "BAYES_SCAM"; class = "scam"; }
 statfile { symbol = "BAYES_MALWARE"; class = "malware"; }
 statfile { symbol = "BAYES_MARKETING"; class = "marketing"; }
 statfile { symbol = "BAYES_ILLEGAL_GOODS"; class = "illegal_goods"; }
 statfile { symbol = "BAYES_IMAGE_ONLY_SPAM"; class = "image_only_spam"; }
 statfile { symbol = "BAYES_BACKSCATTER"; class = "backscatter"; }
 statfile { symbol = "BAYES_COMPROMISED_ACCOUNT"; class = "compromised_account"; }
 ...
}

This configuration defines two bayes classifiers, bayes_binary and bayes_multi. Both use Redis as the backend and connect to the same server, 127.0.0.1:16379. The bayes_multi classifier additionally declares several statfile entries, each associating a symbol with a class (e.g., BAYES_PHISHING with phishing).

Possible Causes and Solutions

The root cause of this issue likely lies in the way the WebUI queries and aggregates the statistics from the Redis backend. Instead of querying for the specific counts associated with each category and rule, it may be querying for the total counts across the entire Redis database.

Potential Solutions

  1. Modify WebUI Query: The most direct solution is to modify the WebUI code to query the Redis backend for the specific counts associated with each bayes category and fuzzy rule. This would involve changing the queries to filter the results based on the category or rule identifier.
  2. Update Statistics Aggregation: Another approach is to modify the statistics aggregation logic to correctly sum the counts for each category and rule before displaying them in the WebUI. This could involve changes to the Rspamd code that handles statistics aggregation.
  3. Verify Redis Data Structure: Ensure that the data structure in Redis is correctly storing the counts for each category and rule. If the data is not being stored in a way that allows for easy retrieval of individual counts, the data structure may need to be modified.
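The difference between the reported behavior and the intended one can be shown with a toy model. This is a hypothetical sketch in Python, not Rspamd's actual code: the function names and the list-of-learn-events representation are invented for illustration, and the real storage layout may differ.

```python
from collections import Counter


def rows_with_total(classes, learn_events):
    """Mimic the reported behavior: every row receives the grand total."""
    total = len(learn_events)
    return {name: total for name in classes}


def rows_per_class(classes, learn_events):
    """Intended behavior: each row receives only its own class's count."""
    counts = Counter(learn_events)
    return {name: counts.get(name, 0) for name in classes}


if __name__ == "__main__":
    # Hypothetical learn history: one entry per learned message, by class.
    classes = ["phishing", "scam", "malware"]
    events = ["phishing", "phishing", "scam"]

    print("reported:", rows_with_total(classes, events))
    print("expected:", rows_per_class(classes, events))
```

In this model the per-class rows always sum to the total that the buggy variant repeats on every row, which is a simple consistency check for any fix.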

Impact and Mitigation

Impact

The incorrect statistics can lead to inaccurate monitoring and tuning of the Rspamd system. Without accurate counts for each bayes category and fuzzy rule, it becomes difficult to identify which categories or rules are performing well and which ones need adjustment. This can result in suboptimal spam filtering performance.

Mitigation

While a permanent fix is being developed, a temporary workaround is to manually query the Redis database for the specific counts associated with each category and rule. This can be done using the redis-cli command-line tool or a Redis GUI client. However, this is a manual and time-consuming process and is not a sustainable solution for long-term monitoring.
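As a starting point for the manual route, the sketch below only assembles `redis-cli` command lines for the bayes Redis server from statistics.conf. It deliberately does not name the keys that hold the counters: that layout is not given in the report and depends on the Rspamd version and classifier settings, so `EXAMPLE_KEY` is a placeholder and the use of `HGETALL` assumes the key turns out to be a hash. Check the output of `TYPE` first.

```python
import shlex


def split_server(server):
    """Split a "host:port" string into (host, port)."""
    host, _, port = server.rpartition(":")
    return host, int(port)


def redis_cli_args(server, *command):
    """Build the argument list for a redis-cli call against `server`."""
    host, port = split_server(server)
    return ["redis-cli", "-h", host, "-p", str(port), *command]


if __name__ == "__main__":
    server = "127.0.0.1:16379"  # bayes backend from statistics.conf

    # List keys incrementally; narrow the pattern once the key names are known.
    print(shlex.join(redis_cli_args(server, "--scan", "--pattern", "*")))

    # EXAMPLE_KEY is a placeholder, not a real Rspamd key name.
    print(shlex.join(redis_cli_args(server, "TYPE", "EXAMPLE_KEY")))
    # Only meaningful if TYPE reports "hash" (assumption).
    print(shlex.join(redis_cli_args(server, "HGETALL", "EXAMPLE_KEY")))
```

The printed commands can be pasted into a shell, or the argument lists can be passed to `subprocess.run` if the output needs to be collected by a script.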

Conclusion

The bug in Rspamd version 3.14.0, where the WebUI statistics page displays incorrect counts for bayes categories and fuzzy rules, is a significant issue that affects the accuracy of monitoring and tuning. By understanding the symptoms, reproducing the bug, and exploring potential solutions, developers can work towards a permanent fix. In the meantime, manual workarounds can be used to mitigate the impact of the issue. For more information on Rspamd and its features, visit the official Rspamd website.
