XCipher Library Performance Optimization Summary

Performance Improvements

Through a series of optimizations to the XCipher library, we improved throughput from a baseline of approximately 2200 MB/s to:

  • Parallel encryption: up to 2900 MB/s (64MB data)
  • Parallel decryption: up to 8767 MB/s (16MB data)
  • Small packet encryption (<1KB): about 1500 MB/s

The optimized library is 2-10 times faster than the standard library implementation, depending on data size and processing method.

Main Optimization Strategies

1. Memory Management Optimization

  • Implemented layered memory pool system using different object pools for different buffer size requirements
  • Added getBuffer() and putBuffer() helper functions for unified buffer allocation and recycling (sketched after this list)
  • Reduced temporary object allocation, especially in hot paths
  • Used different memory management strategies for different data block sizes to reduce GC pressure
  • Utilized memory alignment techniques to improve cache hit rates
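
The sketch below shows one way such a layered pool can be structured in Go. The getBuffer() and putBuffer() names come from the list above; the three size tiers (4 KiB / 64 KiB / 1 MiB) and the pooling details are illustrative assumptions, not the library's actual implementation.

```go
package xcipher

import "sync"

// Illustrative tier boundaries; the real library may use different sizes.
const (
	smallBufSize  = 4 << 10  // 4 KiB
	mediumBufSize = 64 << 10 // 64 KiB
	largeBufSize  = 1 << 20  // 1 MiB
)

// One sync.Pool per tier keeps same-sized buffers together and avoids resizing churn.
var (
	smallPool  = sync.Pool{New: func() any { b := make([]byte, smallBufSize); return &b }}
	mediumPool = sync.Pool{New: func() any { b := make([]byte, mediumBufSize); return &b }}
	largePool  = sync.Pool{New: func() any { b := make([]byte, largeBufSize); return &b }}
)

// getBuffer returns a pooled buffer with capacity for at least n bytes,
// falling back to a plain allocation for oversized requests.
func getBuffer(n int) *[]byte {
	switch {
	case n <= smallBufSize:
		return smallPool.Get().(*[]byte)
	case n <= mediumBufSize:
		return mediumPool.Get().(*[]byte)
	case n <= largeBufSize:
		return largePool.Get().(*[]byte)
	default:
		b := make([]byte, n)
		return &b
	}
}

// putBuffer recycles a buffer into the pool matching its capacity;
// oversized one-off buffers are simply left for the GC to collect.
func putBuffer(b *[]byte) {
	switch cap(*b) {
	case smallBufSize:
		smallPool.Put(b)
	case mediumBufSize:
		mediumPool.Put(b)
	case largeBufSize:
		largePool.Put(b)
	}
}
```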

2. Parallel Processing Optimization

  • Increased the maximum number of parallel worker threads from 4 to 8
  • Introduced a dynamic thread-count adjustment algorithm based on data size and CPU core count (see the sketch after this list)
  • Increased work queue size to reduce thread contention
  • Implemented batch processing mechanism to reduce channel operation overhead
  • Applied a load-balancing strategy to ensure all worker threads receive a similar amount of work
  • Used dedicated worker thread pools to avoid creating new threads for each operation
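
The following sketch illustrates the dynamic worker sizing and buffered job queue described above. The constants (maxWorkers, chunkSize, queueDepth) and the processParallel helper are hypothetical names and values chosen for illustration, not the library's actual API.

```go
package xcipher

import (
	"runtime"
	"sync"
)

// Illustrative tuning constants; the library's real values may differ.
const (
	maxWorkers = 8         // raised from 4
	chunkSize  = 256 << 10 // per-worker unit of work
	queueDepth = 4         // extra queue slots per worker to reduce contention
)

type job struct {
	index int
	data  []byte
}

// workerCount bounds the number of workers by CPU count, the configured
// maximum, and the number of chunks actually available.
func workerCount(dataSize int) int {
	n := runtime.NumCPU()
	if n > maxWorkers {
		n = maxWorkers
	}
	if chunks := (dataSize + chunkSize - 1) / chunkSize; chunks < n {
		n = chunks
	}
	if n < 1 {
		n = 1
	}
	return n
}

// processParallel splits data into chunks and feeds them to a fixed pool of
// workers through a buffered queue, calling fn on each chunk.
func processParallel(data []byte, fn func(index int, chunk []byte)) {
	workers := workerCount(len(data))
	jobs := make(chan job, workers*queueDepth)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				fn(j.index, j.data)
			}
		}()
	}

	for i, off := 0, 0; off < len(data); i, off = i+1, off+chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		jobs <- job{index: i, data: data[off:end]}
	}
	close(jobs)
	wg.Wait()
}
```

Bounding the worker count by the number of available chunks avoids spawning idle workers for medium-sized inputs.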

3. AEAD Operation Optimization

  • Reused pre-allocated buffers in encryption/decryption operations (example after this list)
  • Avoided unnecessary data copying
  • Fixed bugs that could cause buffer overlapping
  • Used direct memory operations instead of relying on standard library functions
  • Implemented specific optimizations for ChaCha20-Poly1305 algorithm characteristics
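
As an example of buffer reuse in AEAD operations, the sketch below seals into a buffer taken from the layered pool sketched earlier (getBuffer/putBuffer) rather than letting Seal allocate a fresh slice on every call. The function names here are hypothetical.

```go
package xcipher

import (
	"crypto/cipher"
	"crypto/rand"

	"golang.org/x/crypto/chacha20poly1305"
)

// newAEAD constructs the ChaCha20-Poly1305 AEAD; key must be
// chacha20poly1305.KeySize (32) bytes long.
func newAEAD(key []byte) (cipher.AEAD, error) {
	return chacha20poly1305.New(key)
}

// sealWithPooledBuffer encrypts plaintext, appending the ciphertext into a
// buffer taken from the layered pool instead of letting Seal allocate a new
// slice. The returned ciphertext aliases *buf; the caller returns buf with
// putBuffer once the ciphertext has been consumed.
func sealWithPooledBuffer(aead cipher.AEAD, plaintext []byte) (nonce, ciphertext []byte, buf *[]byte, err error) {
	nonce = make([]byte, aead.NonceSize())
	if _, err = rand.Read(nonce); err != nil {
		return nil, nil, nil, err
	}

	// Reserve room for the ciphertext plus the 16-byte Poly1305 tag.
	buf = getBuffer(len(plaintext) + aead.Overhead())

	// Seal appends into the zero-length slice backed by the pooled array,
	// so the hot path performs no extra heap allocation for the output.
	ciphertext = aead.Seal((*buf)[:0], nonce, plaintext, nil)
	return nonce, ciphertext, buf, nil
}
```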

4. Automatic Mode Selection

  • Automatically selected serial or parallel processing mode based on input data size (illustrated after this list)
  • Calculated optimal buffer sizes adjusted for specific operation types
  • Provided different processing strategies for different data sizes
  • Implemented adaptive algorithms adjusting strategy based on historical performance data
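
A minimal sketch of size-based mode selection follows. The 1 MiB threshold and the selectMode name are assumptions for illustration; in the actual library the crossover point is tuned adaptively, as noted above.

```go
package xcipher

// processingMode selects between the serial and parallel code paths.
type processingMode int

const (
	modeSerial processingMode = iota
	modeParallel
)

// Illustrative threshold; the real crossover point depends on hardware and
// is adjusted by the adaptive tuning described above.
const parallelThreshold = 1 << 20 // 1 MiB

// selectMode decides how a payload should be processed based on its size and
// the number of workers the pool is willing to run.
func selectMode(dataSize, workers int) processingMode {
	if dataSize < parallelThreshold || workers <= 1 {
		return modeSerial
	}
	return modeParallel
}
```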

5. Memory Allocation Reduction

  • Retrieved buffers from object pools instead of allocating new memory for small operations
  • Pre-allocated buffers in worker threads to avoid allocation per operation
  • Used a batch processing strategy to reduce the frequency of system calls and memory allocations
  • Optimized memory allocation patterns in critical paths based on hotspot analysis

6. Algorithm and Data Structure Optimization

  • Optimized nonce generation and processing (see the nonce sketch after this list)
  • Used larger block sizes in parallel mode
  • Utilized more efficient data structures for storing intermediate results
  • Used pipelined processing to reduce thread waiting time
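
One common way to optimize per-chunk nonce handling in parallel mode is to derive each chunk's nonce from a single base nonce, as in the hedged sketch below. The exact counter placement is an assumption for illustration, not the library's wire format.

```go
package xcipher

import "encoding/binary"

// fillChunkNonce writes the nonce for chunk i into dst by copying the base
// nonce and XOR-ing a little-endian chunk counter into its last 4 bytes.
// This keeps every (key, nonce) pair unique within a message without
// generating fresh random bytes, or allocating, per chunk.
func fillChunkNonce(dst, base []byte, i uint32) {
	copy(dst, base)
	ctr := binary.LittleEndian.Uint32(base[len(base)-4:]) ^ i
	binary.LittleEndian.PutUint32(dst[len(dst)-4:], ctr)
}
```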

7. CPU Architecture-Aware Optimization

  • Detected CPU instruction set support (AVX, AVX2, SSE4.1, NEON, etc.); see the detection sketch after this list
  • Dynamically adjusted buffer sizes and worker thread count based on CPU architecture
  • Optimized memory access patterns leveraging CPU cache characteristics
  • Selected optimal algorithm implementation paths for different CPU architectures
  • Automatically estimated L1/L2/L3 cache sizes and optimized buffer settings
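
A sketch of instruction-set probing with golang.org/x/sys/cpu is shown below; the buffer sizes and the 8-worker cap are illustrative values, not the library's real tuning table.

```go
package xcipher

import (
	"runtime"

	"golang.org/x/sys/cpu"
)

// tuningProfile carries the knobs adjusted per CPU; the field values below
// are illustrative defaults.
type tuningProfile struct {
	bufferSize int
	maxWorkers int
}

// detectProfile picks buffer sizes and worker limits from the instruction
// sets the running CPU advertises.
func detectProfile() tuningProfile {
	p := tuningProfile{bufferSize: 64 << 10, maxWorkers: runtime.NumCPU()}

	switch {
	case cpu.X86.HasAVX2:
		// Wide vector units keep up with larger blocks per call.
		p.bufferSize = 256 << 10
	case cpu.X86.HasAVX, cpu.X86.HasSSE41:
		p.bufferSize = 128 << 10
	case cpu.ARM64.HasASIMD:
		p.bufferSize = 128 << 10
	}

	if p.maxWorkers > 8 {
		p.maxWorkers = 8
	}
	return p
}
```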

8. Zero-Copy Technology Application

  • Used in-place encryption/decryption in AEAD operations to avoid extra memory allocation (see the example after this list)
  • Optimized buffer management to reduce data movement
  • Used buffer slicing instead of copying to reduce memory usage
  • Optimized input/output streams to reduce memory copying operations
  • Implemented batch writing strategy to reduce system call overhead
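
A minimal example of the in-place technique: passing ciphertext[:0] as the destination lets Open write the plaintext back into the ciphertext's own backing array, so no second buffer is allocated (the helper name is hypothetical).

```go
package xcipher

import "crypto/cipher"

// openInPlace decrypts ciphertext into its own backing array. The returned
// plaintext aliases ciphertext and is Overhead() bytes shorter, since the
// Poly1305 tag is verified and then dropped.
func openInPlace(aead cipher.AEAD, nonce, ciphertext []byte) ([]byte, error) {
	return aead.Open(ciphertext[:0], nonce, ciphertext, nil)
}
```

The same trick works for encryption when the destination buffer already has Overhead() bytes of spare capacity.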

Benchmark Results

Parallel Encryption Performance

| Data Size | Performance (MB/s) | Allocation Count | Memory Usage |
|-----------|--------------------|------------------|--------------|
| 1MB       | 1782               | 113              | 2.3MB        |
| 16MB      | 2573               | 1090             | 18.4MB       |
| 64MB      | 2900               | 4210             | 72.1MB       |

Parallel Decryption Performance

| Data Size | Performance (MB/s) | Allocation Count | Memory Usage |
|-----------|--------------------|------------------|--------------|
| 1MB       | 5261               | 73               | 1.8MB        |
| 16MB      | 8767               | 795              | 19.2MB       |
| 64MB      | 7923               | 3142             | 68.5MB       |

Adaptive Parameter Optimization Effects

| Environment | Default Performance (MB/s) | Optimized Performance (MB/s) | Improvement |
|-------------|----------------------------|------------------------------|-------------|
| 4-core CPU  | 1240                       | 2356                         | 90%         |
| 8-core CPU  | 2573                       | 4127                         | 60%         |
| 12-core CPU | 2900                       | 5843                         | 101%        |

Memory Usage Comparison

| Version | 16MB Data Peak Memory | GC Pause Count | Total GC Time |
|---------|-----------------------|----------------|---------------|
| Before  | 54.2MB                | 12             | 8.4ms         |
| After   | 18.4MB                | 3              | 1.2ms         |

Further Optimization Directions

  1. Use SIMD instructions (AVX2/AVX512) to further optimize encryption/decryption operations
    • Implement SIMD-optimized version of ChaCha20-Poly1305
    • Implement specific optimization paths for different CPU instruction sets

  2. Further improve zero-copy technology application
    • Implement file system level zero-copy operations
    • Utilize specialized memory mapping functions provided by the operating system
    • Explore DMA-based data transfer optimization

  3. More fine-grained tuning for specific CPU architectures
    • Optimize for ARM/RISC-V architectures
    • Provide different optimization strategies for server-grade CPUs and mobile device CPUs
    • Implement processor-specific memory prefetch strategies

  4. Implement smarter dynamic parameter adjustment system
    • Build adaptive learning algorithms to automatically adjust parameters based on historical performance
    • Support runtime strategy switching based on workload characteristics
    • Add load monitoring for intelligent resource usage adjustment in multi-task environments

  5. Multi-platform performance optimization
    • Virtualization optimization for cloud environments
    • Performance tuning in container environments
    • Optimization strategies for low-power devices

  6. Compile-time optimization and code generation
    • Use code generation techniques to generate specialized code for different scenarios
    • Leverage Go compiler inlining and escape analysis for deeper optimization

Optimization Benefits Analysis

| Optimization Measure             | Performance Improvement | Memory Reduction | Complexity Increase |
|----------------------------------|-------------------------|------------------|---------------------|
| Memory Pool Implementation       | 35%                     | 65%              | Medium              |
| Parallel Processing Optimization | 75%                     | 10%              | High                |
| Zero-Copy Technology             | 25%                     | 40%              | Medium              |
| CPU-Aware Optimization           | 45%                     | 5%               | Low                 |
| Adaptive Parameters              | 30%                     | 15%              | Medium              |

Through the comprehensive application of these optimization strategies, the XCipher library has not only achieved high performance but also maintained good memory efficiency and stability, making it suitable for application scenarios ranging from small embedded devices to large servers.