glm-4-flash
- 125K Context
- 0.01/M Input Tokens
- 0.01/M Output Tokens
- ChatGLM
- Text-to-text
- 15 Nov 2024
GLM-4-Flash Model Introduction
Key Capabilities and Primary Use Cases
- Handles multi-turn dialogues, web searches, and tool calls (a minimal chat sketch follows this list).
- Supports long-text inference with a context length of up to 128K tokens and an output length of up to 4K tokens.
- Multilingual support for 26 languages, including Chinese, English, Japanese, Korean, and German.
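To illustrate the multi-turn dialogue flow, here is a minimal sketch that keeps the conversation history in a `messages` list and resends it on each turn. It assumes the vendor's `zhipuai` Python SDK and its OpenAI-style chat interface; the API key and prompts are placeholders.

```python
# Minimal multi-turn chat sketch for GLM-4-Flash.
# Assumes the `zhipuai` Python SDK (OpenAI-style chat.completions interface).
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Summarize what speculative sampling is."},
]

# First turn
reply = client.chat.completions.create(model="glm-4-flash", messages=messages)
answer = reply.choices[0].message.content
print(answer)

# Second turn: append the assistant reply so the model sees the full history
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "Give a one-sentence example."})
reply = client.chat.completions.create(model="glm-4-flash", messages=messages)
print(reply.choices[0].message.content)
```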
Most Important Features and Improvements
- Optimized for speed using adaptive weight quantization, parallel processing, batch processing, and speculative sampling.
- Fine-tuning support is available to adapt the model to specific application scenarios.
- Advanced features include web browsing, code execution, and custom tool calls (a tool-call sketch follows this list).
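The custom tool-call flow can be sketched as below, assuming GLM-4-Flash accepts OpenAI-style `tools` definitions through the same `zhipuai` SDK; the `get_weather` tool and its schema are hypothetical and stand in for any user-defined function.

```python
# Custom tool-call sketch (assumes OpenAI-style `tools` support in the GLM-4 API;
# `get_weather` is a hypothetical tool name used for illustration).
import json
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4-flash",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decides to call the tool, the arguments arrive as a JSON string.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```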
Essential Technical Specifications
- Pre-trained on 10TB of high-quality multilingual data.
- Supports multiple languages and long-text reasoning.
- Exact model size and parameter count are not specified, but the model is optimized for high inference performance.
Notable Performance Characteristics
- Achieves an inference speed of 72.14 tokens per second, significantly faster than comparable models (a simple throughput check is sketched after this list).
- Demonstrates superior performance in semantics, mathematics, reasoning, code, and knowledge tasks, outperforming models like Llama-3-8B[2][4].
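As a rough way to check tokens-per-second on your own traffic, the sketch below times a single request and divides by the reported completion-token count. It assumes the `zhipuai` SDK returns OpenAI-style `usage` statistics on the completion object; the measured rate includes network latency, so it will not match server-side benchmark figures such as the 72.14 tokens/s above.

```python
# Rough client-side throughput check for GLM-4-Flash.
# Assumes OpenAI-style usage statistics (prompt/completion token counts).
import time
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

start = time.perf_counter()
response = client.chat.completions.create(
    model="glm-4-flash",
    messages=[{"role": "user", "content": "Write a 300-word overview of transformers."}],
)
elapsed = time.perf_counter() - start

completion_tokens = response.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"~ {completion_tokens / elapsed:.1f} tokens/s (includes network latency)")
```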