huggingface/transformers

156,730 stars | 32,135 forks | Python | apache-2.0 | 0 views

Transformers

Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and performance, including techniques like quantization, speculative decoding, and paged memory management for key-value caches. It provides native integration for distributed training across multi-node clusters, as well as flexible APIs for serving models via compatible inference servers. Developers can also utilize built-in utilities for model patching, custom kernel execution, and automated documentation generation to streamline development workflows.
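To make the paged key-value cache idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the class name PagedKVCache, its methods, and the block counts are hypothetical, and it tracks only which fixed-size blocks each sequence owns, not the tensors themselves.

```python
from collections import deque


class PagedKVCache:
    """Toy block allocator (hypothetical, for illustration only).

    Each sequence owns a list of fixed-size blocks (its "block table").
    A new block is taken from the free pool only when the last one is full.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = deque(range(num_blocks))  # ids of unused blocks
        self.block_tables = {}  # seq_id -> list of block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block_id, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # Last block is full (or the sequence is new): allocate a block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.popleft())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated one at a time and returned to a shared pool, a finished sequence never leaves behind a gap that only a sequence of the same length could reuse.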

Features

API Frameworks - A comprehensive training API for models that supports distributed training, mixed precision, and integration with various hardware accelerators.
Hybrid Parallelism Strategies - A training approach combining data, pipeline, and tensor parallelism to scale large language models across multi-node, multi-GPU clusters.
Byte Pair Encodings - A subword tokenization algorithm that iteratively merges the most frequent pair of adjacent symbols (initially single characters) to build a vocabulary.
Vision Transformers - A computer vision model that processes images by splitting them into fixed-size patches and treating the patches as a sequence of tokens.
Chat Template Formatters - A method for formatting chat history into the specific token sequences and control tokens required by a model's chat structure.
Large Model Optimizations - Optimizations for large models including automatic device mapping, half-precision weight support, and quantization to reduce memory footprint and accelerate inference.
Qwen2 Language Models - A family of pretrained and instruction-tuned large language models featuring group query attention, rotary positional embeddings, and support for long context lengths.
Checkpoint Resumption - A capability to resume training from a specific checkpoint path, restoring optimizer, scheduler, and random number generator states.
Batched Inference Mechanisms - A batch-processing mechanism that accepts lists of conversation sequences to enable efficient inference across multiple chat sessions in a single forward pass.
Attention Mechanisms - A registry-based interface for managing and extending attention functions, allowing models to register custom implementations or locally overwrite existing mechanisms.
Model Quantization - A collection of quantization methods to reduce model memory requirements by storing weights in lower precision while balancing accuracy and compression.
Tokenizer Base Interfaces - A base class providing a unified interface for tokenization, encoding, decoding, and vocabulary management across different tokenizer backends.
Tool Calling Patterns - A pattern for tool invocation that appends assistant-generated function requests and subsequent tool-role results to the conversation message list.
Paged KV Cache Management - A memory management system using fixed-size blocks to store key-value cache states, enabling efficient memory sharing and preventing fragmentation.
Configuration Management - A configuration class that centralizes hyperparameters, optimization settings, logging preferences, and infrastructure choices.
Transformers Integration Layers - A model loader that integrates with standard transformer libraries to handle device mapping, quantization, and attention backends, while extending the training loop with custom mixins.
Multimodal Input Handlers - A capability for multimodal models to process mixed-modality inputs, such as images, video, or audio, by specifying input types within the content structure.
Data Parallelism - A training strategy that splits each batch evenly across multiple GPUs; each GPU holds a full copy of the model and synchronizes gradients with the others, which reduces training time.
Sequence-to-Sequence Translation Tasks - A sequence-to-sequence framework for converting text between languages, supporting model fine-tuning, dataset preprocessing, evaluation, and inference.
Tool Calling Supports - Native support for structured function calling, allowing models to generate function requests that can be executed by the host application.
Training Flow Managers - A built-in callback that manages logging, evaluation, and checkpointing schedules based on training arguments, with support for customization.
Chunked Prefill Mechanisms - A technique that splits long prompt processing across multiple forward passes to prevent blocking other requests during generation.
Text Classification Tasks - A machine learning task that assigns labels to text sequences, commonly used for sentiment analysis or categorization.
Mixture of Experts - A workflow for mixture-of-experts models that captures expert routing indices during inference and replays them during training passes to maintain consistent expert paths.
Document Question Answering Pipelines - A high-level pipeline interface for performing document question answering inference by passing image and question inputs to a model.
Distributed Training Integrations - An integration layer for loading models directly into a distributed training framework, leveraging native components while utilizing parallelization and optimization techniques.
Generation Continuation Modes - A configuration option that allows the model to continue generating from the last message in the chat history rather than initiating a new assistant turn.
Asynchronous Batching Execution - An execution strategy that overlaps CPU request preparation with GPU computation using multiple streams and graph-based execution to improve performance.
Prompt Lookup Decoding - An optimization technique that proposes candidate tokens by identifying and copying repeating n-grams from the input prompt, avoiding the need for an external assistant model.
Edge Model Inference Runtimes - A lightweight runtime for edge device model inference that exports models into a portable format with ahead-of-time memory planning and hardware-specific operation dispatch.
Parallel Loading - Integration with tensor parallelism that shards tensors during materialization, allowing each rank to load only the necessary portion of the weight data.
Byte Level Encodings - A variant of subword tokenization that uses byte values as the base vocabulary, ensuring every word can be tokenized without requiring an unknown token.
Memory Efficient Evaluation - A technique that reduces memory use during evaluation by offloading accumulated predictions to the CPU and preprocessing logits at the batch level.
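
The Byte Pair Encodings entry above describes an iterative merge procedure, which can be sketched in a few lines of plain Python. This is a toy illustration, not the library's tokenizer code: the function name learn_bpe_merges and its parameters are hypothetical, words are split into characters with no end-of-word marker, and ties are broken by first occurrence.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges
```

For example, learn_bpe_merges(["aab", "aab", "ab"], 5) first merges ("a", "b"), which occurs three times, then ("a", "ab"), and then stops early because no adjacent pairs remain.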