
CLARIN-PL Embeddings

Transformer-based NLP library for text classification, sequence labeling, and embeddings in Polish.

A Python library for transformer-based text processing and embeddings in Polish, built as the core component of the NeurIPS 2022 LEPISZCZE benchmark. It abstracts away model configuration and training boilerplate through composable pipelines.

What it is

CLARIN-PL Embeddings wraps Hugging Face Transformers and PyTorch Lightning into composable pipelines for transformer-based NLP tasks. It offers pre-built pipelines for text classification and sequence labeling (NER, POS tagging, punctuation restoration), alongside Flair-based static embeddings. The library ships with hyperparameter search pipelines and integrates Polish-specific models such as HerBERT. It is compatible with 10+ Polish datasets curated by CLARIN-PL, covering tasks from sentiment analysis to question answering.
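To make the idea of a composable pipeline concrete, here is a minimal, library-independent sketch in plain Python. It does not use or reproduce the CLARIN-PL Embeddings API: the `Stage` class, its `then` and `run` methods, and the toy `load`, `tokenize`, and `evaluate` steps are all hypothetical names invented for illustration, and the "model" is a trivial keyword rule rather than a transformer.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


@dataclass(frozen=True)
class Stage(Generic[A, B]):
    """One pipeline step (hypothetical): a named function from A to B."""

    name: str
    fn: Callable[[A], B]

    def then(self, nxt: "Stage[B, C]") -> "Stage[A, C]":
        # Compose two stages into one stage that runs them in order.
        return Stage(f"{self.name} -> {nxt.name}", lambda x: nxt.fn(self.fn(x)))

    def run(self, x: A) -> B:
        return self.fn(x)


# Toy stand-ins for the load -> preprocess -> evaluate steps of a real pipeline.
def load(_: None) -> List[Tuple[str, str]]:
    return [("dobry film", "pos"), ("zły film", "neg"), ("bardzo dobry", "pos")]


def tokenize(rows: List[Tuple[str, str]]) -> List[Tuple[List[str], str]]:
    return [(text.split(), label) for text, label in rows]


def evaluate(rows: List[Tuple[List[str], str]]) -> float:
    # Rule-based placeholder model: predict "pos" when the token "dobry" occurs.
    hits = sum(("pos" if "dobry" in tokens else "neg") == label for tokens, label in rows)
    return hits / len(rows)


pipeline = (
    Stage("load", load)
    .then(Stage("tokenize", tokenize))
    .then(Stage("evaluate", evaluate))
)
accuracy = pipeline.run(None)
print(pipeline.name, accuracy)
```

Because each stage only depends on the output type of the previous one, a step such as `tokenize` could be swapped out without touching the rest of the chain, which is the property the word "composable" refers to here.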

Highlights

Text classification and sequence labeling pipelines using PyTorch Lightning
Hyperparameter search integration for finding optimal models per task
Pre-configured Polish datasets: PolEmo2, KPWR-NER, NKJP-POS, and more
Static and transformer embeddings with Flair backend support
Published at NeurIPS 2022 as part of the LEPISZCZE benchmark paper
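As a rough illustration of the hyperparameter search highlight, the sketch below runs an exhaustive grid search over a small search space. It is a generic technique shown under stated assumptions, not the search strategy or API used by the library: `grid_search`, `toy_objective`, and the values in `space` are hypothetical, and the objective is a synthetic score standing in for a validation metric from an actual training run.

```python
import itertools
import math
from typing import Any, Callable, Dict, List, Tuple

Params = Dict[str, Any]


def grid_search(
    objective: Callable[[Params], float],
    space: Dict[str, List[Any]],
) -> Tuple[Params, float]:
    """Evaluate every combination in `space` and return the best one."""
    names = list(space)
    best_params: Params = {}
    best_score = float("-inf")
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: Params) -> float:
    # Synthetic score (assumption): highest at learning_rate=3e-5, batch_size=32.
    # A real objective would train a model and return a validation metric.
    lr_penalty = abs(math.log10(params["learning_rate"] / 3e-5))
    batch_penalty = abs(params["batch_size"] - 32) / 32
    return 0.0 - lr_penalty - batch_penalty


# Illustrative search space; the values are not taken from the library.
space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 64],
}
best_params, best_score = grid_search(toy_objective, space)
print(best_params, best_score)
```

Grid search is used here only because it is deterministic and easy to follow; with larger search spaces, sampling-based strategies are usually the more practical choice.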