Semantic Split is a lovely, working NLP splitter built on spaCy. The problem is that it is English-only by default. This version intends to make the classes more general to instantiate: it exposes the spaCy model being used, along with other fine-tuning parameters such as the maximum number of sentences and the maximum number of characters per segment. :D
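The intended API could look something like the sketch below. All parameter names here (`spacy_model`, `group_max_sentences`, `group_max_chars`) are illustrative placeholders for the fine-tuning knobs described above, not the library's actual signatures:

```python
# Sketch of the generalized, language-agnostic API this fork aims for.
# The parameter names below are hypothetical, not the real semantic-split
# signatures; they only illustrate the intended direction.

class SpacySentenceSplitter:
    def __init__(self, spacy_model="en_core_web_sm"):
        # Any installed spaCy pipeline could be injected here,
        # e.g. "it_core_news_sm" for Italian or "de_core_news_sm" for German.
        self.spacy_model = spacy_model


class SimilarSentenceSplitter:
    def __init__(self, similarity_model, sentence_splitter,
                 group_max_sentences=5, group_max_chars=1000):
        # Caps on segment size: number of sentences and total characters.
        self.similarity_model = similarity_model
        self.sentence_splitter = sentence_splitter
        self.group_max_sentences = group_max_sentences
        self.group_max_chars = group_max_chars
```

With such constructors, switching language would be a one-line change at instantiation time instead of an edit inside the library.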
A Python library to chunk/group your text based on semantic similarity - ideal for pre-processing data for Language Models or Vector Databases. Leverages SentenceTransformers and spaCy.
Better Context: Providing more relevant context to your prompts enhances the LLM's performance (arXiv:2005.14165 [cs.CL]). Semantic-Split groups related sentences together, ensuring your prompts have relevant context.
Improved Results: Short, precise prompts often yield the best results from LLMs (arXiv:2004.04906 [cs.CL]). By grouping semantically similar sentences, Semantic-Split helps you craft such efficient prompts.
Cost Savings: LLMs like GPT-4 charge per token and have a token limit (e.g., 8K tokens). With Semantic-Split, you can make your prompts shorter and more meaningful, leading to potential cost savings.
A real-world example:
Imagine you're building an application where users ask questions about articles:
You pre-process each article by splitting it into semantically similar chunks with semantic-split and storing the chunks in a Vector DB as embeddings; when a user asks a question, the Vector DB is searched for the chunks most relevant to that question, and those chunks become the context for the prompt. The first part, semantic sentence splitting (grouping), is crucial. If we don't split or group the sentences semantically, we risk losing essential information, which diminishes the Vector DB's ability to identify the most suitable chunks. Consequently, we end up with poorer context for our prompts, and the quality of the responses suffers.
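The retrieval side of this workflow can be sketched with a toy embedding. Here a bag-of-words vector and cosine similarity stand in for real SentenceTransformers embeddings, and `embed`/`search` are illustrative helpers, not part of semantic-split:

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words Counter (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(chunks, query):
    """Return the stored chunk most similar to the user's question."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(embed(c), q))


# Chunks produced by semantic splitting, stored as "embeddings".
chunks = [
    "Dogs are loyal pets. Cats are independent pets.",
    "Robots use AI. Spacecraft are piloted by artificial intelligence.",
]
best = search(chunks, "Which animals make good pets?")
```

The better the chunking in step one, the more coherent the chunk that `search` hands back to the prompt.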
```shell
pip install semantic-split
python -m spacy download en_core_web_sm
```
You might need CUDA for the SentenceTransformers similarity model. An easy fix is to install PyTorch directly:

```shell
pip install torch
```

if you have a GPU, or (this requires Python 3.8 for some reason)

```shell
pip3 install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
```

if you don't. Alternatively:

```shell
conda install cudatoolkit
```
For development, install the dependencies and the spaCy model:

```shell
poetry install
python -m spacy download en_core_web_sm
```

Example input:

I dogs are amazing.
Cats must be the easiest pets around.
Robots are advanced now with AI.
Flying in space can only be done by Artificial intelligence.
This should be grouped as:

```python
[["I dogs are amazing.", "Cats must be the easiest pets around."],
 ["Robots are advanced now with AI.", "Flying in space can only be done by Artificial intelligence."]]
```
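The grouping above can be sketched as a greedy pass that starts a new segment whenever the similarity between consecutive sentences drops below a threshold. This is a toy sketch: `similarity` here is plain word overlap (Jaccard), not a SentenceTransformers model, so it cannot capture that dogs and cats are both pets; the example sentences are therefore chosen to overlap lexically instead:

```python
import re


def similarity(a, b):
    """Toy similarity: Jaccard overlap of word sets (stand-in for a real model)."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def group_sentences(sentences, threshold=0.1):
    """Greedily group consecutive sentences whose similarity clears the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(prev, cur) >= threshold:
            groups[-1].append(cur)   # similar enough: extend the current group
        else:
            groups.append([cur])     # dissimilar: start a new group
    return groups


sentences = [
    "Dogs are loyal pets.",
    "Cats are playful pets.",
    "Rockets fly to space.",
    "Space travel needs rockets.",
]
groups = group_sentences(sentences)
```

A real embedding model replaces word overlap with semantic similarity, which is what lets the library group "dogs" with "cats" and "robots" with "artificial intelligence".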
```python
from semantic_split import SimilarSentenceSplitter, SentenceTransformersSimilarity, SpacySentenceSplitter

text = """
  I dogs are amazing.
  Cats must be the easiest pets around.
  Robots are advanced now with AI.
  Flying in space can only be done by Artificial intelligence."""

model = SentenceTransformersSimilarity()
sentence_splitter = SpacySentenceSplitter()
splitter = SimilarSentenceSplitter(model, sentence_splitter)

res = splitter.split(text)
```
To run the tests:

```shell
poetry run pytest
```