Best Practices for Preparing Documents
Best Practices Guide for Preparing Documents for Vector Databases on the Orchestrator Platform
Introduction
Vector databases store document embeddings and retrieve documents based on content similarity. For the most accurate retrieval on the Orchestrator platform, follow these formatting and structuring guidelines.
Note: Documents are converted into vectors (an AI-readable format); the original documents are not stored in our systems.
Key Guidelines
Accurate Document Title
Ensure titles clearly reflect the document's core information.
Bad case: "InfoAboutCompany.docx"
Good case: "CompanyX: 2023 Financial Overview"
Avoidance of Prescriptive Statements
Maintain content clarity by avoiding directives or overly casual language.
Bad case: "Hey, this doc's all about Product Y!"
Good case: "Product Y: Features and Specifications"
Logical Document Structure
Organize content with clear separations for ease of parsing and comprehension.
Bad case: Jumbled content without clear separations.
Good case: Defined headings, subheadings, and bullet points.
Layout Structure Caution
Ensure key details are within the main content flow, not in layout elements that may not be parsed correctly.
Bad case: Key details inside a text box.
Good case: Information placed directly within the main content flow.
Table Formatting
Properly format tables for correct data interpretation.
Single Row Headers: Avoid multiple header rows.
Avoid Split Tables: If tables span multiple pages, include headers on each page.
Bad case: Table with three header rows; tables split across pages without headers.
Good case: Table with a single header; table headers on each page if split.
Consistency in Terminology
Use terms consistently for clarity.
Bad case: Interchanging "eco-friendly" and "green tech" randomly.
Good case: Consistent use of "eco-friendly" throughout.
Moderate Document Length
Concise documents ensure efficient and accurate analysis.
Bad case: A 15,000-word document on a topic.
Good case: Split content into three 5,000-word segments with unique titles.
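The split in the good case can be sketched in Python as follows. This is a minimal illustration and not part of the Orchestrator platform: the function names `split_by_words` and `titled_segments`, the 5,000-word default, and the "Part N of M" title pattern are all assumptions chosen to mirror the example above. In practice you would likely split at section boundaries rather than at a fixed word count.

```python
def split_by_words(text: str, max_words: int = 5000) -> list[str]:
    """Split text into consecutive segments of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()  # splits on any whitespace; original line breaks are lost
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def titled_segments(title: str, text: str, max_words: int = 5000) -> list[tuple[str, str]]:
    """Pair each segment with a unique title such as 'Title (Part 1 of 3)'."""
    segments = split_by_words(text, max_words)
    total = len(segments)
    return [
        (f"{title} (Part {number} of {total})", segment)
        for number, segment in enumerate(segments, start=1)
    ]
```

Calling `titled_segments("Product Y Manual", long_text)` on a 15,000-word text would return three (title, segment) pairs, each with its own title.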
Timely Updates
Regularly update documents to ensure accuracy.
Bad case: Outdated product specifications.
Good case: Regularly updated specs to match the latest version.
Document Quality and Accuracy
Ensure documents come from reputable sources.
Double-check facts and information to avoid embedding misinformation.
Language and Grammar
Use clear, concise, and grammatically correct language to aid AI understanding.
Metadata Utilization
Utilize document metadata (author, creation date, tags) for better embedding results.
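One way to keep metadata alongside the text before upload is a small record type. The sketch below is hypothetical: `DocumentRecord` and its field names are illustrative assumptions, not an Orchestrator schema, and how the platform actually consumes metadata is not specified here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentRecord:
    """Hypothetical container pairing document text with its metadata."""
    title: str
    text: str
    author: str
    created: date
    tags: list[str] = field(default_factory=list)

    def metadata(self) -> dict:
        """Return the metadata (without the text) as plain, JSON-friendly values."""
        return {
            "title": self.title,
            "author": self.author,
            "created": self.created.isoformat(),
            "tags": sorted(set(self.tags)),  # de-duplicate and order tags
        }
```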
Embedding Granularity
Decide whether to embed the whole document as one vector or to embed paragraphs or sentences separately.
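To illustrate the finer-grained options, the following sketch splits a document into paragraphs and then into sentences using only the standard library. The function names are assumptions, and the sentence splitter is deliberately naive: it breaks after ".", "!" or "?" followed by whitespace, so abbreviations such as "e.g." will be split incorrectly.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
```

Each returned paragraph or sentence would then be embedded as its own vector, instead of embedding the full document once.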
Document Preprocessing
Use techniques like tokenization, stemming, and lemmatization.
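Where it helps, remove stop words to improve embedding quality; this tends to benefit sparse methods such as TF-IDF more than contextual models such as BERT, which are generally given unmodified text.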
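A minimal sketch of tokenization and stop-word removal, using only the Python standard library, is shown below. The stop-word list is a tiny illustrative assumption, not a recommended list. Stemming and lemmatization are not shown because they normally rely on an NLP library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in STOP_WORDS, keeping the original order."""
    return [token for token in tokens if token not in STOP_WORDS]
```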
Embedding Algorithms
Choose the appropriate embedding algorithm (e.g., TF-IDF, Word2Vec, FastText, BERT) for your use case.
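As a concrete reference point for the simplest of these options, here is a textbook TF-IDF weighting written in plain Python. It is a sketch under stated assumptions: the function name `tf_idf` is illustrative, the input is already tokenized, and the formula is the unsmoothed variant (term frequency times log of inverse document frequency). Library implementations often use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that occurs in every document gets weight zero, which is why TF-IDF highlights the terms that distinguish one document from the others.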
Dimensionality
Choose the dimensionality of the embeddings, balancing the detail they capture against storage and computational cost.
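The storage side of this trade-off is simple arithmetic. The sketch below assumes each vector component is stored as a 4-byte (32-bit) float and counts only the raw vectors; index structures and metadata add overhead that is not modelled. The function name is an illustrative assumption.

```python
def raw_vector_storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone, excluding index and metadata overhead."""
    return num_vectors * dimensions * bytes_per_value
```

Under these assumptions, one million 768-dimensional vectors need 3,072,000,000 bytes (about 3.07 GB), and halving the dimensionality to 384 halves that figure.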
Document Versioning
Maintain document versions to track changes and decide which version to embed.
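One lightweight way to track which version has been embedded is to store a content fingerprint per document. The sketch below is an assumption about how this could be done, not a platform feature: `content_fingerprint`, `needs_embedding`, and the `embedded` mapping of document ID to fingerprint are hypothetical names.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_embedding(doc_id: str, text: str, embedded: dict[str, str]) -> bool:
    """True if the document is new or its content changed since it was last embedded."""
    return embedded.get(doc_id) != content_fingerprint(text)
```

After embedding a document, record `embedded[doc_id] = content_fingerprint(text)` so that unchanged versions are skipped on the next run.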
Handling Multimedia Elements
Decide how to handle non-textual elements (images, charts, audio clips). Options include removal, textual descriptions, or separate processing.
Regular Re-embedding
Periodically re-embed documents so that the stored vectors continue to reflect the current content.
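A simple age-based policy can be sketched as follows. The function name, the `last_embedded` mapping of document ID to timestamp, and the idea of a fixed maximum age are illustrative assumptions; the right interval depends on how often your content changes.

```python
from datetime import datetime, timedelta


def stale_documents(
    last_embedded: dict[str, datetime],
    now: datetime,
    max_age: timedelta,
) -> list[str]:
    """Return IDs whose last embedding is older than max_age, oldest first."""
    stale = [
        (timestamp, doc_id)
        for doc_id, timestamp in last_embedded.items()
        if now - timestamp > max_age
    ]
    return [doc_id for _, doc_id in sorted(stale)]
```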
Evaluation and Feedback
Test embedding results for accuracy and implement feedback mechanisms to improve the process.
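One common way to quantify retrieval accuracy is recall at k: for a test query with known relevant documents, measure what fraction of them appear in the top k results. The sketch below assumes you have such labelled queries; the function and parameter names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)
```

Averaging this score over a set of test queries before and after a change (for example, a new chunk size) gives a rough signal of whether the change helped.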
Security and Privacy
Handle sensitive information carefully, redacting it before embedding where appropriate.
Ensure compliance with data privacy regulations.
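As a rough illustration of redaction before embedding, the sketch below masks e-mail addresses and phone-like numbers with regular expressions. The patterns and placeholder tokens are assumptions for demonstration only: they are intentionally broad, will miss many kinds of sensitive data, and are not sufficient on their own to meet any privacy regulation.

```python
import re

# Illustrative patterns only; real redaction needs a broader, reviewed rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```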
Handling Special Characters and Symbols
Decide whether to retain, replace, or remove special characters and symbols, as some embedding algorithms may not handle them well.
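If you choose to normalize rather than simply remove such characters, Python's standard `unicodedata` module covers much of the work. The sketch below is one possible policy, not a recommendation for every corpus: it applies NFKC normalization (which folds compatibility characters such as ligatures into plain equivalents), drops control and invisible format characters, and collapses whitespace. The function name is an assumption.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, drop control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in text:
        if ch in "\n\t":
            kept.append(" ")  # treat newlines and tabs as ordinary spaces
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # skip control and invisible format characters
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())
```

Note that NFKC changes some symbols that may carry meaning (for example, superscript digits become ordinary digits), so check the result against your own content before adopting it.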
Following these best practices helps make the embedding process thorough and efficient, and yields higher-quality, more relevant vectors in your database.
