📖 Best Practices For Preparing Documents

Best Practices Guide for Preparing Documents for Vector Databases on the Orchestrator Platform

Introduction

Vector databases store document embeddings and retrieve documents by content similarity. For the most accurate retrieval on the Orchestrator platform, follow these formatting and structuring guidelines.

Note: Documents are converted into vectors (an AI-readable format); the documents themselves are not stored in our systems.

Key Guidelines

Accurate Document Title

Ensure titles clearly reflect the document's core information.

  • Bad case: "InfoAboutCompany.docx"

  • Good case: "CompanyX: 2023 Financial Overview"

Avoidance of Prescriptive Statements

Maintain content clarity by avoiding directives or overly casual language.

  • Bad case: "Hey, this doc's all about Product Y!"

  • Good case: "Product Y: Features and Specifications"

Logical Document Structure

Organize content with clear separations for ease of parsing and comprehension.

  • Bad case: Jumbled content without clear separations.

  • Good case: Defined headings, subheadings, and bullet points.

Layout Structure Caution

Ensure key details are within the main content flow, not in layout elements that may not be parsed correctly.

  • Bad case: Key details inside a text box.

  • Good case: Information placed directly within the main content flow.

Table Formatting

Properly format tables for correct data interpretation.

  • Single Row Headers: Avoid multiple header rows.

  • Avoid Split Tables: If tables span multiple pages, include headers on each page.

    • Bad case: Table with three header rows; tables split across pages without headers.

    • Good case: Table with a single header; table headers on each page if split.
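
As an illustration of the single-header guideline, the sketch below collapses several header rows into one by joining the labels in each column. It assumes the header rows are already available as equal-length lists with merged cells repeated; `flatten_headers` is a hypothetical helper, not part of the Orchestrator platform.

```python
def flatten_headers(header_rows: list[list[str]]) -> list[str]:
    """Join the labels of each column, top to bottom, into one header label."""
    columns = zip(*header_rows)  # regroup the rows into per-column tuples
    return [" ".join(part for part in column if part) for column in columns]


# Illustrative two-row header; the labels are made up for this example.
header_rows = [
    ["Revenue", "Revenue", "Costs"],
    ["2022", "2023", "2023"],
]
single_header = flatten_headers(header_rows)
print(single_header)  # ['Revenue 2022', 'Revenue 2023', 'Costs 2023']
```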

Consistency in Terminology

Use terms consistently for clarity.

  • Bad case: Interchanging "eco-friendly" and "green tech" randomly.

  • Good case: Consistent use of "eco-friendly" throughout.

Moderate Document Length

Concise documents ensure efficient and accurate analysis.

  • Bad case: A 15,000-word document on a topic.

  • Good case: Split content into three 5,000-word segments with unique titles.
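
A minimal sketch of the split described above, assuming plain text and a simple whitespace word count. The function name and the 5,000-word default are illustrative; in practice you would likely split at section boundaries and give each segment its own title.

```python
def split_by_word_count(text: str, max_words: int = 5000) -> list[str]:
    """Split text into consecutive segments of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# A synthetic 15,000-word document yields three 5,000-word segments.
segments = split_by_word_count("word " * 15000, max_words=5000)
print(len(segments))  # 3
```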

Timely Updates

Regularly update documents to ensure accuracy.

  • Bad case: Outdated product specifications.

  • Good case: Regularly updated specs to match the latest version.

Document Quality and Accuracy

  • Ensure documents come from reputable sources.

  • Double-check facts and information to avoid embedding misinformation.

Language and Grammar

  • Use clear, concise, and grammatically correct language to aid AI understanding.

Metadata Utilization

  • Utilize document metadata (author, creation date, tags) for better embedding results.
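
One possible way to keep metadata alongside the text is sketched below. The `DocumentRecord` class and its methods are hypothetical; how (or whether) the Orchestrator platform accepts metadata is not specified here, so treat this only as a way to organize the fields before upload.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentRecord:
    """Hypothetical container pairing document text with its metadata."""
    text: str
    title: str
    author: str
    created: date
    tags: list[str] = field(default_factory=list)

    def embedding_input(self) -> str:
        # Prefix the body with title and tags so they influence the embedding.
        header = f"Title: {self.title}\nTags: {', '.join(self.tags)}"
        return f"{header}\n\n{self.text}"

    def filter_metadata(self) -> dict:
        # Fields kept separately, e.g. for filtering search results.
        return {
            "author": self.author,
            "created": self.created.isoformat(),
            "tags": self.tags,
        }


# Author, date, and tags below are made-up example values.
record = DocumentRecord(
    text="Revenue grew in 2023.",
    title="CompanyX: 2023 Financial Overview",
    author="Finance Team",
    created=date(2024, 1, 15),
    tags=["finance", "annual-report"],
)
print(record.embedding_input())
```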

Embedding Granularity

  • Decide whether to embed the whole document as one vector or to embed paragraphs or sentences separately.
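
The granularity options can be sketched as follows, assuming paragraphs are separated by blank lines. The sentence splitter is deliberately naive (it breaks after ".", "!", or "?" followed by whitespace) and will mis-split abbreviations; both function names are illustrative.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


doc = "First paragraph. It has two sentences.\n\nSecond paragraph."
paragraphs = split_paragraphs(doc)
sentences = split_sentences(paragraphs[0])
print(len(paragraphs), sentences)
```

Smaller units tend to give more precise matches but lose surrounding context, so the choice depends on the kind of questions the documents need to answer.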

Document Preprocessing

  • Use techniques like tokenization, stemming, and lemmatization.

  • Remove stop words to improve embedding quality.
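
The preprocessing steps above can be sketched with the standard library alone. The stop-word list and the suffix-stripping stemmer are toy assumptions for illustration; a real pipeline would use an established NLP library. This kind of preprocessing matters most for count-based representations such as TF-IDF, since many neural embedding models apply their own tokenization.

```python
import re

# Tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "for"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token: str) -> str:
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("The tested documents are indexed")
print(tokens)  # ['test', 'document', 'index']
```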

Embedding Algorithms

  • Choose the appropriate embedding algorithm (e.g., TF-IDF, Word2Vec, FastText, BERT) for your use case.
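
To make the simplest of these options concrete, here is a minimal TF-IDF sketch over pre-tokenized documents. It uses the plain formulation tf = count / document length and idf = log(N / df); libraries often apply smoothing or normalization, so their numbers will differ. The function name is illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, vectors


docs = [["solar", "panel"], ["solar", "battery"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)       # ['battery', 'panel', 'solar']
print(vectors[0])  # 'solar' scores 0.0 because it appears in every document
```

Terms that appear in every document carry no weight here, which is why sparse methods like this reward distinctive vocabulary, while models such as BERT capture meaning from context instead.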

Dimensionality

  • Decide on the dimensions of the embeddings, balancing detail capture and computational expense.
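
A rough way to see the cost side of this trade-off is to estimate raw vector storage. The sketch assumes 32-bit floats (4 bytes per value) and ignores index overhead; the dimension counts are common example sizes, not platform settings.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, excluding any index structures."""
    return num_vectors * dimensions * bytes_per_value


# One million vectors at three example dimensionalities.
for dims in (384, 768, 1536):
    size_gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {size_gb:.2f} GB")
# 384 dims: 1.54 GB
# 768 dims: 3.07 GB
# 1536 dims: 6.14 GB
```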

Document Versioning

  • Maintain document versions to track changes and decide which version to embed.
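
One lightweight way to track which version has been embedded is to store a content hash per document, as sketched below. The `embedded_hashes` mapping and both function names are hypothetical; whitespace is normalized first so that reflowing a document does not count as a change.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the text with runs of whitespace collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_embedding(doc_id: str, text: str, embedded_hashes: dict[str, str]) -> bool:
    """True if the document is new or its content differs from what was embedded."""
    return embedded_hashes.get(doc_id) != content_hash(text)


# Hash recorded when the document was last embedded (example data).
embedded_hashes = {"spec-v1": content_hash("Product Y weighs 2 kg.")}

print(needs_embedding("spec-v1", "Product Y  weighs 2 kg.\n", embedded_hashes))  # False
print(needs_embedding("spec-v1", "Product Y weighs 3 kg.", embedded_hashes))     # True
```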

Handling Multimedia Elements

  • Decide how to handle non-textual elements (images, charts, audio clips). Options include removal, textual descriptions, or separate processing.

Regular Re-embedding

  • Periodically re-embed documents to ensure relevance in the vector space.
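
A re-embedding schedule could be driven by a simple staleness check like the one below. The 90-day maximum age is an arbitrary assumption for illustration, and the timestamps are expected to come from your own document tracking.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_stale(
    last_modified: datetime,
    last_embedded: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=90),  # assumed policy, adjust as needed
) -> bool:
    """True if never embedded, edited since embedding, or embedded too long ago."""
    if last_embedded is None:
        return True
    return last_modified > last_embedded or now - last_embedded > max_age


now = datetime(2024, 6, 1)
print(is_stale(datetime(2024, 5, 1), datetime(2024, 4, 1), now))  # True: edited after embedding
print(is_stale(datetime(2024, 1, 1), datetime(2024, 5, 1), now))  # False: recent and unchanged
```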

Evaluation and Feedback

  • Test embedding results for accuracy and implement feedback mechanisms to improve the process.
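
One common way to test retrieval accuracy is recall@k: the share of known-relevant documents that appear in the top k results for a query. The sketch assumes you have a small set of test queries with hand-labeled relevant document IDs; the IDs below are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found among the top k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


retrieved = ["d3", "d1", "d7", "d2"]  # ranked results for one test query
relevant = {"d1", "d2"}               # hand-labeled ground truth

print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))  # 1.0
```

Tracking this score across a fixed query set before and after a change to documents or settings gives a simple feedback signal.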

Security and Privacy

  • Handle sensitive information carefully; consider redacting it before embedding.

  • Ensure compliance with data privacy regulations.
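
Redaction before embedding can be sketched with regular expressions, as below. The two patterns are deliberately simple assumptions that catch common email and phone formats only; they will miss some cases and may flag other long digit sequences, so they do not replace a proper review of sensitive data.

```python
import re

# Simplified patterns for illustration; real PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


redacted = redact("Contact jane.doe@example.com or +1 555-010-9999.")
print(redacted)  # Contact [EMAIL] or [PHONE].
```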

Handling Special Characters and Symbols

  • Decide whether to retain, replace, or remove special characters and symbols, as some embedding algorithms may not handle them well.
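
A "replace and remove" policy might look like the sketch below: Unicode NFKC normalization folds compatibility characters such as ligatures, a small hand-picked table maps typographic quotes and dashes to ASCII, and invisible control or format characters are dropped. The replacement table is an assumption to adapt to your content.

```python
import unicodedata

# Hand-picked replacements (illustrative): curly quotes, en dash, no-break space.
REPLACEMENTS = {
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
    "\u2013": "-",
    "\u00a0": " ",
}


def normalize_text(text: str) -> str:
    """Fold compatibility characters, map typographic symbols, drop control characters."""
    text = unicodedata.normalize("NFKC", text)
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    # Keep newlines and tabs; drop other control/format characters (category C*).
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


# Curly quotes, an "fi" ligature, a no-break space, and a zero-width space.
cleaned = normalize_text("\u201c\ufb01lter\u201d\u00a0ok\u200b")
print(cleaned)  # "filter" ok
```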

Following these best practices ensures a comprehensive and efficient embedding process, delivering high-quality and relevant vectors in your database.
