📖Best Practices For Preparing Documents

Best Practices Guide for Preparing Documents for Vector Databases on the Orchestrator Platform

Introduction

Vector databases store document embeddings and retrieve documents based on content similarity. For accurate retrieval on the Orchestrator platform, follow these formatting and structuring guidelines.

Note: Documents are converted into vectors (an AI-readable format); the documents themselves are not stored in our systems.

Key Guidelines

Accurate Document Title

Ensure titles clearly reflect the document's core information.

Bad case: "InfoAboutCompany.docx"
Good case: "CompanyX: 2023 Financial Overview"

Avoidance of Prescriptive Statements

Maintain content clarity by avoiding directives or overly casual language.

Bad case: "Hey, this doc's all about Product Y!"
Good case: "Product Y: Features and Specifications"

Logical Document Structure

Organize content with clear separations for ease of parsing and comprehension.

Bad case: Jumbled content without clear separations.
Good case: Defined headings, subheadings, and bullet points.

Layout Structure Caution

Ensure key details are within the main content flow, not in layout elements that may not be parsed correctly.

Bad case: Key details inside a text box.
Good case: Information placed directly within the main content flow.

Table Formatting

Properly format tables for correct data interpretation.

Single Row Headers: Use one header row and avoid multiple header rows.
Avoid Split Tables: If a table spans multiple pages, repeat the header row on each page.
Bad case: A table with three header rows; a table split across pages without repeated headers.
Good case: A table with a single header row; the header repeated on each page if the table is split.

Consistency in Terminology

Use terms consistently for clarity.

Bad case: Interchanging "eco-friendly" and "green tech" randomly.
Good case: Consistent use of "eco-friendly" throughout.
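
A terminology check like this can be partly automated. The sketch below flags concepts for which more than one variant appears in a document. The function name `find_mixed_terms` and the term groups are hypothetical, chosen for illustration, and the plain substring match is an assumption that will also match inside longer words.

```python
def find_mixed_terms(text: str, term_groups: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the concepts for which more than one variant occurs in the text.

    term_groups maps a concept name to the interchangeable variants to look for.
    Matching is a case-insensitive substring test (an illustrative simplification).
    """
    lowered = text.lower()
    mixed = {}
    for concept, variants in term_groups.items():
        used = [variant for variant in variants if variant.lower() in lowered]
        if len(used) > 1:
            mixed[concept] = used
    return mixed


# Hypothetical term group based on the example in this section.
groups = {"sustainability": ["eco-friendly", "green tech"]}
report = find_mixed_terms("Our eco-friendly packaging relies on green tech.", groups)
```

A non-empty result means the document mixes variants and should be revised to use one term.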

Moderate Document Length

Concise documents ensure efficient and accurate analysis.

Bad case: A single 15,000-word document covering an entire topic.
Good case: The same content split into three 5,000-word documents, each with a unique title.
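
One way to produce such segments is to group paragraphs until a word budget is reached, so that no paragraph is cut in half. This is a minimal sketch, assuming paragraphs are separated by blank lines. The name `split_into_segments` and the 5,000-word default are illustrative, not Orchestrator settings.

```python
def split_into_segments(text: str, max_words: int = 5000) -> list[str]:
    """Group blank-line-separated paragraphs into segments of at most max_words words.

    A single paragraph longer than max_words is kept whole in its own segment.
    """
    segments: list[str] = []
    current: list[str] = []
    count = 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        words = len(paragraph.split())
        # Start a new segment when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            segments.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        segments.append("\n\n".join(current))
    return segments
```

Each returned segment would then be saved as its own document under a descriptive title.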

Timely Updates

Regularly update documents to ensure accuracy.

Bad case: Outdated product specifications.
Good case: Regularly updated specs to match the latest version.

Document Quality and Accuracy

Ensure documents come from reputable sources.
Double-check facts and information to avoid embedding misinformation.

Language and Grammar

Use clear, concise, and grammatically correct language to aid AI understanding.

Metadata Utilization

Utilize document metadata (author, creation date, tags) for better embedding results.
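
How the platform consumes metadata is not described here, so the following is only a sketch of one common pattern: keep metadata next to the text, put the title and tags in front of the body before embedding, and keep the remaining fields available for filtering. The `DocumentRecord` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentRecord:
    """A document body together with the metadata named in this guideline."""
    title: str
    body: str
    author: str
    created: date
    tags: list[str] = field(default_factory=list)

    def text_for_embedding(self) -> str:
        # Prefix the body with the title and tags so they influence the vector.
        header = self.title
        if self.tags:
            header += " [" + ", ".join(self.tags) + "]"
        return header + "\n\n" + self.body

    def filter_fields(self) -> dict:
        # Fields that could be stored alongside the vector for later filtering.
        return {
            "author": self.author,
            "created": self.created.isoformat(),
            "tags": list(self.tags),
        }


record = DocumentRecord(
    title="CompanyX: 2023 Financial Overview",
    body="Revenue grew.",
    author="Finance Team",
    created=date(2023, 12, 31),
    tags=["finance", "2023"],
)
```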

Embedding Granularity

Decide whether to embed the whole document as one vector or to embed paragraphs or sentences separately.
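
The three granularities can be compared with a small chunking helper. This is a rough sketch: the function name `chunk` is hypothetical, paragraphs are assumed to be separated by blank lines, and the sentence rule is deliberately naive (it will split after abbreviations such as "e.g.").

```python
import re


def chunk(text: str, level: str = "paragraph") -> list[str]:
    """Split text into units to embed: the whole document, paragraphs, or sentences."""
    if level == "document":
        parts = [text]
    elif level == "paragraph":
        # Paragraphs are separated by one or more blank lines.
        parts = re.split(r"\n\s*\n", text)
    elif level == "sentence":
        # Naive rule: split on whitespace that follows '.', '!' or '?'.
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown level: {level}")
    return [part.strip() for part in parts if part.strip()]


sample = "Product Y is fast. It is small.\n\nIt ships in May."
```

Smaller units give more precise matches but lose surrounding context, so the choice depends on how specific the expected queries are.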

Document Preprocessing

Use techniques like tokenization, stemming, and lemmatization.
Remove stop words to improve embedding quality.
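
The steps above can be sketched with the standard library alone. The stop-word list and the suffix-stripping stemmer below are tiny illustrative stand-ins, not a real stop-word list or the Porter algorithm, and lemmatization is omitted because it needs a dictionary. These steps matter mostly for count-based methods such as TF-IDF. Contextual models such as BERT typically apply their own tokenizer to raw text.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Crude suffix list for demonstration only; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words, and stem what remains."""
    return [stem(token) for token in tokenize(text) if token not in STOP_WORDS]
```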

Embedding Algorithms

Choose the appropriate embedding algorithm (e.g., TF-IDF, Word2Vec, FastText, BERT) for your use case.
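
Of the algorithms listed, TF-IDF is simple enough to sketch without any library. The version below assumes documents are already tokenized, uses term frequency `count / len(doc)` and inverse document frequency `ln(N / df)`, and compares documents by cosine similarity. The function names are illustrative. Production libraries use smoothed variants of this formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    ["solar", "panel", "output"],
    ["solar", "panel", "warranty"],
    ["invoice", "payment", "terms"],
]
vecs = tfidf_vectors(docs)
```

With this unsmoothed formula, a term that occurs in every document gets weight zero, which is why shared vocabulary alone does not make two documents similar.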

Dimensionality

Choose the dimensionality of the embeddings, balancing the detail captured against computational cost.
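
The cost side of this trade-off can be estimated with simple arithmetic: raw vector storage is the number of vectors times the number of dimensions times the bytes per value. The sketch below assumes 32-bit floats (4 bytes) and ignores index overhead. The dimension sizes are common examples, not values prescribed by the platform.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, assuming fixed-width values (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


# Compare a few illustrative dimension sizes for 100,000 vectors.
for dims in (384, 768, 1536):
    size_mb = index_size_bytes(100_000, dims) / 1_000_000
    print(f"{dims:>5} dims: {size_mb:.1f} MB")
```

For 100,000 vectors this comes to 153.6 MB, 307.2 MB, and 614.4 MB respectively, so doubling the dimensionality doubles storage and, roughly, the work per similarity comparison.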

Document Versioning

Maintain document versions to track changes and decide which version to embed.
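
A lightweight way to track which version has been embedded is to record a content hash per document and compare it before embedding again. This is a sketch under assumptions: the in-memory dictionary stands in for whatever store you use, and the function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 fingerprint of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_embedding(doc_id: str, text: str, embedded_hashes: dict[str, str]) -> bool:
    """True if this document is new or its text differs from the embedded version."""
    return embedded_hashes.get(doc_id) != content_hash(text)


def mark_embedded(doc_id: str, text: str, embedded_hashes: dict[str, str]) -> None:
    """Record the version that was just embedded."""
    embedded_hashes[doc_id] = content_hash(text)
```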

Handling Multimedia Elements

Decide how to handle non-textual elements (images, charts, audio clips). Options include removal, textual descriptions, or separate processing.

Regular Re-embedding

Periodically re-embed documents to ensure relevance in the vector space.
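
A re-embedding schedule can be driven by timestamps. The sketch below treats a document as stale when it was modified after its last embedding or when the embedding is older than a chosen age. The 90-day default and the record layout are assumptions for illustration, not platform behavior.

```python
from datetime import datetime, timedelta


def stale_documents(
    records: dict[str, tuple[datetime, datetime]],
    now: datetime,
    max_age: timedelta = timedelta(days=90),
) -> list[str]:
    """Return the IDs of documents that should be re-embedded, in sorted order.

    records maps doc_id -> (last_modified, last_embedded).
    """
    stale = []
    for doc_id, (modified, embedded) in records.items():
        if modified > embedded or now - embedded > max_age:
            stale.append(doc_id)
    return sorted(stale)
```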

Evaluation and Feedback

Test embedding results for accuracy and implement feedback mechanisms to improve the process.
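
One simple accuracy test is recall at k: for a set of test queries with known relevant documents, measure what fraction of those documents appears in the top k results. This is a minimal sketch. The function names and the shape of the inputs are assumptions, and the retrieved lists would come from your own search calls.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    results: dict[str, list[str]],
    judgments: dict[str, set[str]],
    k: int,
) -> float:
    """Average recall at k over all queries that have relevance judgments."""
    scores = [
        recall_at_k(results.get(query, []), relevant, k)
        for query, relevant in judgments.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Tracking this number before and after a change to document formatting gives a concrete signal of whether the change helped.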

Security and Privacy

Handle sensitive information carefully, redacting it before embedding where appropriate.
Ensure compliance with data privacy regulations.
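
Redaction before embedding can be sketched with regular expressions. The patterns below are illustrative assumptions that cover only email addresses and phone-like digit runs. The phone pattern is deliberately broad and will also match other long digit sequences such as dates. Pattern matching alone does not establish regulatory compliance.

```python
import re

# Illustrative patterns only; real deployments need broader, reviewed rules.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```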

Handling Special Characters and Symbols

Decide whether to retain, replace, or remove special characters and symbols, as some embedding algorithms may not handle them well.
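
The three options can be expressed as one function with a policy switch. This is a sketch, not a recommendation: the replacement table is hypothetical, "remove" drops every character in a Unicode symbol category (which includes characters such as "+" and "="), and every policy first applies NFKC normalization and collapses whitespace.

```python
import re
import unicodedata

# Hypothetical mapping from symbols to words; extend it for your own content.
REPLACEMENTS = {"&": "and", "%": "percent", "€": "EUR"}


def normalize_symbols(text: str, policy: str = "replace") -> str:
    """Apply a retain, replace, or remove policy to special characters."""
    # NFKC folds compatibility forms, e.g. the "fi" ligature and full-width digits.
    text = unicodedata.normalize("NFKC", text)
    if policy == "replace":
        for symbol, word in REPLACEMENTS.items():
            text = text.replace(symbol, f" {word} ")
    elif policy == "remove":
        # Unicode categories starting with "S" are symbols (currency, math, other).
        text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("S"))
    elif policy != "retain":
        raise ValueError(f"unknown policy: {policy}")
    # Collapse whitespace runs left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()
```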

Following these best practices helps keep the embedding process efficient and produces high-quality, relevant vectors in your database.


Last updated 6 months ago
