# Best Practices For Preparing Documents

#### Introduction

Vector databases store document embeddings and retrieve documents based on content similarity. For the most accurate retrieval on the Orchestrator platform, follow these formatting and structuring guidelines.

**Note**: Documents are converted into vectors (an AI-readable format); the original documents are not stored in our systems.

#### Key Guidelines

**Accurate Document Title**

Ensure titles clearly reflect the document's core information.

* **Bad case**: "InfoAboutCompany.docx"
* **Good case**: "CompanyX: 2023 Financial Overview"

**Avoidance of Prescriptive Statements**

Maintain content clarity by avoiding directives or overly casual language.

* **Bad case**: "Hey, this doc's all about Product Y!"
* **Good case**: "Product Y: Features and Specifications"

**Logical Document Structure**

Organize content with clear separations for ease of parsing and comprehension.

* **Bad case**: Jumbled content without clear separations.
* **Good case**: Defined headings, subheadings, and bullet points.

**Layout Structure Caution**

Ensure key details are within the main content flow, not in layout elements that may not be parsed correctly.

* **Bad case**: Key details inside a text box.
* **Good case**: Information placed directly within the main content flow.

**Table Formatting**

Properly format tables for correct data interpretation.

* **Single-Row Headers**: Use one header row; avoid multiple header rows.
* **Headers on Split Tables**: If a table spans multiple pages, repeat the header on each page.
* **Bad case**: A table with three header rows; a table split across pages without repeated headers.
* **Good case**: A table with a single header row; the header repeated on each page if the table is split.
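
When a source table already has several header rows, one way to get down to a single row is to merge the rows column by column. The sketch below is only an illustration: the function name `flatten_headers` and the separator are hypothetical, and it assumes merged cells have already been repeated across the columns they span.

```python
def flatten_headers(header_rows: list[list[str]], sep: str = " - ") -> list[str]:
    """Merge several header rows into one row, column by column."""
    flattened = []
    for column in zip(*header_rows):
        parts: list[str] = []
        for cell in column:
            cell = cell.strip()
            # Skip empty cells and consecutive repeats of the same label.
            if cell and (not parts or parts[-1] != cell):
                parts.append(cell)
        flattened.append(sep.join(parts))
    return flattened


# Hypothetical two-row header from a financial table.
rows = [
    ["Revenue", "Revenue", "Costs"],
    ["2022", "2023", "2023"],
]
print(flatten_headers(rows))
# ['Revenue - 2022', 'Revenue - 2023', 'Costs - 2023']
```

Each resulting column name carries its full context, so a row read in isolation is still interpretable.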

**Consistency in Terminology**

Use terms consistently for clarity.

* **Bad case**: Interchanging "eco-friendly" and "green tech" randomly.
* **Good case**: Consistent use of "eco-friendly" throughout.

**Moderate Document Length**

Concise documents ensure efficient and accurate analysis.

* **Bad case**: A single 15,000-word document covering an entire topic.
* **Good case**: The same content split into three 5,000-word documents, each with a unique title.
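
As a rough illustration of the split described above, the following sketch cuts a text into consecutive segments by word count. The function name `split_into_segments` and the 5,000-word default are assumptions taken from the example, not a platform limit.

```python
def split_into_segments(text: str, max_words: int = 5000) -> list[str]:
    """Split text into consecutive segments of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# A synthetic 15,000-word document yields three segments.
document = "word " * 15000
segments = split_into_segments(document)
print(len(segments))  # 3
```

A plain word-count split ignores headings and paragraphs. In practice it is usually better to cut at section boundaries and give each part its own descriptive title.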

**Timely Updates**

Regularly update documents to ensure accuracy.

* **Bad case**: Outdated product specifications.
* **Good case**: Regularly updated specs to match the latest version.

#### Document Quality and Accuracy

* Ensure documents come from reputable sources.
* Double-check facts and information to avoid embedding misinformation.

#### Language and Grammar

* Use clear, concise, and grammatically correct language to aid AI understanding.

#### Metadata Utilization

* Utilize document metadata (author, creation date, tags) for better embedding results.
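
One possible way to keep metadata attached to a document is a small record type. The sketch below is hypothetical: the class `DocumentRecord`, its fields, and the idea of prefixing the text with the title and tags are assumptions for illustration, not a description of how the Orchestrator platform consumes metadata.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentRecord:
    """A document body together with the metadata mentioned above."""
    title: str
    text: str
    author: str
    created: date
    tags: list[str] = field(default_factory=list)

    def embedding_input(self) -> str:
        """Prefix the body with the title and tags so they can influence the vector."""
        header = self.title
        if self.tags:
            header += " [" + ", ".join(self.tags) + "]"
        return header + "\n\n" + self.text


# Illustrative values only.
record = DocumentRecord(
    title="CompanyX: 2023 Financial Overview",
    text="Revenue and cost summary for the year.",
    author="Finance Team",
    created=date(2024, 1, 15),
    tags=["finance", "annual-report"],
)
print(record.embedding_input().splitlines()[0])
# CompanyX: 2023 Financial Overview [finance, annual-report]
```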

#### Embedding Granularity

* Decide whether to embed each document as a single vector or to embed paragraphs or sentences separately. Smaller units give more precise matches; larger units preserve more context.
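
The granularity choice can be made explicit in code. This is a minimal sketch, assuming paragraphs are separated by blank lines; the names `split_paragraphs` and `make_units` are hypothetical.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def make_units(text: str, granularity: str = "paragraph") -> list[str]:
    """Return the text units that would each be embedded as one vector."""
    if granularity == "document":
        return [text.strip()]
    if granularity == "paragraph":
        return split_paragraphs(text)
    raise ValueError(f"unknown granularity: {granularity!r}")


sample = "Product Y overview.\n\nFeature list.\n\n\nSpecifications."
print(len(make_units(sample, "document")))   # 1
print(len(make_units(sample, "paragraph")))  # 3
```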

#### Document Preprocessing

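* Apply techniques such as tokenization, stemming, and lemmatization where the embedding method benefits from them.
* Removing stop words can improve sparse, term-based representations such as TF-IDF; contextual models such as BERT are generally given the original, unmodified text.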
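
A minimal sketch of tokenization and stop-word removal using only the standard library follows. The stop-word list is a tiny illustrative sample, not a complete one, and stemming and lemmatization are left out because they normally come from an NLP library.

```python
import re

# Illustrative sample only; real stop-word lists are much longer.
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = remove_stop_words(tokenize("The features of Product Y are listed in the manual."))
print(tokens)
# ['features', 'product', 'y', 'listed', 'manual']
```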

#### Embedding Algorithms

* Choose the appropriate embedding algorithm (e.g., TF-IDF, Word2Vec, FastText, BERT) for your use case.
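
To make the simplest of these options concrete, here is a small pure-Python TF-IDF sketch with cosine similarity. It uses one common smoothed IDF variant, and the function names are hypothetical. It is meant to show the idea, not to replace a library implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute a sparse TF-IDF vector (term -> weight) for each tokenized document."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


corpus = [
    ["product", "y", "features"],
    ["product", "y", "pricing"],
    ["company", "financial", "overview"],
]
vecs = tfidf_vectors(corpus)
# Documents sharing terms score higher than documents with no overlap.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

TF-IDF only matches shared terms. Models such as Word2Vec, FastText, and BERT produce dense vectors that can also relate different words with similar meaning.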

#### Dimensionality

* Choose the dimensionality of the embeddings, balancing the detail captured against storage and computational cost.
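
The cost side of this trade-off can be estimated with simple arithmetic. The sketch below assumes each value is stored as a 4-byte float and ignores any index overhead; the dimension counts and the 100,000-vector corpus are example figures only.

```python
def index_size_bytes(n_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage needed for the vectors alone, without index overhead."""
    return n_vectors * dimensions * bytes_per_value


for dims in (384, 768, 1536):
    size_mb = index_size_bytes(100_000, dims) / 1_000_000
    print(f"{dims} dims: {size_mb:.1f} MB")
# 384 dims: 153.6 MB
# 768 dims: 307.2 MB
# 1536 dims: 614.4 MB
```

Raw storage grows linearly with the number of dimensions, and so does the cost of each similarity comparison.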

#### Document Versioning

* Maintain document versions to track changes and decide which version to embed.

#### Handling Multimedia Elements

* Decide how to handle non-textual elements (images, charts, audio clips). Options include removal, textual descriptions, or separate processing.
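
For Markdown sources, the "textual descriptions" option could look like the sketch below, which replaces each image with its alt text and drops images that have none. The function name and the `[Image: ...]` placeholder format are assumptions.

```python
import re

# Matches Markdown images of the form ![alt text](target).
IMAGE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")


def replace_images(markdown: str) -> str:
    """Replace images with a textual placeholder built from their alt text."""
    def describe(match: re.Match) -> str:
        alt = match.group(1).strip()
        # Images without alt text carry no recoverable meaning, so remove them.
        return f"[Image: {alt}]" if alt else ""
    return IMAGE.sub(describe, markdown)


print(replace_images("See ![Q3 revenue chart](chart.png) for details."))
# See [Image: Q3 revenue chart] for details.
```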

#### Regular Re-embedding

* Periodically re-embed documents to ensure relevance in the vector space.
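
One way to avoid re-embedding unchanged documents is to compare content hashes. This is a sketch under the assumption that a hash was stored for each document when it was last embedded; `stored_hashes` and the function names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembedding(doc_id: str, text: str, stored_hashes: dict[str, str]) -> bool:
    """True if the document is new or its text changed since the last embedding."""
    return stored_hashes.get(doc_id) != content_hash(text)


# Hypothetical state saved after the previous embedding run.
stored_hashes = {"product-y-specs": content_hash("Version 1 specifications")}

print(needs_reembedding("product-y-specs", "Version 1 specifications", stored_hashes))  # False
print(needs_reembedding("product-y-specs", "Version 2 specifications", stored_hashes))  # True
```

A hash check only detects changed text. Switching to a different embedding model would still require re-embedding everything.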

#### Evaluation and Feedback

* Test embedding results for accuracy and implement feedback mechanisms to improve the process.
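
A simple accuracy check is recall at k: the fraction of documents known to be relevant that appear in the top k results. The sketch below assumes you have a small set of test queries with hand-labeled relevant document IDs; the IDs shown are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found among the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranked results for one test query.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))  # 1.0
```

Tracking this score across a fixed query set before and after a change to document preparation shows whether the change helped.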

#### Security and Privacy

* Handle sensitive information carefully; consider redacting it before embedding.
* Ensure compliance with data privacy regulations.
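
As an illustration of redaction before embedding, the sketch below masks email addresses and phone-like numbers with regular expressions. The patterns are deliberately simple assumptions and will miss many formats, so treat this as a starting point rather than a compliance tool.

```python
import re

# Illustrative patterns only; they are not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +1 (555) 010-9999."))
# Contact [EMAIL] or [PHONE].
```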

#### Handling Special Characters and Symbols

* Decide whether to retain, replace, or remove special characters and symbols, as some embedding algorithms may not handle them well.
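
If you choose to replace or remove problematic characters, Unicode normalization is a reasonable first step. The sketch below uses the standard library's NFKC normalization and then drops control and format characters. Whether this is the right policy depends on your content, and the function name is hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then remove control and format characters."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        # Keep newlines and tabs; drop other characters in Unicode categories C*.
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )


# A ligature (U+FB01) and a zero-width space (U+200B) are cleaned up.
print(normalize_text("\ufb01le\u200bname"))  # filename
```

NFKC also converts compatibility characters such as non-breaking spaces and full-width letters to their plain equivalents, which changes the text. Skip it if exact symbols matter in your documents.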

Following these best practices helps keep the embedding process efficient and produces high-quality, relevant vectors in your database.
