Best Practices for Data Preparation for Training an AI Chatbot

Best Practices 2024: How to Prepare Training Data for an AI Chatbot?

To achieve high accuracy with an AI chatbot, it's important to follow data preparation guidelines and continually improve the process. Preparing data for AI can seem daunting, but understanding its main aspects makes the work manageable. Well-prepared data helps the AI chatbot communicate with customers in a human-like way and resolve their queries. For more detailed guidance, refer to our comprehensive guide on the topic:

👀 Human Readable Data

The data should be readable by humans: if a person can follow it easily, the chatbot can understand and process it more reliably. Unstructured, chaotic data hinders the chatbot's learning process.

🌐 “Website” Data Source Recommendations

First, ensure your webpages are publicly accessible and crawlable by bots so that data can be extracted. This allows the AI chatbot to collect relevant information from the pages and give users accurate, up-to-date answers and recommendations.
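
One part of crawlability can be checked by hand: whether your site's robots.txt allows bots to fetch a page. The sketch below uses Python's standard `urllib.robotparser` for this. The robots.txt content and the example.com URLs are made up for illustration, and the function name `is_crawlable` is our own. It does not cover other obstacles such as login walls, and it is not how the Ribbo AI crawler itself works.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, page_url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt content; in practice, copy it from your own site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/faq", "https://example.com/private/notes"):
        status = "crawlable" if is_crawlable(EXAMPLE_ROBOTS, url) else "blocked"
        print(f"{url}: {status}")
```

If a page you want the chatbot to learn from shows up as blocked, adjust the matching `Disallow` rule before adding the page as a data source.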

Page Crawlability Checker

Check Your Pages' Crawlability

Second, pay attention to webpage titles. If you create a new page specifically for training the Ribbo AI chatbot, give it a clear, intuitive title.
For example, suppose you create a new webpage that collects all the questions and answers about your products or services. Name the page "Frequently asked questions and answers for [product name]" rather than "Bot training data," so the AI bot better understands the context of the page.

Third, select only the most relevant webpages and blog posts to maintain the quality of the AI bot's responses. Do NOT include all your blog posts unless you know that most of the information is needed to answer user questions. We recommend picking the most relevant webpages and blog posts manually.

πŸ“ Plain Text Data Source Recommendations

  1. The name of your data source matters a lot. Name each data source so that it reflects its content.
  2. Organize plain text data sources by context. For example, create separate data sources for pricing options and service offerings instead of combining them into one document.
Plain Text Data Source Example
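
If your content currently lives in one combined document, a small script can split it into separately named data sources. This is a minimal sketch that assumes each topic starts with a heading line beginning with `# `. That marker, the function name `split_by_heading`, and the sample text are all illustrative assumptions, not a format the chatbot requires.

```python
def split_by_heading(text: str, marker: str = "# ") -> dict[str, str]:
    """Split text into {heading: body}, using lines that start with marker.

    Any text that appears before the first heading is ignored.
    """
    sources: dict[str, str] = {}
    current = None
    body: list[str] = []
    for line in text.splitlines():
        if line.startswith(marker):
            if current is not None:
                sources[current] = "\n".join(body).strip()
            current = line[len(marker):].strip()
            body = []
        elif current is not None:
            body.append(line)
    if current is not None:
        sources[current] = "\n".join(body).strip()
    return sources


# Made-up sample content that mixes two contexts in one document.
COMBINED = """\
# Pricing options
Basic plan: 10 USD per month.

# Service offerings
We offer setup and onboarding support.
"""

if __name__ == "__main__":
    for name, content in split_by_heading(COMBINED).items():
        print(f"Data source name: {name}")
        print(content)
        print()
```

Each heading becomes the data source name, which also satisfies the naming recommendation above: the name describes the content.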

πŸ“ Document File Dara Source Recommendations

  1. While you can upload any text document, it's beneficial if the document has some structure. If your document is unstructured and covers all aspects of your company, it may be less effective for the AI chatbot. If your document is only one or two pages, there's no need to split it into smaller documents. However, in general, it's advisable to separate documents based on the context of the information they contain.
  2. Note that cross-referencing between documents may not work reliably, especially with less powerful AI models. For instance, if you have a spreadsheet with product names and SKUs and a separate document with descriptions of each SKU, it could be challenging for the AI bot to connect the two. It is therefore sensible to keep as much data as possible in one place for each item: for example, create a single spreadsheet with the product name, SKU attributes, price, etc. This doesn't mean that cross-referencing between multiple documents won't work, but it may not always succeed with less powerful models like GPT-3.5. With GPT-4, it should work in most cases.
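
The "one place per item" advice can be applied with a short script before uploading. The sketch below joins a product sheet and a separate description sheet on the SKU, using Python's standard `csv` module. The column names (`name`, `sku`, `price`, `description`), the function name `merge_by_sku`, and the sample rows are hypothetical, so adapt them to your own files.

```python
import csv
import io


def merge_by_sku(products_csv: str, descriptions_csv: str) -> list[dict[str, str]]:
    """Attach each SKU's description to its product row, one record per item."""
    descriptions = {
        row["sku"]: row["description"]
        for row in csv.DictReader(io.StringIO(descriptions_csv))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(products_csv)):
        # Products without a matching description get an empty string.
        merged.append({**row, "description": descriptions.get(row["sku"], "")})
    return merged


# Made-up sample data standing in for two separate documents.
PRODUCTS = "name,sku,price\nBlue Mug,MUG-001,12.00\nRed Mug,MUG-002,12.50\n"
DESCRIPTIONS = "sku,description\nMUG-001,Ceramic mug with a 350 ml capacity\n"

if __name__ == "__main__":
    for record in merge_by_sku(PRODUCTS, DESCRIPTIONS):
        print(record)
```

The merged records can then be written back to a single spreadsheet, so the AI bot never has to connect a SKU in one document to a description in another.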

β“πŸ—£οΈQ&A Data Source Best Recommendations:

This is the most powerful data source type when used correctly. It's crucial to phrase the question as a user would, for example: "What are the pricing options?" The answer should be comprehensive and contain all the necessary data. While our AI bot may make minor edits, the context and information will remain the same.

Good example:

Q&A Text Data Source Example

Additional Recommendations

To maximize the usefulness of your data sources, keep them updated as your website changes. For instance, if you modify information on your website, you'll need to retrain the affected webpages. This can be done quickly by clicking the 'Retrain' button next to the webpage address in your data source.
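
If you manage many pages, it can help to track which ones have changed since the last training run. One possible approach, sketched below, is to store a hash of each page's text and compare it with the current text later. The function names, URLs, and page texts are illustrative assumptions, and this script only tells you which pages to retrain. The retraining itself still happens with the 'Retrain' button.

```python
import hashlib


def content_fingerprint(page_text: str) -> str:
    """Hash the page text, ignoring differences in whitespace."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pages_needing_retrain(stored: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return URLs whose current text no longer matches the stored fingerprint.

    stored maps URL -> fingerprint saved at the last training run.
    current maps URL -> page text as it is now. New URLs count as changed.
    """
    return sorted(
        url for url, text in current.items()
        if stored.get(url) != content_fingerprint(text)
    )


if __name__ == "__main__":
    # Made-up example: the pricing page changed, the FAQ page did not.
    stored = {
        "https://example.com/pricing": content_fingerprint("Basic plan: 10 USD"),
        "https://example.com/faq": content_fingerprint("We ship worldwide."),
    }
    current = {
        "https://example.com/pricing": "Basic plan: 12 USD",
        "https://example.com/faq": "We ship   worldwide.\n",
    }
    print(pages_needing_retrain(stored, current))
```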

If you have a Q&A data source and an answer changes, edit the existing question rather than creating a new one. Alternatively, delete the old question and create a new one. Either way, you avoid having two data sources with the same question but different answers, which could confuse the AI model.
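
Before uploading a large Q&A set, you can check it for this kind of conflict. The sketch below groups questions after lowercasing them and stripping trailing punctuation, then reports any question that has more than one distinct answer. The normalization rule, the function names, and the sample pairs are our own assumptions, and questions that are merely reworded will not be caught.

```python
from collections import defaultdict


def normalize(question: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(question.lower().split()).rstrip("?!. ")


def find_conflicts(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Return {normalized question: sorted answers} where answers disagree."""
    answers: dict[str, set[str]] = defaultdict(set)
    for question, answer in pairs:
        answers[normalize(question)].add(answer.strip())
    return {q: sorted(a) for q, a in answers.items() if len(a) > 1}


if __name__ == "__main__":
    # Made-up Q&A pairs: the pricing question appears twice with different answers.
    qa_pairs = [
        ("What are the pricing options?", "Plans start at 10 USD."),
        ("what are the pricing options", "Plans start at 12 USD."),
        ("Do you offer support?", "Yes."),
    ]
    for question, conflicting in find_conflicts(qa_pairs).items():
        print(f"Conflicting answers for '{question}': {conflicting}")
```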

If you have any questions or need help, don't hesitate to send us a message in the live chat or on social media, and we'll be glad to answer all your questions.