James Jarvis 16 December, 2025

Common AI Implementation Mistakes to Avoid

Part 2: Misconfigured Datasets

AI is rapidly being deployed in internal and external applications across all sectors, often handling vast arrays of data from across an organisation. Whilst this offers exciting new capabilities, it also widens the potential attack surface of a company’s infrastructure. AI infrastructure can accidentally expose sensitive data to unauthenticated users if datasets are not properly configured.

The ‘mistakes’ highlighted within this mini-series are based on findings from real tests, showing that these are not just hypothetical scenarios, but real situations with real implications.

Data security is paramount, but many of the issues raised below are not unique to AI; in most cases, these are standard vulnerabilities. Their occurrence has been exacerbated by poor AI/chatbot implementations where rigorous security practices may have been overlooked.

Misconfigured Datasets

Well-configured roles are irrelevant if the datasets themselves are misconfigured. A common control is to ensure users only have access to certain directories. Whilst this is an effective way to prevent unauthorised access, it does not prevent access to files which have been miscategorised and placed into the wrong directories.
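As an illustration, the sketch below shows a minimal directory-scoping check of the kind described above. The root path, file paths, and function name are hypothetical assumptions, not a specific product’s API; the point is that even a correct check like this cannot catch a sensitive file that has been saved inside the approved directory by mistake.

```python
from pathlib import Path

# Hypothetical example: the chatbot may only read files under this root.
# The directory path is an assumption for illustration only.
APPROVED_ROOT = Path("/data/chatbot/faq").resolve()

def is_readable_by_chatbot(requested: str) -> bool:
    """Allow a file only if its resolved path sits inside the approved root.

    This blocks path traversal and access to sibling directories, but it
    cannot detect a sensitive file that has been miscategorised and saved
    *inside* the approved root by mistake.
    """
    path = Path(requested).resolve()
    return path == APPROVED_ROOT or APPROVED_ROOT in path.parents

print(is_readable_by_chatbot("/data/chatbot/faq/returns-policy.txt"))        # True
print(is_readable_by_chatbot("/data/chatbot/faq/../finance/accounts.xlsx"))  # False
```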

The ‘wrong directory’ issue is a surprisingly common occurrence and fundamentally one of human error. If a strong process is not in place when deciding what data is accessible to a chatbot or other AI capability, it can lead to a breach of confidentiality. I have previously seen a test where a chatbot that should have been nothing more than an enhanced FAQ bot had access to the company’s financial history, allowing the tester to view the company accounts and expenses. It became clear that the chatbot’s folder sat inside a C-Suite employee’s directory, possibly the CISO’s, and horizontal directory access had not been blocked.

Whilst this could be considered an outlier, it is clear that such breaches can happen. Leakage of internal documents, customer PII, and even IP can occur if the datasets provided to the chatbot contain this information. It is not sufficient to rely on ‘filters’ to ‘stop’ the chatbot discussing undesirable content: as has been repeatedly shown, chatbot filters can be bypassed. The best defence against a chatbot breaching confidentiality is to prevent it from having access to the data in the first place, especially when it does not need the documents for its business use case.
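One way to apply this principle is to minimise data at ingestion time rather than filter at response time: only documents explicitly approved for the chatbot ever enter its corpus. The sketch below assumes each document carries a classification label; the labels, file paths, and structures are illustrative placeholders rather than any particular product’s API.

```python
# Data minimisation at ingestion time: a document that never reaches the
# chatbot's corpus cannot be leaked, regardless of how output filters behave.
ALLOWED_CLASSIFICATIONS = {"public", "faq"}   # assumed labels for illustration

documents = [
    {"path": "faq/shipping.md", "classification": "faq"},
    {"path": "finance/2024-accounts.xlsx", "classification": "confidential"},
]

def select_for_chatbot(docs: list[dict]) -> list[dict]:
    """Keep only documents whose classification is explicitly approved."""
    return [d for d in docs if d["classification"] in ALLOWED_CLASSIFICATIONS]

chatbot_corpus = select_for_chatbot(documents)
print([d["path"] for d in chatbot_corpus])  # ['faq/shipping.md']
```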

It should be noted that this issue is not new to AI. However, the use of AI could allow a malicious actor to regularly check for misconfigured files, and it also makes accidental discovery more likely, particularly where parent directories are misconfigured.

As alluded to above, it is imperative that datasets are configured accurately, with robust measures in place to ensure that a file cannot be ‘accidentally’ added to an AI function’s accessible dataset. The dataset itself should be editable by only a limited number of users, to help reduce the risk of accidental data exposure.
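A simple way to enforce this is to treat the dataset’s manifest as a controlled artefact: only a short list of approved editors may change it, and every change is logged. The editor names and manifest structure below are assumptions for illustration, not a specific tool.

```python
import datetime

# Assumed editor identities for illustration; in practice these would map to
# groups or roles in your identity provider.
APPROVED_EDITORS = {"data-governance-team", "ai-platform-owner"}

manifest = {"files": ["faq/shipping.md", "faq/returns.md"], "audit_log": []}

def add_to_dataset(manifest: dict, editor: str, file_path: str) -> None:
    """Add a file to the chatbot's dataset only for approved editors, with an audit trail."""
    if editor not in APPROVED_EDITORS:
        raise PermissionError(f"{editor} is not authorised to edit this dataset")
    manifest["files"].append(file_path)
    manifest["audit_log"].append({
        "editor": editor,
        "file": file_path,
        "added_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

add_to_dataset(manifest, "ai-platform-owner", "faq/warranty.md")             # permitted
# add_to_dataset(manifest, "unknown-user", "finance/accounts.xlsx")          # raises PermissionError
```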

The Key Takeaways:

  • Human error in dataset configuration can expose sensitive files to chatbots, creating major confidentiality risks.
  • AI magnifies small mistakes. Miscategorised files or poorly set directory permissions can lead to leaks of internal documents, customer PII, or IP.
  • Limit who can edit AI datasets, enforce strict configuration processes, and never rely on chatbot filters as the sole safeguard.
  • One misplaced file can lead to severe consequences. Robust measures can help ensure that AI only shares approved data, protecting the confidentiality of users and the company.

Part 3...

Improve your security

Our experienced team will identify and address your most critical information security concerns.