Table of Contents:
- Training Dataset and Methodology Overview
- Inference at Scale
- Data Security and Privacy
- Understanding the Limitations of the AI Model
Training Dataset and Methodology Overview
ProcessUnity Global Risk Exchange Framework Mapper Dataset
The starting dataset contains more than a quarter of a million in-house cybersecurity text pairs labelled by our security analysts, who have decades of experience in the field. The dataset is curated as question-to-passage pairs, in which the passage contains the information relevant to the question. Each pair is weighted by relevance, because many individual passages can be relevant to the same question. As of early 2025, no standardized, human-labelled, publicly available dataset contains cybersecurity text similarity pairs at this volume. ProcessUnity’s dataset is the largest in existence.
Extended Framework Pairs
The Framework Mapper dataset is further expanded by using the Exchange standard questionnaire, which is based on cyber controls, as a stepping stone from one framework control to another.
By relating two controls from different frameworks through the Exchange questionnaire, we can extend these mappings to cover framework-to-framework pairs, not just framework-to-questionnaire pairs.
The resulting unified dataset has the following characteristics:
- Upwards of 40 million pairs
- All 40 million pairs are topic-clustered through the Exchange questionnaire
- Each of the 40 million pairs is weighted by the minimum of the weights of its two underlying mappings, where a mapping to the Exchange questionnaire is either primary (1.0) or supporting (0.5).
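The pairing and weighting described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the framework names, control IDs, question IDs, and the `extend_pairs` helper are all hypothetical, and the sketch assumes that each pair's weight is the minimum of the two control-to-questionnaire weights.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mappings: (framework, control_id, exchange_question_id, weight).
# Weight is 1.0 for a primary mapping and 0.5 for a supporting mapping.
MAPPINGS = [
    ("FW_A", "A-1", "Q-101", 1.0),
    ("FW_B", "B-4", "Q-101", 0.5),
    ("FW_C", "C-2", "Q-101", 1.0),
    ("FW_B", "B-9", "Q-205", 1.0),
    ("FW_A", "A-3", "Q-205", 1.0),
]


def extend_pairs(mappings):
    """Relate controls from different frameworks that share an Exchange question."""
    by_question = defaultdict(list)
    for framework, control, question, weight in mappings:
        by_question[question].append((framework, control, weight))

    pairs = []
    for question, controls in by_question.items():
        for (fw_a, ctl_a, w_a), (fw_b, ctl_b, w_b) in combinations(controls, 2):
            if fw_a == fw_b:
                continue  # keep only cross-framework pairs
            pairs.append({
                "left": (fw_a, ctl_a),
                "right": (fw_b, ctl_b),
                "topic": question,          # the shared question acts as the topic cluster
                "weight": min(w_a, w_b),    # assumed: minimum of the two mapping weights
            })
    return pairs


if __name__ == "__main__":
    for pair in extend_pairs(MAPPINGS):
        print(pair)
```

In this sketch, a pair built from one primary and one supporting mapping receives a weight of 0.5, while a pair built from two primary mappings keeps a weight of 1.0.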
This amount of data enables us to efficiently aggregate, filter, fine-tune, and iterate on models capable of retrieving sections of text from documents.
Cybersecurity Relevant Document Test Dataset
Assessment Autofill is tested on real-world cybersecurity data to verify that it identifies the documents and text passages we would expect to see in practice.
Hard Negatives
One training technique pairs similar passages in the training set with different questions. The question and passage in such a pair are not relevant to each other (that is, they form a negative pair), but the passage uses language similar to that of a relevant passage. Because similar passages appear in both positive and negative examples, the model is forced to learn distinctions in context, language, and topic rather than simply identifying related words.
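As a rough illustration of the idea, the sketch below mines hard negatives from a small list of positive pairs. It is not the actual training code: the example questions and passages are invented, the `mine_hard_negatives` helper is hypothetical, and token-level Jaccard similarity is used only as a simple stand-in for whatever similarity measure a real pipeline would apply.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings (a simple stand-in similarity)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mine_hard_negatives(positives, min_similarity=0.3):
    """Pair each question with passages that look like its own passage
    but belong to a different question."""
    known_positives = set(positives)
    negatives = []
    for question, passage in positives:
        for other_question, other_passage in positives:
            if other_question == question:
                continue
            if (question, other_passage) in known_positives:
                continue  # never label a known relevant pair as negative
            if jaccard(passage, other_passage) >= min_similarity:
                negatives.append((question, other_passage))
    return negatives


# Invented examples for illustration only.
POSITIVES = [
    ("How are user passwords stored?",
     "Passwords are hashed with a salted algorithm before storage."),
    ("How are backups stored?",
     "Backups are encrypted with a managed key before storage."),
    ("Is there a visitor log?",
     "Visitors sign in at the front desk."),
]

if __name__ == "__main__":
    for question, passage in mine_hard_negatives(POSITIVES):
        print(f"NEGATIVE: {question!r} <-> {passage!r}")
```

The password and backup passages share much of their wording, so each becomes a hard negative for the other's question, while the unrelated visitor-log passage is not selected.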
Inference at Scale
No Performance Hardware Dependency
Our proprietary relevancy model is lightweight at a few hundred megabytes. The model is compiled with the inference code, so it is as portable as the packaged code itself. The model requires only CPU hardware rather than specialized hardware that is more expensive, more carbon-intensive, harder to provision, and less flexible, while still performing at a high level. This allows Assessment Autofill to achieve much higher and more consistent uptime than it could if it depended on specialized hardware.
No Queueing Mechanism
Performance hardware requires a queueing mechanism because the number of concurrent requests is limited by the amount of onboard hardware memory. We are not constrained by queued requests that delay customer results. Being independent of a queueing mechanism makes inference horizontally scalable while remaining cost-effective. In our tests using real-world documents, we find that 500 documents can be processed in under two minutes.
AWS Infrastructure
Assessment Autofill is hosted in AWS cloud infrastructure for scalability and managed services, further reducing our development time and exposure to security vulnerabilities.
Dynamically Scalable
The inference pipeline scales to support 5,000 simultaneous requests per minute, each containing roughly 500 documents. Because the documents are processed in parallel, the average runtime of one minute remains unchanged as the number of documents and requests increases.
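The parallel, per-document processing described here can be pictured as a simple fan-out. The sketch below is an assumption-laden illustration rather than the deployed architecture: `score_document` is a hypothetical placeholder for per-document inference, and the local thread pool only stands in for independent, stateless CPU workers that would be scaled out horizontally in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def score_document(document: str) -> int:
    """Hypothetical placeholder for per-document inference: counts words."""
    return len(document.split())


def process_request(documents, max_workers=8):
    """Fan a request's documents out to workers; each document is independent."""
    # Results come back in the same order as the input documents.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_document, documents))


if __name__ == "__main__":
    print(process_request(["access control policy", "backup schedule", ""]))
```

Because no document depends on the result of another, adding workers increases throughput without changing the work done per document, which is the property the paragraph above relies on.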
Multi-Region Deployment
The necessary components are also light enough to be deployed easily in multiple regions through AWS. We maintain stateless components that require only CPUs and storage and can be initialized across many AWS regions to support our customers’ needs.
Data Security and Privacy
Data Security
The data uploaded with each request to Assessment Autofill is encrypted in transit and at rest. After each run, the documents are removed from the Assessment Autofill processing pipeline but remain on the Exchange platform under the same high level of security that applies without Assessment Autofill. This means the documents are stored outside of the pipeline and kept as secure as any other document uploaded to the platform.
AWS Services
Before any document content is processed in an AWS service, it is scrubbed of sensitive information and sectioned into a handful of short paragraphs, and only the paragraphs relevant to the given question are retained. Additionally, the data used within an AWS service is not retained for training or improving the AWS product.
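The scrub, section, and retain steps can be sketched in a few lines. Everything in this example is an assumption made for illustration: the two regular expressions cover only email addresses and IPv4 addresses, paragraphs are split on blank lines, and a simple shared-word count stands in for the proprietary relevancy model. The function names are hypothetical.

```python
import re

# Illustrative patterns only; real scrubbing would cover far more categories.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    return IPV4.sub("[IP]", EMAIL.sub("[EMAIL]", text))


def section(text: str) -> list[str]:
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def retain_relevant(question: str, paragraphs: list[str], min_overlap: int = 2) -> list[str]:
    """Keep paragraphs sharing at least `min_overlap` words with the question
    (a stand-in for model-based relevance scoring)."""
    question_terms = set(re.findall(r"[a-z]+", question.lower()))
    kept = []
    for paragraph in paragraphs:
        paragraph_terms = set(re.findall(r"[a-z]+", paragraph.lower()))
        if len(question_terms & paragraph_terms) >= min_overlap:
            kept.append(paragraph)
    return kept


def prepare(question: str, document: str) -> list[str]:
    """Scrub first, then section, then keep only the relevant paragraphs."""
    return retain_relevant(question, section(scrub(document)))


if __name__ == "__main__":
    doc = (
        "Contact admin@example.com for access.\n\n"
        "Backups are encrypted at rest on host 10.0.0.5.\n\n"
        "Visitors sign in at the front desk."
    )
    print(prepare("Are backups encrypted at rest?", doc))
```

Scrubbing runs before sectioning and filtering in this sketch, so no unscrubbed text reaches the later steps.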
Additional Support
Have additional questions regarding Assessment Autofill? Please reach out to your account representative to get connected with our product team for additional support.
Understanding the Limitations of the AI Model
While the Assessment Autofill AI tool is a powerful assistant for document analysis and content extraction, it’s important to clarify what it does not do to ensure users engage with it effectively and responsibly.
What the AI Tool Cannot Do
- Company Relevance Assessment
  - The AI tool cannot determine whether a document pertains to your company or another organization.
  - It assumes that any uploaded document is relevant to your organization. Users must ensure that only applicable documents are provided.
- Document Date Awareness
  - The AI tool does not evaluate the date of a document to prioritize newer over older content.
  - It treats all documents equally unless the content itself indicates relevance. Users must upload the most current and relevant materials.
- Document Weighting
  - The tool does not differentiate between document types (e.g., SOC reports vs. internal policies).
  - All documents are evaluated based on content relevance to the query, not their source or perceived importance.
- Technology Inference
  - The AI tool does not infer the purpose or relevance of technologies mentioned in a document.
  - For example, if password management tools are listed, the AI will not automatically consider them relevant to a question about password management unless the document explicitly provides context linking the tools to that use case.
Best Practices for Users
- Curate your uploads: Ensure documents are accurate, relevant, and up to date.
- Avoid assumptions: Don’t expect the tool to “know” what matters most; guide it with quality inputs.
- Add explanatory context: If a document includes technologies or tools, make sure it explains their relevance to the topic at hand.
- Be intentional with document selection: Choose documents that directly address the question or topic, rather than relying on implied relevance.