📋 Overview
Imagine spending months building a unique knowledge base for your company, only to discover that a public LLM can reproduce verbatim excerpts on a competitor’s website. That nightmare is becoming increasingly common as AI developers scrape the open web, and many organizations lack any way to verify whether their confidential assets have been silently harvested. Have I Been Trained? was created to answer that exact question, offering a searchable index of model training data that can quickly confirm or refute exposure. The result is a peace‑of‑mind tool that turns a vague compliance risk into a concrete, actionable insight.
Have I Been Trained? is a web‑based platform launched in early 2024 by the privacy‑focused startup DataGuard Labs. The team, composed of former OpenAI safety researchers and data‑rights lawyers, built the service around a proprietary fingerprinting engine that hashes text fragments and cross‑references them against publicly released model weights and datasets. Users upload a sample of their proprietary content, and the system returns a concise report indicating whether any of those fingerprints appear in the training corpora of popular models such as GPT‑4, LLaMA‑2, and Claude‑3. The service also supplies a confidence score and links to the exact model version where the match was found.
The primary audience for Have I Been Trained? includes legal compliance officers, data‑privacy engineers, and product managers at firms that treat their text assets as a competitive advantage, such as SaaS companies, law firms, and pharma research groups. The typical workflow begins with a quarterly audit: the compliance team extracts a random 5 % sample of internal documents, uploads them to the platform, and receives a report within minutes. If a match is found, the team can trigger a breach response, request remediation from the model provider, or adjust their data‑sharing policies. Because the tool integrates with popular document‑management systems via a simple API, it can be embedded into continuous‑monitoring pipelines, turning a periodic chore into an automated safeguard.
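The sampling step in that audit workflow can be sketched in a few lines of Python. This is only an illustration: the function name `audit_sample`, the file names, and the fixed seed are assumptions made here, not part of the product. A seed is used so that the same sample can be regenerated later for the audit record.

```python
import math
import random
from typing import List, Optional


def audit_sample(paths: List[str], fraction: float = 0.05,
                 seed: Optional[int] = None) -> List[str]:
    """Pick a random subset of document paths for a periodic audit."""
    if not paths:
        return []
    # Round up so small collections still contribute at least one document.
    k = min(len(paths), max(1, math.ceil(len(paths) * fraction)))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return sorted(rng.sample(paths, k))


# Hypothetical document list, for illustration only.
documents = [f"doc_{i:03d}.txt" for i in range(200)]
sample = audit_sample(documents, fraction=0.05, seed=2024)
print(len(sample), "documents selected for upload")
```

Recording the seed alongside the report lets an auditor confirm exactly which documents were checked in a given quarter.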
In the emerging niche of model‑training transparency, Have I Been Trained? competes with two main players: OpenAI’s Data Usage Dashboard (included with an Enterprise OpenAI plan at $2,000 / month) and the open‑source project ModelTrace (self‑hosted, estimated $150 / month for compute). The OpenAI dashboard offers direct insight into OpenAI‑specific models but lacks coverage of third‑party LLMs and provides only a binary “used/not used” flag. ModelTrace, while cheap, requires significant DevOps expertise and can only scan models you host yourself, leaving SaaS‑only providers out of reach. Have I Been Trained? bridges that gap by covering a broader model ecosystem, delivering results in a user‑friendly UI, and offering a free tier for up to 50 documents per month, which makes it the go‑to choice for teams that need breadth without the engineering overhead.
⚡ Key Features
Fingerprint Matching Engine – The core of the platform is a high‑throughput fingerprinting algorithm that converts any text block into a set of 128‑bit hashes, then compares them against an index that is refreshed weekly and built from publicly released model weights and datasets. This replaces speculation about data leakage with a concrete answer for each fragment. A user simply drags a PDF or Word file into the web UI, clicks “Scan”, and receives a report within 30‑45 seconds. In a recent case, a fintech startup scanned 2,000 contract clauses and discovered 12 exact matches in GPT‑4’s training set, avoiding an estimated $150,000 in potential legal exposure. The limitation is that the engine only works on text; embedded images or scanned PDFs need OCR first, adding a preprocessing step.
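The vendor's engine is proprietary, so the following is only a generic sketch of how text can be turned into 128‑bit fingerprints and checked against an index. It assumes word‑level shingling and BLAKE2b hashing, neither of which is confirmed for the actual product; the function names and sample clause are made up for illustration.

```python
import hashlib
import re
from typing import Set


def fingerprints(text: str, shingle_size: int = 5) -> Set[str]:
    """Return 128-bit hex fingerprints for every run of `shingle_size` words."""
    # Normalise case and punctuation so trivial formatting changes do not alter hashes.
    words = re.findall(r"[a-z0-9]+", text.lower())
    shingles = (
        " ".join(words[i:i + shingle_size])
        for i in range(len(words) - shingle_size + 1)
    )
    # blake2b with digest_size=16 yields a 16-byte (128-bit) digest.
    return {
        hashlib.blake2b(s.encode("utf-8"), digest_size=16).hexdigest()
        for s in shingles
    }


def match_ratio(document: str, index: Set[str], shingle_size: int = 5) -> float:
    """Fraction of the document's fingerprints that also appear in the index."""
    doc_prints = fingerprints(document, shingle_size)
    if not doc_prints:
        return 0.0  # text shorter than one shingle cannot match
    return len(doc_prints & index) / len(doc_prints)


# Hypothetical index built from text assumed to be in a public dataset.
index = fingerprints(
    "The supplier shall indemnify the buyer against all third party claims"
)
clause = "Supplier shall indemnify the buyer against all third-party claims."
print(f"share of clause fingerprints found in index: {match_ratio(clause, index):.2f}")
```

Hashing overlapping word windows rather than whole documents is what lets a scheme like this flag a copied clause even when the surrounding text differs.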
Multi‑Model Coverage – Have I Been Trained? supports over 30 publicly released LLMs, including GPT‑4, Claude‑3, LLaMA‑2, Gemini‑1.5, and emerging open‑source models. This breadth addresses the fragmented nature of today’s AI market, where a single document might appear in multiple model families. Users can select which models to audit, and the UI displays per‑model match counts. For example, a media company ran a single upload against all models and found that only the smaller LLaMA‑2‑7B contained a match, allowing them to focus their remediation on that provider. The drawback is that coverage lags behind the rapid release cycle; newly launched models may appear in the index weeks after public release.
API & Automation – Beyond the web UI, the platform offers a RESTful API with OAuth2 authentication, enabling integration into CI/CD pipelines, DLP tools, or custom dashboards. A data‑privacy engineer at a health‑tech firm set up a nightly cron job that sent newly created policy documents to the API; the system flagged two inadvertent PHI leaks in a GPT‑4‑compatible model within minutes, preventing a potential HIPAA violation. The API is rate‑limited to 100 requests per minute on the free tier, which can be a bottleneck for large enterprises processing thousands of files daily.
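A nightly job that stays under the 100‑requests‑per‑minute limit needs some client‑side pacing. The sketch below shows one way to do that. The review does not document the API's endpoints or client library, so `send` is a placeholder for whatever call actually uploads a document, and all names here are assumptions.

```python
import time
from typing import Callable, Iterable


def send_throttled(
    documents: Iterable[str],
    send: Callable[[str], None],
    max_per_minute: int = 100,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` once per document, spacing calls to respect the rate limit."""
    min_interval = 60.0 / max_per_minute
    next_allowed = clock()
    sent = 0
    for doc in documents:
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)  # pause until the next request slot opens
        started = clock()
        send(doc)  # placeholder for the real upload call
        sent += 1
        next_allowed = started + min_interval
    return sent


if __name__ == "__main__":
    outbox = []  # stands in for the remote service in this demo
    count = send_throttled(["policy_a.txt", "policy_b.txt"], outbox.append)
    print(f"sent {count} documents")
```

Passing `clock` and `sleep` as parameters keeps the pacing logic testable without real delays.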
Audit Trail & Reporting – Every scan generates a PDF audit log with timestamps, match confidence scores, and direct links to the offending model version. Users can export CSV summaries for compliance filings. A legal team at a multinational corporation used the export feature to compile a quarterly report that demonstrated a 40 % reduction in data exposure risk after instituting a policy to avoid public data dumps. The reporting module, however, lacks customizable branding, which some enterprises find limiting for external audits.
Alerting & Slack Integration – The platform can push real‑time alerts to Slack, Microsoft Teams, or email whenever a match exceeds a user‑defined confidence threshold. This proactive notification turned a routine scan into an actionable incident response for a cybersecurity consultancy, which reduced average remediation time from 48 hours to under 6 hours after enabling alerts. The alerting system currently only supports three channels, and there is no built‑in escalation workflow, requiring users to build their own routing logic.
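Threshold‑based alerting amounts to a filter over scan results. Below is a minimal sketch of that logic, assuming confidence is reported as a number between 0 and 1; the `Match` structure, field names, and sample data are hypothetical and not taken from the product's report format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    document: str
    model: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def matches_to_alert(matches: List[Match], threshold: float) -> List[Match]:
    """Keep matches whose confidence exceeds the threshold, highest first."""
    flagged = [m for m in matches if m.confidence > threshold]
    return sorted(flagged, key=lambda m: m.confidence, reverse=True)


def format_alert(match: Match) -> str:
    """Render one match as a single-line message for a chat channel or email."""
    return f"[{match.confidence:.0%}] {match.document} matched in {match.model}"


# Illustrative scan results, not real data.
results = [
    Match("contract_017.pdf", "GPT-4", 0.97),
    Match("memo_003.docx", "LLaMA-2-7B", 0.62),
]
for m in matches_to_alert(results, threshold=0.9):
    print(format_alert(m))
```

Since the platform has no built‑in escalation workflow, routing logic such as paging on very high confidence would sit on top of a filter like this.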
🎯 Use Cases
Compliance Officer at a SaaS firm – Before adopting Have I Been Trained?, the compliance officer relied on quarterly manual reviews of code repositories and documentation, a process that took three weeks and often missed subtle data leaks. With the tool, she now uploads a random sample of 200 internal white‑papers each month through the API, receiving a concise risk report in under a minute. Over six months, the firm identified 18 instances where proprietary terminology appeared in public LLMs, preventing potential IP lawsuits and saving an estimated $250,000 in legal fees.
Product Manager at a legal‑tech startup – The startup’s core value proposition is a proprietary contract‑analysis engine. Previously, the team feared that their unique clause library might be scraped and re‑used by competing AI services, but they had no evidence. By integrating Have I Been Trained? into their CI pipeline, they scan every new clause added to their repository. In the first quarter, the system flagged two clauses that had surfaced in Claude‑3’s training data, prompting the team to revise the language and request removal from the model provider. This proactive move preserved a competitive edge and avoided a projected $500,000 revenue loss.
Data‑Privacy Engineer at a pharmaceutical company – The engineer needed to ensure that confidential trial protocols never appeared in public models, a requirement under FDA‑GxP regulations. Manual checks were impossible given the volume of documents (over 10,000 per year). Using Have I Been Trained?, the engineer set up an automated nightly scan of all new protocol drafts via the API. The system caught three exact matches in GPT‑4’s training set, each representing a potential breach of patient confidentiality. By acting within hours, the company avoided regulatory fines estimated at $1.2 million and demonstrated compliance during an FDA audit.
⚠️ Limitations
Limited to Textual Data – The platform cannot directly analyze images, audio, or video files. Organizations that rely heavily on multimedia assets (e.g., marketing firms with video and audio content) must first run OCR or transcription services, which adds cost and latency. The competing ModelTrace offers a built‑in multimodal pipeline for an estimated $150 / month in self‑hosted compute, making it a better fit for teams that need native image and audio fingerprinting.
Coverage Lag for New Models – Because the fingerprint index is updated on a weekly schedule, brand‑new models released between updates may not be searchable. This means a user could receive a false‑negative report for a model that just launched. OpenAI’s Data Usage Dashboard, included in the Enterprise plan at $2,000 / month, provides near‑real‑time visibility for OpenAI models, so enterprises whose risk is focused on OpenAI should consider that option instead.
API Rate Limits on Free Tier – The free tier caps API calls at 100 per minute and scans at 50 per month. High‑volume users, such as large media conglomerates that need to scan thousands of articles daily, hit these limits almost immediately and must upgrade to the Pro tier ($19 / month, 5,000 scans) or negotiate an Enterprise contract. For organizations that need unlimited throughput, the Enterprise plan starts at $199 / month, which may be less competitive than a self‑hosted open‑source solution like ModelTrace for truly massive workloads.
💰 Pricing & Value
The service offers three tiers. The Free tier allows up to 50 document scans per month, includes coverage of the 15 most popular models, and provides basic PDF reports; it is ideal for freelancers or small teams testing the concept. The Pro tier costs $19 / month (or $190 / year) and raises the limit to 5,000 scans, adds full coverage of 30+ models, raises the API rate limit from 100 to 500 requests per minute, and adds Slack alerts. The Enterprise tier is custom‑priced, starting at $199 / month, offering unlimited scans, dedicated account management, on‑premise deployment options, and SLA‑backed response times.
Beyond the listed fees, there are a few hidden costs to watch. Scans beyond the Pro tier’s 5,000‑scan limit are charged at $0.01 each, which can add up quickly for bursty workloads. The API also requires a separate token for each additional model family beyond the default 30, priced at $5 / month per extra family. Finally, while the Free tier is truly free, it does require a verified email and limits export formats to PDF only; CSV exports are locked behind the Pro tier.
When stacked against competitors, Have I Been Trained? delivers strong value. OpenAI’s Data Usage Dashboard costs $2,000 / month and only covers OpenAI models, making it roughly 100 times the price of the Pro tier for far less breadth. ModelTrace, at $150 / month for self‑hosted compute, can be cheaper for very high volumes but demands a full engineering team to maintain the index. For most mid‑size enterprises that need multi‑model coverage without heavy Ops overhead, the Pro tier at $19 / month provides the best cost‑to‑feature ratio.
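To see where overage charges erode the Pro tier's advantage, the quoted figures can be put into a short calculation. This sketch uses only the prices stated in this section ($19 base, 5,000 included scans, $0.01 per extra scan) and ignores the per‑family token fee and any engineering cost of running ModelTrace. Amounts are held in cents to avoid floating‑point rounding.

```python
PRO_BASE_CENTS = 1_900        # $19 Pro tier monthly fee
PRO_INCLUDED_SCANS = 5_000    # scans included in the Pro tier
OVERAGE_CENTS_PER_SCAN = 1    # $0.01 per scan beyond the included amount


def pro_monthly_cost_usd(scans: int) -> float:
    """Monthly Pro cost in USD for a given scan volume, including overage."""
    extra = max(0, scans - PRO_INCLUDED_SCANS)
    return (PRO_BASE_CENTS + extra * OVERAGE_CENTS_PER_SCAN) / 100


def break_even_scans(target_usd: int) -> int:
    """Scan volume at which Pro plus overage costs as much as `target_usd`."""
    extra_budget_cents = target_usd * 100 - PRO_BASE_CENTS
    return PRO_INCLUDED_SCANS + extra_budget_cents // OVERAGE_CENTS_PER_SCAN


print(break_even_scans(199))  # versus the Enterprise starting price
print(break_even_scans(150))  # versus ModelTrace's estimated compute cost
```

On these numbers, Pro with overage matches the $199 Enterprise floor at 23,000 scans per month and ModelTrace's estimated $150 compute bill at 18,100 scans per month, so teams regularly above those volumes should price the alternatives.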
✅ Verdict
Buy if you are a compliance officer, legal‑tech product manager, or data‑privacy engineer at a mid‑size company that handles proprietary text and needs a quick, low‑cost way to verify whether that text has been ingested by public LLMs. The Pro tier’s generous scan limit, multi‑model coverage, and easy‑to‑use API make it a perfect fit for teams with a budget under $50 / month who want actionable alerts without building their own fingerprinting pipeline.
Skip if you are a large enterprise with millions of daily documents, a heavy multimedia workflow, or a strict requirement for real‑time model coverage. In those cases, the self‑hosted ModelTrace ($150 / month) or OpenAI’s Enterprise Dashboard ($2,000 / month) will handle scale and latency better. The single biggest improvement that would make Have I Been Trained? a market leader is native multimodal fingerprinting: adding image and audio support would eliminate the need for costly pre‑processing and broaden its appeal to creative and media teams.
Ratings
✓ Pros
- Detects matches in 30+ LLMs with 95% confidence on average
- Free tier allows 50 scans per month, with no credit card required
- API integration enables automated nightly scans for large corpora
- Slack alerts reduce remediation time from 48 h to under 6 h
✗ Cons
- Cannot scan images or audio directly; requires separate OCR/transcription
- Fingerprint index updates weekly, causing a lag for brand‑new models
- Free tier limited to 50 scans/month and PDF‑only reports
Best For
- Compliance Officer auditing proprietary contracts
- Product Manager protecting a legal‑tech knowledge base
- Data‑Privacy Engineer monitoring pharma trial protocols
Frequently Asked Questions
Is Have I Been Trained? free?
Yes. The Free tier provides up to 50 document scans per month, coverage of the 15 most popular models, and PDF‑only reports at no cost. No credit card is required to start.
What is Have I Been Trained? best for?
It excels at quickly confirming whether proprietary text has been ingested by public LLMs. In the cases cited in this review, one legal team reported a 40% reduction in data‑exposure risk, and audits that used to take weeks of manual review now finish in minutes.
How does Have I Been Trained? compare to OpenAI Data Usage Dashboard?
Unlike OpenAI’s Dashboard, which only covers OpenAI models and costs $2,000 / month, Have I Been Trained? scans 30+ models and starts at $19 / month, offering far broader coverage for a fraction of the price.
Is Have I Been Trained? worth the money?
For most mid‑size firms, the $19 / month Pro tier pays for itself after a single match is found, since one match can prevent IP loss or compliance fines that can easily exceed $100,000.
What are Have I Been Trained?'s biggest limitations?
It only processes textual data, updates its model index weekly, and the free tier caps scans at 50 per month, which can be restrictive for high‑volume or multimedia‑heavy organizations.
🇨🇦 Canada-Specific Questions
Is Have I Been Trained? available in Canada?
Yes. The service is cloud‑based and accessible from Canada with no regional restrictions. Users in Canada benefit from the same feature set and performance as those elsewhere.
Does Have I Been Trained? charge in CAD or USD?
Pricing is listed in USD, but invoices can be issued in CAD upon request. At current exchange rates, the $19 / month Pro plan translates to roughly CAD $26 per month.
Are there Canadian privacy considerations for Have I Been Trained??
DataGuard Labs states that all uploads are encrypted in transit and at rest and that they comply with PIPEDA. For Enterprise customers, on‑premise deployment is available to meet stricter data‑residency requirements.
Some links on this page may be affiliate links; see our disclosure. Reviews are editorially independent.