By Estelle Ruellan[1], threat intelligence researcher at threat exposure management (TEM) company Flare[2].
Cybercriminals persistently target critical infrastructure, disrupting key lifeline services and staging high-pressure attacks designed to push companies into paying large ransoms.
Such was the case when advanced persistent threat (APT) groups like Volt Typhoon, APT41, and Salt Typhoon leveraged legitimate account credentials[3] to conduct long-term intrusions, moving laterally across multiple U.S. state government networks.
In collaboration with Flare, Verizon found stolen credentials were involved in 88% of basic web application attack breaches[4], making them not only the most common initial attack vector but also, frequently, the only one.
In 2024 and 2025, there has been a surge[5] in infostealer and credential marketplace activity, and security teams are struggling with alert fatigue. Most organizations can’t afford analysts spending hours every day trawling through Telegram, forums, and paste sites. If a model helps filter the noise, it gives human teams breathing room.
Our latest research[6] shows that GPT-powered models can scan hundreds of daily posts on underground forums like XSS, Exploit.in[7], and RAMP, detecting stolen credentials and mapping live malware campaigns with 96% accuracy.
With the right prompts and navigation, LLMs can detect emerging breaches, identify compromised credentials, and surface novel exploits. When properly directed, these models can take on the heavy lifting of cyber threat intelligence (CTI), handling the foundational work of intelligence gathering and basic analysis, so security analysts can dedicate their expertise to complex investigations and strategic threat assessments that demand human judgment and deeper insight.
However, the takeaway here isn’t “LLMs will solve CTI.” They are hyper-fast execution engines that require detailed human instruction, not seasoned analysts who understand business risk and context.
Security analysts must understand the tool’s blind spot: LLMs need humans to dissect every element, provide domain knowledge, map decision-making steps, and supply contextual understanding. When properly instructed with this comprehensive guidance, they can execute tasks at incredible speed, but they remain fundamentally blind without human strategic oversight.
Let’s look at where LLMs succeed in CTI, and what their limitations are, so teams can use them safely.
Where LLMs Add Real Value in CTI
Security teams are drowning in noise. Microsoft Defender for Endpoint has seen a significant increase in the number of indicators of attack (IOAs), with a 79% growth[8] from January 2020 to today. Many of these alerts will be false positives, such as flagging logins from unusual geographies, devices, or IPs when employees are on business travel or working from new cafes.
LLMs can chip away at the overload. In our study[9], GPT-3.5 parsed hundreds of daily forum posts, pulling out details like stolen credentials, malware variants, and targeted sectors. For an analyst, that means minutes instead of hours spent sifting through chatter.
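To make that concrete, here is a minimal sketch of what such an extraction pass can look like. It is illustrative rather than the study’s actual pipeline: the model name, the JSON fields, and the use of the OpenAI Python client are all assumptions.

```python
import json
from openai import OpenAI  # assumes the official openai package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXTRACTION_PROMPT = """You are a cyber threat intelligence analyst.
Extract the following fields from the forum post and answer ONLY with JSON:
{"stolen_credentials": true/false, "malware_family": string or null,
 "targeted_sector": string or null, "summary": one sentence}"""

def extract_fields(post_text: str) -> dict:
    """Ask the model for structured CTI fields from a single forum post."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # a GPT-3.5-class model, as in the study
        temperature=0,           # deterministic output for triage
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": post_text},
        ],
    )
    # Production code would validate or repair the JSON before trusting it.
    return json.loads(resp.choices[0].message.content)

# Fabricated example post, purely illustrative
print(extract_fields("Selling 40k fresh RedLine logs, mostly EU banking targets."))
```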
LLMs are also potent for breach and leak monitoring. The use of valid account credentials and the exploitation of public-facing applications were tied as the top initial access vectors observed in 2024, each representing 30% of X-Force incident response engagements[10].
Having LLMs summarize cybercrime forum conversations and flag when credentials or other sensitive data appear to be leaked or traded can catch exposures before they hit production systems. In our study, the model highlighted mentions of compromised companies or products and surfaced potential breaches or exploits under discussion, giving analysts early awareness of emerging threats without hours of manual review.
Moreover, threat actors rarely stay in one lane: they might sell infostealers on Telegram, work with initial access brokers (IABs) who package that access and list it on forums, and, in another channel, advertise phishing kits to weaponize the stolen credentials. Each stage looks like a separate conversation if you only see one channel, but they are pieces of the same campaign pipeline.
LLMs are uniquely good at pattern recognition across disjointed conversations. Done right and with the right context, they could stitch fragments together into early warning signals, giving analysts a clearer picture of emerging campaigns.
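One simple way to picture that stitching step: group posts from different channels by shared artifacts (a malware family, an access listing, a victim name) before handing the clusters to a model or an analyst. The sketch below is a purely illustrative grouping pass over fabricated posts, not Flare’s correlation logic.

```python
from collections import defaultdict

# Fabricated fragments from different channels, for illustration only
posts = [
    {"channel": "telegram", "text": "Fresh RedLine logs for sale, EU banks", "indicators": {"redline"}},
    {"channel": "forum",    "text": "Phishing kit tuned for RedLine log format", "indicators": {"redline"}},
    {"channel": "forum",    "text": "IAB listing: VPN access to a mid-size utility", "indicators": {"utility-access"}},
]

def cluster_by_indicator(posts):
    """Group posts that mention the same indicator, regardless of channel."""
    clusters = defaultdict(list)
    for post in posts:
        for indicator in post["indicators"]:
            clusters[indicator].append(post)
    return clusters

# Flag indicators that show up in more than one channel: a possible campaign pipeline
for indicator, related in cluster_by_indicator(posts).items():
    channels = {p["channel"] for p in related}
    if len(related) > 1 and len(channels) > 1:
        print(f"Possible campaign around '{indicator}' spanning {sorted(channels)}")
```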
Blind Spots and Risks of Overreliance
While LLMs show potential in minimizing false positives, these tools are not immune to them. Our team noted that GPT-3.5 struggled with something as basic as verb tense, confusing an ongoing breach with one that had already ended. Two lessons follow: prompt engineering (how you craft your prompt) matters enormously, and high accuracy in controlled studies does not guarantee the same results in live, variable scenarios.
LLMs can fabricate connections or misclassify chatter when context is thin. In practice, that means a model might confidently link stolen credentials to the wrong sector, sending analysts down rabbit holes and wasting valuable time. According to Gartner, 66% of senior enterprise risk executives[11] noted AI-assisted misinformation as a top threat in 2024.
Cost and scale matter too. Running models across thousands of daily posts isn’t free. If teams lean too hard on closed-source LLMs without evaluating cost-performance trade-offs, they risk creating yet another tool that looks great in a proof of concept but doesn’t survive budget cycles.
Projects like LLaMA 3, Mistral, and Falcon are catching up to closed models in language understanding. Fine-tuning or training them on your own CTI datasets can be cheaper in the long term, with more control over model updates and security. The trade-off is that you need in-house expertise to manage training and guardrails.
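For a sense of what the open-source route looks like in practice, a self-hosted open-weight model can be slotted into the same triage role with a few lines of code. The sketch below uses the Hugging Face transformers pipeline; the model ID is an assumption you would swap for whatever your team has vetted, and a 7B-parameter model realistically needs a GPU or a quantized build.

```python
from transformers import pipeline  # assumes the Hugging Face transformers package

# Illustrative model ID; substitute the open-weight model your team has approved.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
)

prompt = (
    "You are a CTI analyst. Classify the following forum post. "
    "Answer with one word, LEAK or NOT_LEAK.\n\n"
    "Post: Selling corporate VPN credentials, 3 energy companies, price in DM."
)

# Greedy decoding keeps the triage output short and repeatable
result = generator(prompt, max_new_tokens=10, do_sample=False)
print(result[0]["generated_text"])
```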
What CISOs Should Demand
CISOs already know the only way to stay ahead of automated attacks is to automate defenses. Some 79% of senior executives[12] say their companies are adopting AI agents to strengthen security. The key is knowing how to use them without adding new risks.
A model with 96% accuracy is impressive, but it still gets roughly one call in twenty-five wrong. And, as we mentioned earlier, it can still raise false positives or link stolen credentials to the wrong sector. That’s why all AI triage must be overseen and verified by an analyst, ensuring errors don’t slip into executive briefings or trigger costly overreactions.
These tools only work if they are steered with precision. Prompt engineering is critical: context, down to the smallest detail and the tense used, affects LLM performance. In one case, a discussion about purchasing data in Israel, titled “Buy GOV access,” was mislabeled as not targeting critical infrastructure, when in fact it was, because that title wasn’t part of the prompt. CISOs and security teams using these models must ground outputs by supplying the critical context the model would otherwise miss.
Moreover, variables like “is targeting a large organization” or “critical infrastructure” were interpreted inconsistently by the model, since there was no shared definition. It flagged globally known names accurately but missed sector-specific or less famous entities. When prompting an LLM, don’t rely on the model’s definitions; set your own. If you don’t set the rules, the model will invent its own. When using subjective or loosely defined labels, security teams should embed definitions or examples within prompts, such as: “Critical infrastructure encompasses essential systems and facilities such as energy, oil and gas, transportation, water supply, telecommunications, internet providers, military, government, ports, and airports.”
Some best practices, illustrated in the prompt sketch after this list, include:
- Define the LLM’s role and provide an explicit output structure
- Align verb tense to context (“has sold” vs. “is selling”)
- Always include relevant context (e.g., thread titles or summaries of the previous conversation)
- Provide clear definitions or decision rules for subjective categories
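Taken together, those practices map directly onto how the prompt is assembled. The sketch below is one way to encode them in code, with an explicit role, an output structure, tense guidance, thread-title context, and a pinned definition of critical infrastructure; the wording and schema are illustrative, not the prompts used in our study.

```python
SYSTEM_PROMPT = """You are a cyber threat intelligence analyst triaging forum posts.
Respond ONLY with JSON: {"targets_critical_infrastructure": true/false,
"activity_status": "ongoing" or "completed", "rationale": "one sentence"}.

Pay attention to verb tense: "is selling" means ongoing activity,
"has sold" or "was sold" means completed activity.

Definition: critical infrastructure includes energy, oil and gas, transportation,
water supply, telecommunications, internet providers, military, government,
ports, and airports. Apply this definition, not your own."""

def build_messages(thread_title: str, post_text: str) -> list[dict]:
    """Always pass the thread title alongside the post so context isn't lost."""
    user_content = f"Thread title: {thread_title}\n\nPost: {post_text}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

# The title alone can flip the verdict, which is exactly the failure mode above:
messages = build_messages(
    "Buy GOV access",
    "Looking to purchase data related to organizations in Israel.",
)
```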
Finally, CISOs should demand clear ROI benchmarks before betting big on tools that could become shelfware. Closed-source models deliver strong results, but open-source alternatives are catching up.
LLMs are not perfect, but when tied tightly to structured prompts, contextual data, and clear analyst-defined rules, they can amplify defense strategies. They should not be treated as black-box oracles. They can sift vast volumes of dark-web chatter and hand analysts a distilled starting point. The key is not expecting them to make judgment calls on risk but designing the workflow so that they enrich human decision-making instead of replacing it.
Read next: Who Really Owns OpenAI? The Billion-Dollar Breakdown[13]
References
- ^ Estelle Ruellan (www.linkedin.com)
- ^ Flare (flare.io)
- ^ legitimate account credentials (www.cyber.nj.gov)
- ^ 88% of basic web application attack breaches (flare.io)
- ^ surge (www.infosecurity-magazine.com)
- ^ research (arxiv.org)
- ^ Exploit.in (exploit.in)
- ^ a 79% growth (cdn-dynmedia-1.microsoft.com)
- ^ study (arxiv.org)
- ^ 30% of X-Force incident response engagements (www.ibm.com)
- ^ 66% of senior enterprise risk executives (www.gartner.com)
- ^ 79% of senior executives (www.pwc.com)
- ^ Who Really Owns OpenAI? The Billion-Dollar Breakdown (www.digitalinformationworld.com)