Stellaris-ModelScope: Interpretability Infrastructure for AI

Raising $45,000 (active)

An open interpretability platform that enables researchers to inspect model internals, analyze latent representations, and detect hallucination or deceptive behavior in open-weight language models.

Project Details

I am building Stellaris-ModelScope, a mechanistic interpretability platform designed to make internal analysis of large language models accessible to independent researchers, safety teams, and smaller labs without frontier-scale infrastructure.

Today, most interpretability research happens inside a small number of well-funded organizations because inspecting transformer internals at scale requires significant engineering effort. My primary objective is to lower that barrier by providing a unified software platform for model introspection.

I chose a business-intelligence (BI) style architecture for the platform rather than traditional research tooling, because organizations tend to get more out of large datasets with interactive BI software than with static notebooks. With this design, researchers can interactively slice, aggregate, compare, and visualize model behavior across prompts, tasks, and model families.
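To make the slice-and-aggregate idea concrete, here is a minimal sketch in plain Python. It is not the platform's actual API: the record fields (`model_family`, `task`, `probe_score`, `hallucinated`), the values, and the `aggregate` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-prompt records; every field name and value is made up.
records = [
    {"model_family": "family_a", "task": "qa", "probe_score": 0.82, "hallucinated": False},
    {"model_family": "family_a", "task": "qa", "probe_score": 0.31, "hallucinated": True},
    {"model_family": "family_a", "task": "summarization", "probe_score": 0.64, "hallucinated": False},
    {"model_family": "family_b", "task": "qa", "probe_score": 0.27, "hallucinated": True},
    {"model_family": "family_b", "task": "qa", "probe_score": 0.75, "hallucinated": False},
    {"model_family": "family_b", "task": "summarization", "probe_score": 0.40, "hallucinated": True},
]


def aggregate(rows, dimensions):
    """Group rows by the chosen dimensions and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row)

    summary = {}
    for key, members in groups.items():
        summary[key] = {
            "n": len(members),
            "mean_probe_score": mean(r["probe_score"] for r in members),
            "hallucination_rate": mean(1.0 if r["hallucinated"] else 0.0 for r in members),
        }
    return summary


# The same records viewed at two levels of granularity.
by_family_and_task = aggregate(records, ["model_family", "task"])
by_task = aggregate(records, ["task"])

for key, stats in sorted(by_family_and_task.items()):
    print(key, stats)
```

Changing the `dimensions` list re-slices the same data without rerunning any model, which is the kind of interaction a static notebook makes cumbersome.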

I have used the platform in my own research on open-weight models, where it helped me identify a linear epistemic-state signal in transformer residual streams that generalizes across multiple architectures and helps predict model confidence and hallucination risk.

The concrete outputs of this project are:

A public and accessible research platform for interpretability experiments
Reusable safety evaluation workflows
Governance tooling for hallucination and deception detection

Theory of Impact

Advanced AI systems may become dangerous not only because they become more capable, but because humans lose visibility into why models make decisions, when they are uncertain, and when they are behaving deceptively.

I believe one major bottleneck in AI safety is the lack of scalable interpretability infrastructure available outside frontier labs. Today, only a small number of organizations can deeply inspect model internals, limiting the amount of safety research the broader ecosystem can perform.

Stellaris-ModelScope reduces this bottleneck by democratizing mechanistic interpretability tooling.

If more researchers can inspect latent representations, activation pathways, confidence signals, and internal model geometry, we increase our ability to:

detect hallucination before deployment
identify deceptive or misaligned reasoning
measure uncertainty more reliably
develop intervention methods for safer model behavior

My research suggests that important safety-relevant signals, such as epistemic support, may already be linearly encoded inside transformer residual streams. If these internal signals can be reliably measured and acted upon, they could become a scalable oversight mechanism for advanced AI systems.
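As an illustration of what "linearly encoded" means here, the sketch below fits a difference-of-means linear probe on synthetic vectors. It is an assumption-laden toy, not the method or data behind the finding above: the "activations" are random numbers with a planted direction, and every function name is hypothetical.

```python
import random


def make_synthetic_activations(n, dim, signal, rng):
    """Synthetic stand-in for residual-stream vectors (not real model data).

    Label-1 vectors are shifted by +signal along a hidden unit direction,
    label-0 vectors by -signal, on top of unit Gaussian noise.
    """
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = sum(x * x for x in direction) ** 0.5
    direction = [x / norm for x in direction]
    vectors, labels = [], []
    for i in range(n):
        label = i % 2
        shift = signal if label == 1 else -signal
        vectors.append([rng.gauss(0, 1) + shift * d for d in direction])
        labels.append(label)
    return vectors, labels


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(len(vectors[0]))]


def fit_diff_of_means_probe(vectors, labels):
    """Linear probe: w is the difference of class means, b centers the midpoint."""
    mu_pos = mean_vector([v for v, y in zip(vectors, labels) if y == 1])
    mu_neg = mean_vector([v for v, y in zip(vectors, labels) if y == 0])
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def probe_score(w, b, vec):
    """Signed score; positive means the probe predicts label 1."""
    return sum(wi * xi for wi, xi in zip(w, vec)) + b


def accuracy(w, b, vectors, labels):
    """Fraction of vectors whose predicted label matches the true label."""
    correct = sum((probe_score(w, b, v) > 0) == (y == 1) for v, y in zip(vectors, labels))
    return correct / len(labels)


rng = random.Random(0)
vectors, labels = make_synthetic_activations(n=400, dim=16, signal=3.0, rng=rng)
train_x, train_y = vectors[:300], labels[:300]
test_x, test_y = vectors[300:], labels[300:]

w, b = fit_diff_of_means_probe(train_x, train_y)
test_accuracy = accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {test_accuracy:.2f}")
```

If a property is linearly encoded, a single direction `w` like this separates the two classes on held-out examples. With real activations, the labels would have to come from an independent source of ground truth, and high probe accuracy alone would not show that the model actually uses the signal.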

People

Kaossara Osseni

Team Member

Grants Received: no grants recorded

Funding Asks

grantmaking.ai Launch Round

Applied

Minimum

$45,000

Ideal

$80,000

How the money will be spent

Minimum funding details:

$20,000 -> Compute resources: GPU inference, activation extraction, and interpretability experiments on larger open-weight models.

$5,000 -> Infrastructure / hosting: cloud hosting, storage, deployment, and observability stack.

$20,000 -> Research stipend / founder support / misc. research costs: allows me to devote significantly more time to full-time safety research and platform development.

Ideal funding details:

Contractor engineering support
Open-source release acceleration
Marketing
