Core Concepts

IntelliRag is a code intelligence platform that uses Retrieval-Augmented Generation (RAG) to make large codebases intelligible to developers and AI coding assistants. This page covers the foundational concepts you will encounter when working with the platform.

All data in IntelliRag follows a strict hierarchy:

Tenant > Workspace > Repository > Branch

Each level inherits the isolation boundary of its parent. A tenant cannot access another tenant’s data at any level of the hierarchy.

A tenant is the top-level isolation boundary. Each organization is a tenant. All data - symbols, graphs, embeddings, configuration - is isolated by tenant_id across every datastore. There are no cross-tenant operations.

A workspace is a logical grouping of repositories. Use workspaces to organize by team, product, or microservice group. A tenant can have multiple workspaces, and each workspace can contain multiple repositories.

A repository represents a Git repository connected to IntelliRag. Repositories are identified by their remote URL, which must be unique per tenant. You must create a repository in the dashboard before the indexer can process it: the indexer only looks up existing repositories and never creates them automatically.
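The uniqueness and lookup-only rules can be illustrated with a small in-memory registry. This is a minimal sketch, not IntelliRag's implementation: the class and method names (`RepositoryRegistry`, `create`, `lookup`, `RepositoryNotRegistered`) are hypothetical, and keying by `(tenant_id, remote_url)` is one reading of "unique per tenant".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Repository:
    tenant_id: str
    workspace_id: str
    remote_url: str


class RepositoryNotRegistered(LookupError):
    """Raised when a lookup targets a repository that was never created."""


@dataclass
class RepositoryRegistry:
    # Assumption: uniqueness is scoped to the tenant, so the key is
    # (tenant_id, remote_url) rather than remote_url alone.
    _repos: dict[tuple[str, str], Repository] = field(default_factory=dict)

    def create(self, tenant_id: str, workspace_id: str, remote_url: str) -> Repository:
        """Stands in for creating the repository in the dashboard."""
        key = (tenant_id, remote_url)
        if key in self._repos:
            raise ValueError(f"{remote_url} is already registered for tenant {tenant_id}")
        repo = Repository(tenant_id, workspace_id, remote_url)
        self._repos[key] = repo
        return repo

    def lookup(self, tenant_id: str, remote_url: str) -> Repository:
        """Stands in for the indexer: lookup only, never auto-creation."""
        try:
            return self._repos[(tenant_id, remote_url)]
        except KeyError:
            raise RepositoryNotRegistered(remote_url) from None
```

Because every lookup carries a tenant_id, a repository registered by one tenant is invisible to any other tenant, even when the remote URL is identical.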

Symbols are the named code elements extracted by the indexer: functions, classes, methods, interfaces, constants, types, and other declarations. Each symbol carries metadata including its location, visibility, documentation, and relationships to other symbols.

Every symbol has a fully qualified name (FQN) - a unique identifier within its repository. FQNs follow language-specific conventions:

Language    Example
Java        com.example.UserService.findById
Go          pkg/handlers.CreateUser
TypeScript  src/auth/middleware.ts#validateToken

FQNs are used for precise symbol lookup and as stable references in the knowledge graph.
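As an illustration, the three example FQNs can be assembled from their parts as follows. This is a sketch inferred only from the examples shown here; the function names are hypothetical, and the real rules (nested classes, overloads, anonymous functions) are not described on this page.

```python
def java_fqn(package: str, class_name: str, member: str) -> str:
    # Dot-separated: package, class, member.
    return f"{package}.{class_name}.{member}"


def go_fqn(package_path: str, name: str) -> str:
    # Slash-separated package path, then a dot and the declared name.
    return f"{package_path}.{name}"


def typescript_fqn(file_path: str, name: str) -> str:
    # File path relative to the repository root, then '#' and the symbol name.
    return f"{file_path}#{name}"


# A hypothetical per-repository index: the FQN is the lookup key.
symbol_index = {
    java_fqn("com.example", "UserService", "findById"): {"kind": "method"},
    go_fqn("pkg/handlers", "CreateUser"): {"kind": "function"},
    typescript_fqn("src/auth/middleware.ts", "validateToken"): {"kind": "function"},
}
```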

The indexing pipeline is the process of analyzing a codebase and extracting structured intelligence. It runs in the following order:

  1. Git clone or pull - Fetch the latest code from the repository.
  2. File filtering - Exclude dependency directories, build artifacts, binary files, lock files, and other non-useful content.
  3. Tree-sitter AST parsing - Parse source files into abstract syntax trees for structural analysis.
  4. Language analysis - Extract symbols, references, call edges, and import edges using language-specific analyzers.
  5. Framework detection - Identify framework patterns (Spring, Express, Rails, etc.) using import edges from the language pass.
  6. Text analysis - Process non-code files (OpenAPI specs, SQL DDL, Maven POM files, property files) that have no tree-sitter grammar.
  7. Embedding generation - Generate vector embeddings for semantic search via the embedding service.
  8. Batch upload - Write all extracted data to the API server in batches.
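The stage order above, together with the file filtering and batching steps, can be sketched as follows. The stage identifiers, the exclusion sets, and the helper names (`keep_file`, `in_batches`) are all assumptions made for illustration; the actual filter rules and batch sizes are not documented here.

```python
from pathlib import PurePosixPath
from typing import Iterable, Iterator

# Stage names in the order listed above (identifiers are hypothetical).
PIPELINE_STAGES = (
    "git_clone_or_pull",
    "file_filtering",
    "tree_sitter_parsing",
    "language_analysis",
    "framework_detection",
    "text_analysis",
    "embedding_generation",
    "batch_upload",
)

# Illustrative exclusion rules only; the real lists are not specified.
EXCLUDED_DIRS = {"node_modules", "vendor", "target", "dist", "build", ".git"}
EXCLUDED_NAMES = {"package-lock.json", "yarn.lock", "Cargo.lock", "Gemfile.lock"}
EXCLUDED_SUFFIXES = {".png", ".jpg", ".jar", ".class", ".so", ".exe"}


def keep_file(path: str) -> bool:
    """Return True if a repository-relative path should be analyzed."""
    p = PurePosixPath(path)
    # Check only the directory components, not the file name itself.
    if any(part in EXCLUDED_DIRS for part in p.parts[:-1]):
        return False
    if p.name in EXCLUDED_NAMES:
        return False
    return p.suffix.lower() not in EXCLUDED_SUFFIXES


def in_batches(items: Iterable, size: int) -> Iterator[list]:
    """Yield lists of at most `size` items, as a batch upload step might."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Framework detection appears after language analysis in the tuple because it consumes the import edges that the language pass produces.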

Analyzers are pluggable components that extract intelligence from source files. There are three types:

Language analyzers use tree-sitter grammars to parse ASTs and extract symbols, references, and call edges. Each supported language (Java, Go, TypeScript, Python, C#, Ruby, PHP) has its own analyzer.

Framework analyzers build on language analysis to detect framework-specific patterns - route definitions, dependency injection, event handlers, and more. They run after language analyzers and use the import edges from the language pass for framework detection.

Text analyzers handle files that do not have a tree-sitter grammar, such as OpenAPI specifications, SQL DDL files, Maven POM files, and property files. They extract structured data without AST parsing.
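To illustrate how a framework analyzer might consume the output of the language pass, the sketch below detects frameworks from import edges alone. The `ImportEdge` shape and the prefix table are hypothetical; real detection presumably relies on richer signals than a prefix match.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ImportEdge:
    source_file: str
    imported_module: str


# Hypothetical mapping from import prefixes to framework names.
FRAMEWORK_IMPORT_PREFIXES = {
    "org.springframework": "Spring",
    "express": "Express",
}


def detect_frameworks(import_edges: Iterable[ImportEdge]) -> set[str]:
    """Return the frameworks implied by the import edges of the language pass."""
    detected = set()
    for edge in import_edges:
        module = edge.imported_module
        for prefix, framework in FRAMEWORK_IMPORT_PREFIXES.items():
            # Match the module itself or a sub-module/sub-path of it,
            # so that "expressive" does not count as "express".
            if module == prefix or module.startswith((prefix + ".", prefix + "/")):
                detected.add(framework)
    return detected
```

Since the function takes import edges as its only input, it cannot run until the language analyzers have finished, which matches the ordering described above.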

IntelliRag builds a knowledge graph in Neo4j that captures the structural relationships in your codebase. The graph contains:

  • Call edges - Which functions call which other functions.
  • Import edges - Which modules import which other modules.
  • Data flow - How data moves through the system.
  • Entry points - HTTP endpoints, message handlers, and other external interfaces.

The knowledge graph powers navigation queries such as “who calls this function,” “what depends on this module,” and “what is affected if I change this.”
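The first and last of these queries reduce to traversals over call edges. The sketch below uses an in-memory graph as a stand-in for Neo4j, whose schema is not shown on this page; the `CallGraph` class and its method names are hypothetical.

```python
from collections import defaultdict, deque


class CallGraph:
    """In-memory stand-in for the call-edge portion of the knowledge graph."""

    def __init__(self) -> None:
        # Maps each callee FQN to the set of its direct callers.
        self._callers = defaultdict(set)

    def add_call(self, caller: str, callee: str) -> None:
        self._callers[callee].add(caller)

    def who_calls(self, fqn: str) -> set[str]:
        """Direct callers only: 'who calls this function'."""
        return set(self._callers.get(fqn, ()))

    def affected_by_change(self, fqn: str) -> set[str]:
        """Transitive callers: 'what is affected if I change this'."""
        seen = set()
        queue = deque([fqn])
        while queue:
            current = queue.popleft()
            for caller in self._callers.get(current, ()):
                # Skip already-visited nodes so recursive calls terminate.
                if caller not in seen and caller != fqn:
                    seen.add(caller)
                    queue.append(caller)
        return seen
```

For example, if `A.handler` calls `B.service` and `B.service` calls `C.repo`, then `who_calls("C.repo")` contains only `B.service`, while `affected_by_change("C.repo")` contains both.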

IntelliRag maintains seven Qdrant vector collections for semantic search:

Collection              Purpose
code_chunks             Source code segments for natural language code search
module_summaries        LLM-generated module descriptions
pattern_matches         Detected design patterns and architectural patterns
git_archaeology_chunks  Git history analysis (churn, ownership, change patterns)
debt_vectors            Technical debt items and code quality signals
api_contract_chunks     API endpoint definitions and contracts
event_catalog_vectors   Event producers, consumers, and message schemas

All collections use the same embedding model (Voyage AI voyage-code-3, 1024 dimensions) and enforce tenant_id filtering on every query.
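One way to make tenant filtering hard to forget is to build every search request through a single helper that always attaches the filter. The sketch below is an assumption about how this could look, not IntelliRag's code: `build_search_request` is a hypothetical name, and the filter dictionary is modeled on Qdrant's JSON filter syntax (`must` with a `key`/`match` condition).

```python
from typing import Sequence

COLLECTIONS = frozenset({
    "code_chunks",
    "module_summaries",
    "pattern_matches",
    "git_archaeology_chunks",
    "debt_vectors",
    "api_contract_chunks",
    "event_catalog_vectors",
})
EMBEDDING_DIMENSIONS = 1024  # voyage-code-3, as stated above


def build_search_request(
    collection: str,
    tenant_id: str,
    query_vector: Sequence[float],
    limit: int = 10,
) -> dict:
    """Build a search request that always carries a tenant_id filter."""
    if collection not in COLLECTIONS:
        raise ValueError(f"unknown collection: {collection}")
    if not tenant_id:
        raise ValueError("tenant_id is required on every query")
    if len(query_vector) != EMBEDDING_DIMENSIONS:
        raise ValueError(f"expected {EMBEDDING_DIMENSIONS} dimensions, got {len(query_vector)}")
    return {
        "collection": collection,
        "body": {
            "vector": list(query_vector),
            "limit": limit,
            # Only points whose payload tenant_id matches are eligible.
            "filter": {"must": [{"key": "tenant_id", "match": {"value": tenant_id}}]},
        },
    }
```

Rejecting an empty tenant_id up front means a caller cannot accidentally issue an unfiltered query.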

After the indexing pipeline completes, an asynchronous enrichment process runs LLM-powered analysis on the indexed data. Enrichment is queued via a job system and processed by a dedicated worker - it never runs in the API server request path.

Enrichment job types include:

  • Module summary - Generate natural language descriptions of modules and packages.
  • Debt triage - Classify and prioritize detected technical debt.
  • Dead code review - Identify potentially unused code paths.
  • Schema annotation - Add context to database schema definitions.
  • Contract inference - Infer API contracts from code patterns.
  • Event description - Describe event producers and consumers.

Enrichment results are stored alongside indexed data and made available through the same MCP tools and API endpoints.
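The separation between queuing and processing can be sketched with a standard-library queue. This is an illustration under stated assumptions: the job type identifiers, the `EnrichmentJob` fields, and the choice to enqueue every job type after indexing are hypothetical, and the real job system is not described here.

```python
import queue
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class JobType(Enum):
    MODULE_SUMMARY = "module_summary"
    DEBT_TRIAGE = "debt_triage"
    DEAD_CODE_REVIEW = "dead_code_review"
    SCHEMA_ANNOTATION = "schema_annotation"
    CONTRACT_INFERENCE = "contract_inference"
    EVENT_DESCRIPTION = "event_description"


@dataclass(frozen=True)
class EnrichmentJob:
    tenant_id: str
    repository_id: str
    job_type: JobType


def enqueue_after_indexing(jobs: queue.Queue, tenant_id: str, repository_id: str) -> int:
    """API-server side: enqueue only, never run LLM analysis inline."""
    for job_type in JobType:
        jobs.put(EnrichmentJob(tenant_id, repository_id, job_type))
    return jobs.qsize()


def run_worker(jobs: queue.Queue, handlers: dict[JobType, Callable]) -> list:
    """Worker side: drain the queue and dispatch each job to its handler."""
    results = []
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return results
        results.append((job, handlers[job.job_type](job)))
        jobs.task_done()
```

The request path returns as soon as the jobs are queued; the slow LLM calls would live inside the handlers that only the worker invokes.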