Language Support

The IntelliRag indexer uses tree-sitter grammars to parse source files into abstract syntax trees (ASTs), then extracts structured intelligence from each file. This page covers the supported languages, what gets extracted, and how the indexer handles files that have no tree-sitter grammar.

| Language | Grammar | Symbols | Calls | Imports | Entry points | Data flow | Patterns |
|---|---|---|---|---|---|---|---|
| Java | tree-sitter-java | Yes | Yes | Yes | Yes | Yes | Yes |
| Go | tree-sitter-go | Yes | Yes | Yes | Yes | Yes | Yes |
| Python | tree-sitter-python | Yes | Yes | Yes | Yes | Yes | Yes |
| TypeScript | tree-sitter-typescript | Yes | Yes | Yes | Yes | Yes | Yes |
| C# | tree-sitter-c-sharp | Yes | Yes | Yes | Yes | Yes | Yes |
| Ruby | tree-sitter-ruby | Yes | Yes | Yes | Yes | Yes | Yes |
| PHP | tree-sitter-php | Yes | Yes | Yes | Yes | Yes | Yes |
| Terraform (HCL) | tree-sitter-hcl | Yes | - | Yes | - | Yes | Yes |

Each column in the table above represents a category of intelligence the indexer extracts:

  • Symbols - Named code elements: functions, methods, classes, interfaces, structs, constants, types, and other declarations. Each symbol includes its fully qualified name (FQN), location, visibility, and documentation.

  • Calls - Function and method invocations, including method dispatch through interfaces. These form the call graph in the knowledge graph.

  • Imports - Module, package, and file import relationships. Import edges are used by framework analyzers to detect framework usage and by dependency analysis tools.

  • Entry points - External interfaces into the codebase: HTTP route handlers, message consumers, CLI commands, event listeners, and scheduled tasks.

  • Data flow - How data moves through function parameters, return values, and assignments. Used to trace the path of a value through the system.

  • Patterns - Framework usage patterns, design patterns, and anti-patterns detected in the code. Includes things like singleton implementations, repository patterns, and dependency injection usage.

The indexer uses a seven-step hierarchy, evaluated in order, to determine how to process each file:

  1. Shebang line - A shebang such as #!/usr/bin/env python3 overrides the file extension. This handles scripts with no extension or with a mismatched extension.

  2. Exact filename match - Recognized filenames like Makefile, Dockerfile, Jenkinsfile, Procfile, and pom.xml are routed to the appropriate analyzer regardless of extension.

  3. Extension match - Standard file extensions (.java, .go, .py, .ts, .cs, .rb, .php, .tf) select the language analyzer.

  4. Path pattern match - Files matching path patterns like .github/workflows/*.yml are classified by their location in the repository.

  5. Content sniffing - YAML files are classified by their top-level keys. SQL files are checked for DDL statements (CREATE TABLE, ALTER TABLE).

  6. go-enry fallback - The go-enry library provides a final classification based on heuristics.

  7. Text analyzer fallback - Files that do not match any tree-sitter grammar are passed to text analyzers for structured extraction.
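
The first five steps above can be sketched as a first-match-wins cascade. This is a minimal illustration, not the indexer's actual code: the function names (classify, fromShebang, sniff), the lookup tables, and the returned labels are all assumptions made for the example, and the go-enry and text analyzer fallbacks are reduced to a placeholder return value.

```go
package main

import (
	"bytes"
	"path"
	"strings"
)

// All tables and labels below are hypothetical; the real indexer's
// lists are larger and may be organized differently.
var byFilename = map[string]string{
	"Makefile":   "makefile",
	"Dockerfile": "dockerfile",
	"pom.xml":    "maven",
}

var byExtension = map[string]string{
	".java": "java", ".go": "go", ".py": "python", ".ts": "typescript",
	".cs": "csharp", ".rb": "ruby", ".php": "php", ".tf": "terraform",
}

var byPathPattern = []struct{ pattern, kind string }{
	{".github/workflows/*.yml", "github-actions"},
}

// classify checks each step in priority order and returns the first match.
// relPath is a slash-separated path relative to the repository root.
func classify(relPath string, content []byte) string {
	if kind := fromShebang(content); kind != "" { // step 1
		return kind
	}
	base := path.Base(relPath)
	if kind, ok := byFilename[base]; ok { // step 2
		return kind
	}
	if kind, ok := byExtension[strings.ToLower(path.Ext(base))]; ok { // step 3
		return kind
	}
	for _, p := range byPathPattern { // step 4
		if ok, _ := path.Match(p.pattern, relPath); ok {
			return p.kind
		}
	}
	if kind := sniff(base, content); kind != "" { // step 5
		return kind
	}
	// Steps 6 and 7 (go-enry, text analyzers) are omitted from this sketch.
	return "unknown"
}

// fromShebang maps an interpreter named on a "#!" first line to a language.
func fromShebang(content []byte) string {
	if !bytes.HasPrefix(content, []byte("#!")) {
		return ""
	}
	line := content
	if i := bytes.IndexByte(content, '\n'); i >= 0 {
		line = content[:i]
	}
	fields := strings.Fields(string(line[2:]))
	if len(fields) == 0 {
		return ""
	}
	interp := path.Base(fields[0])
	if interp == "env" && len(fields) > 1 {
		interp = fields[1] // "#!/usr/bin/env python3" names the interpreter second
	}
	switch {
	case strings.HasPrefix(interp, "python"):
		return "python"
	case interp == "ruby":
		return "ruby"
	case interp == "php":
		return "php"
	}
	return ""
}

// sniff inspects content: top-level YAML keys and SQL DDL statements.
func sniff(base string, content []byte) string {
	text := string(content)
	switch strings.ToLower(path.Ext(base)) {
	case ".yml", ".yaml":
		for _, line := range strings.Split(text, "\n") {
			if strings.HasPrefix(line, "openapi:") || strings.HasPrefix(line, "swagger:") {
				return "openapi"
			}
		}
	case ".sql":
		upper := strings.ToUpper(text)
		if strings.Contains(upper, "CREATE TABLE") || strings.Contains(upper, "ALTER TABLE") {
			return "sqlddl"
		}
	}
	return ""
}
```

In this sketch, a file named tool.rb whose first line is #!/usr/bin/env python3 is classified as Python, because the shebang check runs before the extension lookup.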

Some files carry valuable intelligence but have no tree-sitter grammar. The indexer includes dedicated text analyzers for these formats:

| File type | Analyzer | What it extracts |
|---|---|---|
| OpenAPI/Swagger (YAML/JSON) | openapi | API endpoints, request/response schemas, authentication requirements |
| SQL DDL files | sqlddl | Table definitions, column types, constraints, indexes, foreign keys |
| Maven pom.xml | maven | Project dependencies, build plugins, module structure |
| Java .properties files | properties | Configuration keys, values, and which files define or consume them |

Text analyzers produce the same IndexOutput as language analyzers and follow the same batch upload path.
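
To illustrate what sharing one output type can look like, here is a rough sketch in which two text analyzers implement a common interface and return the same struct. Only the name IndexOutput and the analyzer names sqlddl and properties come from this page. The Analyzer interface, the Symbol type, every field, and the regex-based parsing are assumptions for the example and are far simpler than real extraction would be.

```go
package main

import (
	"regexp"
	"strings"
)

// Symbol and IndexOutput are hypothetical stand-ins; the real IndexOutput
// carries more than a flat symbol list.
type Symbol struct {
	Name string
	Kind string
	Line int
}

type IndexOutput struct {
	Path     string
	Analyzer string
	Symbols  []Symbol
}

// Analyzer is an assumed common interface for language and text analyzers.
type Analyzer interface {
	Name() string
	Analyze(path string, content []byte) IndexOutput
}

// createTable matches a CREATE TABLE statement at the start of a line.
var createTable = regexp.MustCompile(
	`(?i)^\s*CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([A-Za-z_][A-Za-z0-9_.]*)`)

type sqlDDLAnalyzer struct{}

func (sqlDDLAnalyzer) Name() string { return "sqlddl" }

// Analyze records one "table" symbol per CREATE TABLE statement.
func (a sqlDDLAnalyzer) Analyze(path string, content []byte) IndexOutput {
	out := IndexOutput{Path: path, Analyzer: a.Name()}
	for i, line := range strings.Split(string(content), "\n") {
		if m := createTable.FindStringSubmatch(line); m != nil {
			out.Symbols = append(out.Symbols, Symbol{Name: m[1], Kind: "table", Line: i + 1})
		}
	}
	return out
}

type propertiesAnalyzer struct{}

func (propertiesAnalyzer) Name() string { return "properties" }

// Analyze records one "config_key" symbol per key=value or key:value line.
func (a propertiesAnalyzer) Analyze(path string, content []byte) IndexOutput {
	out := IndexOutput{Path: path, Analyzer: a.Name()}
	for i, line := range strings.Split(string(content), "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, "!") {
			continue // blank line or comment
		}
		if sep := strings.IndexAny(line, "=:"); sep > 0 {
			key := strings.TrimSpace(line[:sep])
			out.Symbols = append(out.Symbols, Symbol{Name: key, Kind: "config_key", Line: i + 1})
		}
	}
	return out
}

// Compile-time checks that both analyzers satisfy the shared interface.
var (
	_ Analyzer = sqlDDLAnalyzer{}
	_ Analyzer = propertiesAnalyzer{}
)
```

Because both analyzers return the same type, downstream code such as a batch uploader would not need to know which kind of analyzer produced a given result.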

Before any analysis, the indexer filters out files that provide no code intelligence value. This keeps indexing fast and avoids noise in search results.

  • Dependency directories - node_modules/, vendor/, target/, .gradle/, Pods/, and similar package manager output.

  • Build artifacts - dist/, build/, out/, .next/, bin/, obj/, and compiled output directories.

  • Binary extensions - .class, .jar, .dll, .so, .wasm, .png, .jpg, .woff2, and other non-text formats.

  • Lock files - package-lock.json, yarn.lock, go.sum, Cargo.lock, Gemfile.lock, and other dependency lock files.

  • Secrets - .env, .env.*, .pem, .key, .crt, and other files that may contain credentials.

  • IDE and OS files - .vscode/, .idea/, .DS_Store, .cache/, coverage/, and editor/OS metadata.

All exclusion patterns use O(1) map lookups by directory name, file extension, or exact filename. To see the full list of patterns, refer to the source in indexer/internal/pipeline/exclusions.go.
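
A map-based exclusion check of this kind might look like the following sketch. It is not a copy of exclusions.go: the variable names, the isExcluded function, and the abbreviated pattern sets are assumptions, populated only from the examples listed on this page. The .env.* pattern is handled with a prefix check here, since a wildcard cannot be expressed as a single map key.

```go
package main

import (
	"path"
	"strings"
)

// Hypothetical, abbreviated sets. A map with empty-struct values works as
// a set with constant-time membership tests.
var excludedDirs = map[string]struct{}{
	"node_modules": {}, "vendor": {}, "target": {}, ".gradle": {}, "Pods": {},
	"dist": {}, "build": {}, "out": {}, ".next": {}, "bin": {}, "obj": {},
	".vscode": {}, ".idea": {}, ".cache": {}, "coverage": {},
}

var excludedExts = map[string]struct{}{
	".class": {}, ".jar": {}, ".dll": {}, ".so": {}, ".wasm": {},
	".png": {}, ".jpg": {}, ".woff2": {}, ".pem": {}, ".key": {}, ".crt": {},
}

var excludedFiles = map[string]struct{}{
	"package-lock.json": {}, "yarn.lock": {}, "go.sum": {},
	"Cargo.lock": {}, "Gemfile.lock": {}, ".env": {}, ".DS_Store": {},
}

// isExcluded reports whether a slash-separated repository path should be
// skipped before analysis.
func isExcluded(relPath string) bool {
	parts := strings.Split(relPath, "/")

	// One lookup per directory component of the path.
	for _, dir := range parts[:len(parts)-1] {
		if _, ok := excludedDirs[dir]; ok {
			return true
		}
	}

	base := parts[len(parts)-1]
	if _, ok := excludedFiles[base]; ok {
		return true
	}
	// .env.* is a prefix pattern, so it needs a string check, not a map key.
	if strings.HasPrefix(base, ".env.") {
		return true
	}
	_, ok := excludedExts[strings.ToLower(path.Ext(base))]
	return ok
}
```

Each individual lookup is constant time, so the cost per file in this sketch grows with the number of path components rather than with the number of exclusion patterns.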