
feat: Python parser #53

Merged
merged 13 commits into from
Jul 30, 2025
4 changes: 4 additions & 0 deletions .gitmodules
@@ -0,0 +1,4 @@
[submodule "pylsp"]
path = pylsp
url = git@github.com:Hoblovski/python-lsp-server.git
branch = abc
14 changes: 7 additions & 7 deletions README.md
Expand Up @@ -25,7 +25,6 @@ see [UniAST Specification](docs/uniast-zh.md)


# Quick Start

## Use ABCoder as a MCP server

1. Install ABCoder:
Expand All @@ -40,13 +39,15 @@ see [UniAST Specification](docs/uniast-zh.md)
abcoder parse {language} {repo-path} > xxx.json
```

For example:
For example, to parse a Go repository:

```bash
git clone https://github.com/cloudwego/localsession.git localsession
abcoder parse go localsession -o /abcoder-asts/localsession.json
```

To parse repositories in other languages, [install the corresponding language server first](./docs/lsp-installation-en.md).

3. Integrate ABCoder's MCP tools into your AI agent.

```json
Expand Down Expand Up @@ -112,17 +113,16 @@ $ exit

- NOTICE: This feature is Work-In-Progress. It only supports code analysis at present.


# Supported Languages

ABCoder currently supports the following languages:

| Language | Parser | Writer |
| -------- | ----------- | ----------- |
| Go | ✅ | ✅ |
| Rust | ✅ | Coming Soon |
| C | ✅ | ❌ |
| Python | Coming Soon | Coming Soon |
| Go | ✅ | ✅ |
| Rust | ✅ | Coming Soon |
| C | ✅ | Coming Soon |
| Python | | Coming Soon |


# Getting Involved
Expand Down
44 changes: 44 additions & 0 deletions docs/lsp-installation-cn.md
@@ -0,0 +1,44 @@
# Language Server Installation
To resolve dependencies between symbols in a repository, the abcoder parser relies on a language server for each language.
Install the corresponding language server before running the parser.

The mapping between languages and language servers is as follows:

| Language | Language server | Executable |
| --- | --- | --- |
| Go | Does not use LSP; uses the built-in parser | / |
| Rust | rust-analyzer | rust-analyzer |
| Python | (modified) python-lsp-server | pylsp |
| C | clangd-18 | clangd-18 |

After completing the installation steps below, make sure the corresponding executable is in PATH before running abcoder.

## Rust
* First install the Rust language via [rustup](https://www.rust-lang.org/tools/install)
* Install rust-analyzer
```bash
$ rustup component add rust-analyzer
$ rust-analyzer --version # Verify the installation
```

## Python
* Install Python 3.9+
* Install pylsp from the git submodule
```bash
$ git submodule init
$ git submodule update
$ cd pylsp
$ pip install -e . # Consider running this in a separate conda/venv environment
$ export PATH=$(realpath ./bin):$PATH # Put this in your shell rc file, or set it before each abcoder run
$ pylsp --version # Verify the installation
```

## C
* Ubuntu 24.04 or later: install directly from apt
```bash
$ sudo apt install clangd-18
```

* Other distributions: build it manually, or download a pre-built release from the [LLVM official website](https://releases.llvm.org/download.html).
clangd is part of clang-tools-extra.

43 changes: 43 additions & 0 deletions docs/lsp-installation-en.md
@@ -0,0 +1,43 @@
# Language Server Installation

To resolve dependencies between symbols in a repository, the abcoder parser relies on a language server for each supported language. Install the corresponding language server before running the parser.

The mapping between languages and language servers is as follows:

| Language | Language Server | Executable |
| -------- | ------------------------- | --------------- |
| Go | Does not use LSP, uses built-in parser | / |
| Rust | rust-analyzer | rust-analyzer |
| Python | (Modified) python-lsp-server | pylsp |
| C | clangd-18 | clangd-18 |

Ensure the corresponding executable is in PATH before running abcoder.

## Rust
* First, install the Rust language via [rustup](https://www.rust-lang.org/tools/install).
* Install rust-analyzer:
```bash
$ rustup component add rust-analyzer
$ rust-analyzer --version # Verify successful installation
```

## Python
* Install Python 3.9+
* Install pylsp from the git submodule:
```bash
$ git submodule init
$ git submodule update
$ cd pylsp
$ pip install -e . # Consider executing in a separate conda/venv environment
$ export PATH=$(realpath ./bin):$PATH # Add this to your .rc file, or set it before each abcoder run
$ pylsp --version # Verify successful installation
```

## C
* Ubuntu 24.04 or later: Install directly from apt:
```bash
$ sudo apt install clangd-18
```

* Other distributions: build clangd manually, or download a pre-compiled release from the [LLVM official website](https://releases.llvm.org/download.html). clangd is part of `clang-tools-extra`.
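
The requirement that each executable be in PATH can be checked up front. The following is a minimal sketch, not part of abcoder: the language keys (`rust`, `python`, `c`) and the names `serverExecutables` and `checkServer` are hypothetical, and only the executable names come from the table above. It uses the standard `os/exec.LookPath` call.

```go
package main

import (
	"fmt"
	"os/exec"
)

// serverExecutables maps a language to the executable listed in the table.
// The map keys are illustrative and may not match abcoder's own language names.
var serverExecutables = map[string]string{
	"rust":   "rust-analyzer",
	"python": "pylsp",
	"c":      "clangd-18",
}

// checkServer returns the resolved path of the language server for lang,
// or an error if the language is unknown or the executable is not in PATH.
func checkServer(lang string) (string, error) {
	exe, ok := serverExecutables[lang]
	if !ok {
		return "", fmt.Errorf("no language server registered for %q", lang)
	}
	path, err := exec.LookPath(exe)
	if err != nil {
		return "", fmt.Errorf("%s not found in PATH: %w", exe, err)
	}
	return path, nil
}

func main() {
	// Report the status of every known language server.
	for lang := range serverExecutables {
		path, err := checkServer(lang)
		if err != nil {
			fmt.Printf("%-6s MISSING (%v)\n", lang, err)
			continue
		}
		fmt.Printf("%-6s %s\n", lang, path)
	}
}
```

Running a check like this before `abcoder parse` would surface a missing `pylsp` or `clangd-18` as a clear message rather than a failure to start the server.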
12 changes: 10 additions & 2 deletions docs/uniast-en.md
Expand Up @@ -5,8 +5,7 @@ Universal Abstract-Syntax-Tree is a LLM-friendly, language-agnostic code context

# Identity Node Unique Identification

To ensure precise querying and scalable storage, `ModPath?PkgPath#SymbolName` is defined by convention as the globally unique identifier for AST Nodes.

To ensure precise querying and scalable storage, `ModPath?PkgPath#SymbolName` is used as the globally unique identifier for AST Nodes. For example:

```json
{
Expand All @@ -16,6 +15,15 @@ To ensure precise querying and scalable storage, `ModPath?PkgPath#SymbolName` is
}
```

> Note that different languages have different descriptions of module and package. For example:
> * In Go, a module refers to a project that contains multiple packages, and a package includes all the files within a specific directory.
> * In Python, a package is a directory, which may contain sub-packages. A package can also contain modules, which are .py files inside the package directory.
> * In Rust, the term package does not exist at all. Instead, a crate (project) contains multiple modules, and modules may include sub-modules.
> * In C, neither concept exists at all.
>
> Do not confuse them with the terminology used in abcoder!
> In abcoder, unless otherwise specified, the module (mod) and package (pkg) are defined as follows:

- ModPath: A complete build unit, written as `installation path@version`. LLMs do not need this information; it is kept only to make each Identity globally unique. It corresponds to different concepts in various languages:

- <u>Golang</u>: Corresponds to a module, e.g., github.com/cloudwego/hertz@v0.1.0
Expand Down
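
The `ModPath?PkgPath#SymbolName` scheme from this hunk can be illustrated with a small round-trip helper. This is a sketch under stated assumptions: the `Identity` struct, its field names, and `ParseIdentity` are hypothetical and not taken from abcoder's source, and the `PkgPath` and symbol name in `main` are made-up values. It also assumes `ModPath` contains no `?` and `PkgPath` contains no `#`.

```go
package main

import (
	"fmt"
	"strings"
)

// Identity holds the three parts of the identifier.
// Struct and field names are illustrative only.
type Identity struct {
	ModPath string
	PkgPath string
	Name    string
}

// Full renders the identity as ModPath?PkgPath#SymbolName.
func (id Identity) Full() string {
	return id.ModPath + "?" + id.PkgPath + "#" + id.Name
}

// ParseIdentity splits a string of the form ModPath?PkgPath#SymbolName.
// It splits on the first '?' and then on the first '#' after it.
func ParseIdentity(s string) (Identity, error) {
	mod, rest, ok := strings.Cut(s, "?")
	if !ok {
		return Identity{}, fmt.Errorf("missing '?' in %q", s)
	}
	pkg, name, ok := strings.Cut(rest, "#")
	if !ok {
		return Identity{}, fmt.Errorf("missing '#' in %q", s)
	}
	return Identity{ModPath: mod, PkgPath: pkg, Name: name}, nil
}

func main() {
	id := Identity{
		ModPath: "github.com/cloudwego/hertz@v0.1.0", // ModPath example from the spec
		PkgPath: "example/pkg",                       // hypothetical package path
		Name:    "ExampleSymbol",                     // hypothetical symbol name
	}
	fmt.Println(id.Full())
}
```

Because the version is embedded in `ModPath`, two versions of the same module yield distinct identifiers even when the package path and symbol name are identical.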
12 changes: 10 additions & 2 deletions docs/uniast-zh.md
Expand Up @@ -5,8 +5,7 @@ Universal Abstract-Syntax-Tree, established by ABCoder, is an LLM-friendly, language

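# Identity Node Unique Identification

To ensure precise querying and scalable storage, `ModPath?PkgPath#SymbolName` is defined by convention as the globally unique identifier for AST Nodes.

To ensure precise querying and scalable storage, `ModPath?PkgPath#SymbolName` is defined by convention as the globally unique identifier for AST Nodes. For example: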

```json
{
Expand All @@ -16,6 +15,15 @@ Universal Abstract-Syntax-Tree, established by ABCoder, is an LLM-friendly, language
}
```

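> Note that different languages describe module and package differently. For example:
> * In Go, a module is a project that contains several packages, and a package contains the files in a given directory.
> * In Python, a package is a directory that may contain sub-packages. A package may also contain modules, which are the .py files in the package directory.
> * In Rust, the term package does not exist at all. Instead, a crate (project) contains several modules, and a module may contain sub-modules.
> * In C, neither concept exists at all.
>
> Do not confuse them with the terms as used in abcoder!
> In abcoder, unless otherwise specified, module (mod) and package (pkg) have the following meanings.

- ModPath: A complete build unit, written as installation path@version. LLMs do not need this information; it is kept only to make each Identity globally unique. It corresponds to different concepts in various languages:

- <u>Golang</u>: Corresponds to a module, e.g., github.com/cloudwego/hertz@v0.1.0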
Expand Down
46 changes: 37 additions & 9 deletions lang/collect/collect.go
Expand Up @@ -27,6 +27,7 @@ import (
"github.com/cloudwego/abcoder/lang/cxx"
"github.com/cloudwego/abcoder/lang/log"
. "github.com/cloudwego/abcoder/lang/lsp"
"github.com/cloudwego/abcoder/lang/python"
"github.com/cloudwego/abcoder/lang/rust"
"github.com/cloudwego/abcoder/lang/uniast"
)
Expand Down Expand Up @@ -85,9 +86,11 @@ type functionInfo struct {
func switchSpec(l uniast.Language) LanguageSpec {
switch l {
case uniast.Rust:
return &rust.RustSpec{}
return rust.NewRustSpec()
case uniast.Cxx:
return &cxx.CxxSpec{}
return cxx.NewCxxSpec()
case uniast.Python:
return python.NewPythonSpec()
default:
panic(fmt.Sprintf("unsupported language %s", l))
}
Expand All @@ -110,7 +113,30 @@ func NewCollector(repo string, cli *LSPClient) *Collector {
return ret
}

func (c *Collector) configureLSP(ctx context.Context) {
// XXX: should be put in language specification
if c.Language == uniast.Python {
if !c.NeedStdSymbol {
if c.Language == uniast.Python {
conf := map[string]interface{}{
"settings": map[string]interface{}{
"pylsp": map[string]interface{}{
"plugins": map[string]interface{}{
"jedi_definition": map[string]interface{}{
"follow_builtin_definitions": false,
},
},
},
},
}
c.cli.Notify(ctx, "workspace/didChangeConfiguration", conf)
}
}
}
}

func (c *Collector) Collect(ctx context.Context) error {
c.configureLSP(ctx)
excludes := make([]string, len(c.Excludes))
for i, e := range c.Excludes {
if !filepath.IsAbs(e) {
Expand All @@ -121,7 +147,7 @@ func (c *Collector) Collect(ctx context.Context) error {
}

// scan all files
roots := make([]*DocumentSymbol, 0, 1024)
root_syms := make([]*DocumentSymbol, 0, 1024)
scanner := func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
Expand Down Expand Up @@ -177,7 +203,7 @@ func (c *Collector) Collect(ctx context.Context) error {
sym.Text = content
sym.Tokens = tokens
c.syms[sym.Location] = sym
roots = append(roots, sym)
root_syms = append(root_syms, sym)
}

return nil
Expand All @@ -187,11 +213,11 @@ func (c *Collector) Collect(ctx context.Context) error {
}

// collect some extra metadata
syms := make([]*DocumentSymbol, 0, len(roots))
for _, sym := range roots {
entity_syms := make([]*DocumentSymbol, 0, len(root_syms))
for _, sym := range root_syms {
// only language entity symbols need to be collected in the next pass
if c.spec.IsEntitySymbol(*sym) {
syms = append(syms, sym)
entity_syms = append(entity_syms, sym)
}
c.processSymbol(ctx, sym, 1)
}
Expand Down Expand Up @@ -229,7 +255,7 @@ func (c *Collector) Collect(ctx context.Context) error {
// }

// collect dependencies
for _, sym := range syms {
for _, sym := range entity_syms {
next_token:

for i, token := range sym.Tokens {
Expand Down Expand Up @@ -572,11 +598,13 @@ func (c *Collector) collectImpl(ctx context.Context, sym *DocumentSymbol, depth
}
}
var impl string
// HACK: impl head for Rust.
if fn > 0 && fn < len(sym.Tokens) {
impl = ChunkHead(sym.Text, sym.Location.Range.Start, sym.Tokens[fn].Location.Range.Start)
}
// HACK: implhead for Python. Should actually be provided by the language spec.
if impl == "" || len(impl) < len(sym.Name) {
impl = sym.Name
impl = fmt.Sprintf("class %s {\n", sym.Name)
}
// search all methods
for _, method := range c.syms {
Expand Down
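
The `workspace/didChangeConfiguration` payload that `configureLSP` sends for Python can be reproduced in isolation to see exactly what goes over the wire. The sketch below is standalone and illustrative: `pylspDefinitionConfig` and `configJSON` are hypothetical helper names that do not appear in the PR, and only the nested key path is taken from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pylspDefinitionConfig builds the notification params using the key path
// from the diff: settings.pylsp.plugins.jedi_definition.follow_builtin_definitions.
// The function name is hypothetical.
func pylspDefinitionConfig(followBuiltins bool) map[string]interface{} {
	return map[string]interface{}{
		"settings": map[string]interface{}{
			"pylsp": map[string]interface{}{
				"plugins": map[string]interface{}{
					"jedi_definition": map[string]interface{}{
						"follow_builtin_definitions": followBuiltins,
					},
				},
			},
		},
	}
}

// configJSON serializes the params the way a JSON-RPC client would.
func configJSON(followBuiltins bool) (string, error) {
	b, err := json.Marshal(pylspDefinitionConfig(followBuiltins))
	return string(b), err
}

func main() {
	s, err := configJSON(false)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	// prints {"settings":{"pylsp":{"plugins":{"jedi_definition":{"follow_builtin_definitions":false}}}}}
}
```

Taking the flag as a parameter is one possible way to move this configuration into the language specification, which is what the `XXX` comment in the diff suggests as the eventual home for it.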
13 changes: 6 additions & 7 deletions lang/lsp/client.go
Expand Up @@ -32,9 +32,10 @@ import (
type LSPClient struct {
*jsonrpc2.Conn
*lspHandler
tokenTypes []string
tokenModifiers []string
files map[DocumentURI]*TextDocumentItem
tokenTypes []string
tokenModifiers []string
hasSemanticTokensRange bool
files map[DocumentURI]*TextDocumentItem
ClientOptions
}

Expand Down Expand Up @@ -156,10 +157,6 @@ func initLSPClient(ctx context.Context, svr io.ReadWriteCloser, dir DocumentURI,
return nil, fmt.Errorf("server did not provide TypeDefinition")
}

implementationProvider, ok := vs["implementationProvider"].(bool)
if !ok || !implementationProvider {
return nil, fmt.Errorf("server did not provide Implementation")
}
documentSymbolProvider, ok := vs["documentSymbolProvider"].(bool)
if !ok || !documentSymbolProvider {
return nil, fmt.Errorf("server did not provide DocumentSymbol")
Expand All @@ -174,6 +171,8 @@ func initLSPClient(ctx context.Context, svr io.ReadWriteCloser, dir DocumentURI,
if !ok || semanticTokensProvider == nil {
return nil, fmt.Errorf("server did not provide SemanticTokensProvider")
}
semanticTokensRange, ok := semanticTokensProvider["range"].(bool)
cli.hasSemanticTokensRange = ok && semanticTokensRange
legend, ok := semanticTokensProvider["legend"].(map[string]interface{})
if !ok || legend == nil {
return nil, fmt.Errorf("server did not provide SemanticTokensProvider.legend")
Expand Down
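
The capability check in this file reads `semanticTokensProvider.range` as a boolean. The LSP specification also allows `range` to be an object, so a server advertising `"range": {}` would be treated as lacking range support and would take the `semanticTokens/full` path, which is safe but slower. The sketch below shows one way to accept both forms; `supportsSemanticTokensRange` and `parseCaps` are hypothetical names, not code from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// supportsSemanticTokensRange reports whether the server capabilities
// advertise textDocument/semanticTokens/range. The `range` field may be
// a boolean or an object (possibly empty); both are handled.
func supportsSemanticTokensRange(caps map[string]interface{}) bool {
	provider, ok := caps["semanticTokensProvider"].(map[string]interface{})
	if !ok {
		return false
	}
	switch v := provider["range"].(type) {
	case bool:
		return v
	case map[string]interface{}:
		// An object, even an empty one, signals support.
		return true
	default:
		// Field absent or of an unexpected type.
		return false
	}
}

// parseCaps decodes a JSON capabilities object into a generic map.
func parseCaps(raw string) (map[string]interface{}, error) {
	var caps map[string]interface{}
	err := json.Unmarshal([]byte(raw), &caps)
	return caps, err
}

func main() {
	caps, err := parseCaps(`{"semanticTokensProvider":{"range":{},"full":true}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(supportsSemanticTokensRange(caps))
}
```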
28 changes: 14 additions & 14 deletions lang/lsp/lsp.go
Expand Up @@ -24,7 +24,6 @@ import (
"sort"
"strings"

"github.com/cloudwego/abcoder/lang/uniast"
"github.com/cloudwego/abcoder/lang/utils"
"github.com/sourcegraph/go-lsp"
)
Expand Down Expand Up @@ -285,22 +284,23 @@ func (cli *LSPClient) References(ctx context.Context, id Location) ([]Location,
return resp, nil
}

// TODO(perf): cache results especially for whole file queries.
// TODO(refactor): infer use_full_method from capabilities
func (cli *LSPClient) getSemanticTokensRange(ctx context.Context, req DocumentRange, resp *SemanticTokens, use_full_method bool) error {
if use_full_method {
req1 := struct {
TextDocument lsp.TextDocumentIdentifier `json:"textDocument"`
}{TextDocument: req.TextDocument}
if err := cli.Call(ctx, "textDocument/semanticTokens/full", req1, resp); err != nil {
return err
}
filterSemanticTokensInRange(resp, req.Range)
} else {
// Some language servers do not provide semanticTokens/range.
// In that case, we fall back to semanticTokens/full and then filter the tokens manually.
func (cli *LSPClient) getSemanticTokensRange(ctx context.Context, req DocumentRange, resp *SemanticTokens) error {
if cli.hasSemanticTokensRange {
if err := cli.Call(ctx, "textDocument/semanticTokens/range", req, resp); err != nil {
return err
}
return nil
}
// fall back to semanticTokens/full
req1 := struct {
TextDocument lsp.TextDocumentIdentifier `json:"textDocument"`
}{TextDocument: req.TextDocument}
if err := cli.Call(ctx, "textDocument/semanticTokens/full", req1, resp); err != nil {
return err
}
filterSemanticTokensInRange(resp, req.Range)
return nil
}

Expand Down Expand Up @@ -355,7 +355,7 @@ func (cli *LSPClient) SemanticTokens(ctx context.Context, id Location) ([]Token,
}

var resp SemanticTokens
if err := cli.getSemanticTokensRange(ctx, req, &resp, cli.Language == uniast.Cxx); err != nil {
if err := cli.getSemanticTokensRange(ctx, req, &resp); err != nil {
return nil, err
}

Expand Down
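
The fallback path depends on filtering a full-document token list down to a range. The body of `filterSemanticTokensInRange` is not shown in this diff, so the following is only a sketch of how such a filter could work. It assumes the standard LSP encoding of five integers per token (`deltaLine`, `deltaStartChar`, `length`, `tokenType`, `tokenModifiers`); the `Position`, `Range`, and `filterTokens` names are hypothetical. Tokens are decoded to absolute positions, kept when their start lies in `[Start, End)`, and re-encoded relative to the previously kept token.

```go
package main

import "fmt"

// Position and Range are minimal stand-ins for the LSP types.
type Position struct{ Line, Character uint32 }
type Range struct{ Start, End Position }

// before reports whether a comes strictly before b in document order.
func before(a, b Position) bool {
	return a.Line < b.Line || (a.Line == b.Line && a.Character < b.Character)
}

// filterTokens keeps the tokens whose start position lies in [r.Start, r.End)
// and returns them in the same relative encoding.
func filterTokens(data []uint32, r Range) []uint32 {
	out := make([]uint32, 0, len(data))
	var cur, prev Position // cur: current input token; prev: last kept token
	for i := 0; i+4 < len(data); i += 5 {
		// Decode the relative position into an absolute one.
		if data[i] > 0 {
			cur.Line += data[i]
			cur.Character = data[i+1]
		} else {
			cur.Character += data[i+1]
		}
		if before(cur, r.Start) || !before(cur, r.End) {
			continue
		}
		// Re-encode relative to the previously kept token.
		dLine := cur.Line - prev.Line
		dChar := cur.Character
		if dLine == 0 {
			dChar = cur.Character - prev.Character
		}
		out = append(out, dLine, dChar, data[i+2], data[i+3], data[i+4])
		prev = cur
	}
	return out
}

func main() {
	// Four tokens starting at (0,0), (0,5), (2,4) and (3,1).
	data := []uint32{0, 0, 3, 0, 0, 0, 5, 2, 1, 0, 2, 4, 6, 2, 0, 1, 1, 1, 3, 0}
	r := Range{Start: Position{0, 4}, End: Position{3, 0}}
	fmt.Println(filterTokens(data, r)) // prints [0 5 2 1 0 2 4 6 2 0]
}
```

Re-encoding matters because each token's position is relative to the one before it: dropping a token without adjusting its successor would shift every later token.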