How to read and write LLVM bitcode

Translated repost.

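I have read multiple posts on social media complaining about how scary LLVM is to get to grips with.
It doesn't help that the repository is so large that it is hard to find anything useful in it, that there are frequently hundreds of commits a day, that the mailing list is nearly impossible to keep track of, and that the resulting executables now top 40 MB...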

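Those tidbits aside, LLVM is very easy to work with once you have tamed the beast.
To help people get started, I thought I would put together the most trivial example of something you can do with LLVM: parse one of LLVM's intermediate representation files (known as bitcode, file extension .bc) and then write it back out.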

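First, let's go over some high-level LLVM terminology:

The main abstraction LLVM gives user code is the Module. It is a class that contains all the functions, global variables, and instructions for the code that you or your users write.
A bitcode file is effectively a serialization of an LLVM Module, so that it can be reconstructed in a different program.
LLVM uses MemoryBuffer objects to handle data that comes from files, standard input, or arrays.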
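Because a bitcode file is just a serialized Module, it can be recognized from its first bytes before LLVM is involved at all. The sketch below assumes the raw bitcode format, which starts with the four magic bytes 'B', 'C', 0xC0, 0xDE. The helper name looks_like_raw_bitcode is hypothetical and not part of any LLVM API, and the check does not cover textual .ll files or bitcode wrapped in an extra header.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an LLVM function): returns 1 if the buffer
 * begins with the raw bitcode magic number 'B' 'C' 0xC0 0xDE, else 0. */
int looks_like_raw_bitcode(const unsigned char *data, size_t size) {
  static const unsigned char magic[4] = {'B', 'C', 0xC0, 0xDE};

  /* A buffer shorter than the magic number cannot be bitcode. */
  if (size < sizeof(magic)) {
    return 0;
  }

  return 0 == memcmp(data, magic, sizeof(magic));
}
```

A check like this is only a cheap sanity test. The real validation still happens when LLVM parses the buffer.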

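For my example I will use the LLVM C API, a more stable abstraction layered on top of LLVM's core C++ headers. The C API is very useful if you have code that needs to work with multiple versions of LLVM. (As an aside, I use LLVM extensively in my job, and nearly every week some change to the LLVM C++ headers breaks our code. The C API never does.)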

Here I will assume that you have checked out LLVM, built it, and installed it. A few simple steps like these will do:

git clone https://git.llvm.org/git/llvm.git <llvm dir>
cd <llvm dir>
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=install ..
cmake --build . --target install

After the steps above, you will have an LLVM install in <llvm dir>/build/install.

For the small executable in this post I have used CMake. CMake is by far the easiest way to integrate with LLVM, since LLVM uses it as its own build system.

cmake_minimum_required(VERSION 3.4.3)
project(llvm_bc_parsing_example)

# option to allow a user to specify where an LLVM install is on the system
set(LLVM_INSTALL_DIR "" CACHE STRING "An LLVM install directory.")

if("${LLVM_INSTALL_DIR}" STREQUAL "")
  message(FATAL_ERROR "LLVM_INSTALL_DIR not set! Set it to the location of an LLVM install.")
endif()

# fixup paths to only use the Linux convention
string(REPLACE "\\" "/" LLVM_INSTALL_DIR ${LLVM_INSTALL_DIR})

# tell CMake where LLVM's module is
list(APPEND CMAKE_MODULE_PATH ${LLVM_INSTALL_DIR}/lib/cmake/llvm)

# include LLVM
include(LLVMConfig)

add_executable(llvm_bc_parsing_example main.c)

target_include_directories(llvm_bc_parsing_example PUBLIC ${LLVM_INCLUDE_DIRS})

target_link_libraries(llvm_bc_parsing_example PUBLIC LLVMBitReader LLVMBitWriter)

Now that CMake is set up and can find our existing LLVM install, we can get to work on the actual C code.
To use the LLVM C API there is one header you basically always need:

#include <llvm-c/Core.h>

We need two extra headers to read and write bitcode:

#include <llvm-c/BitReader.h>
#include <llvm-c/BitWriter.h>

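Now we start on our main function. We assume exactly 2 command-line arguments: the first is the input file and the second is the output file. If either argument is the file name '-', it means read from standard input or write to standard output: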

if(3 != argc){
    fprintf(stderr,"Invalid command line!\n");
    return 1;
}

const char *const inputFilename = argv[1];
const char *const outputFilename = argv[2];
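Since the '-' convention applies to both the input and the output name, the test can be factored into one small function. This is only a sketch: is_stdio_name is a hypothetical helper name, not something the original example or LLVM provides.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the file name is exactly "-",
 * which by convention selects stdin (for input) or stdout (for output). */
int is_stdio_name(const char *filename) {
  return 0 == strcmp(filename, "-");
}
```

With this helper, is_stdio_name(inputFilename) expresses the same condition as comparing the first character to '-' and the second to '\0'.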

First we read the input. We create an LLVM memory buffer object from either standard input or a file name:

LLVMMemoryBufferRef memoryBuffer;

// check if we are to read our input file from stdin
if (('-' == inputFilename[0]) && ('\0' == inputFilename[1])) {
  char *message;
  if (0 != LLVMCreateMemoryBufferWithSTDIN(&memoryBuffer, &message)) {
    fprintf(stderr, "%s\n", message);
    LLVMDisposeMessage(message);
    return 1;
  }
} else {
  char *message;
  if (0 != LLVMCreateMemoryBufferWithContentsOfFile(
               inputFilename, &memoryBuffer, &message)) {
    fprintf(stderr, "%s\n", message);
    LLVMDisposeMessage(message);
    return 1;
  }
}

After this code runs, memoryBuffer holds the contents of our bitcode file, ready to be parsed into an LLVM module. So let's create the module:

// now create our module using the memory buffer
LLVMModuleRef module;
if (0 != LLVMParseBitcode2(memoryBuffer, &module)) {
  fprintf(stderr, "Invalid bitcode detected!\n");
  LLVMDisposeMemoryBuffer(memoryBuffer);
  return 1;
}

// done with the memory buffer now, so dispose of it
LLVMDisposeMemoryBuffer(memoryBuffer);

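Once we have the module we no longer need the memory buffer, so we free it right away. We have now taken an LLVM bitcode file and deserialized it into an LLVM module, which you can do with as you please. So let's assume you have done everything you wanted with the module and now want to write it back out to a bitcode file.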

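Writing mirrors the reading approach: we look for the special file name '-' and handle it accordingly. (STDOUT_FILENO comes from <unistd.h> on POSIX systems.)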

// check if we are to write our output file to stdout
if (('-' == outputFilename[0]) && ('\0' == outputFilename[1])) {
  if (0 != LLVMWriteBitcodeToFD(module, STDOUT_FILENO, 0, 0)) {
    fprintf(stderr, "Failed to write bitcode to stdout!\n");
    LLVMDisposeModule(module);
    return 1;
  }
} else {
  if (0 != LLVMWriteBitcodeToFile(module, outputFilename)) {
    fprintf(stderr, "Failed to write bitcode to file!\n");
    LLVMDisposeModule(module);
    return 1;
  }
}

Finally, we clean up after ourselves by disposing of the module:
LLVMDisposeModule(module);

And with that, we can parse and write out an LLVM bitcode file.
GitHub Example