Interacting with the System

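Typical tasks for a Python ops script include reading environment variables, checking files, running system commands, parsing configuration, writing logs, and generating reports. The examples here rely mainly on the standard library; only the YAML example needs a third-party package.

Both the shell and Python can invoke commands. The difference is that Python is better at organizing command results, error branches, and follow-up processing into structured logic. A simple pipeline is still quicker to write in the shell; once a task involves multiple steps, error handling, and structured output, Python is the better fit.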

1. Path handling

pathlib.Path represents a path as an object, which is more robust than concatenating strings by hand.

python

from pathlib import Path

base_dir = Path("/data/apps")
config_path = base_dir / "nginx" / "conf" / "nginx.conf"

print(config_path)
print(config_path.exists())

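Common methods:

Method	Purpose
`exists()`	Whether the path exists
`is_file()`	Whether it is a regular file
`is_dir()`	Whether it is a directory
`mkdir()`	Create a directory
`glob()`	Find entries matching a wildcard pattern
`read_text()`	Read the file as text
`write_text()`	Write text to the file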
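
The methods in the table can be tried out safely in a temporary directory. The following is a minimal sketch; the nginx/conf layout and the file content are made up for illustration and do not refer to a real installation.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the sketch never touches real paths.
with tempfile.TemporaryDirectory() as tmp:
    conf_dir = Path(tmp) / "nginx" / "conf"  # hypothetical layout

    # parents=True creates intermediate directories; exist_ok=True makes reruns safe.
    conf_dir.mkdir(parents=True, exist_ok=True)

    conf_file = conf_dir / "nginx.conf"
    conf_file.write_text("worker_processes 1;\n", encoding="utf-8")

    is_dir = conf_dir.is_dir()
    is_file = conf_file.is_file()
    content = conf_file.read_text(encoding="utf-8")
    found = sorted(p.name for p in conf_dir.glob("*.conf"))

print(is_dir, is_file, found)
```

Passing encoding explicitly to read_text() and write_text() keeps the result independent of the locale the script happens to run under.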

Find log files:

python

from pathlib import Path

log_dir = Path("/var/log")

for path in log_dir.glob("*.log"):
    print(path)

Recursive search:

python

from pathlib import Path

for path in Path("/etc").rglob("*.conf"):
    print(path)

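On a large directory tree, rglob() visits a lot of files. In production scripts, restrict the search to a specific directory so that it never ends up scanning the entire root filesystem by accident.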
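
One possible way to bound a recursive search is to cap the depth explicitly. The sketch below swaps rglob() for os.walk(), because os.walk() allows pruning subdirectories while walking; the function name find_files and the max_depth parameter are assumptions made for this example.

```python
import os
import tempfile
from pathlib import Path


def find_files(root, suffix, max_depth=2):
    """Yield files under root ending with suffix, at most max_depth directory levels below root."""
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        if depth >= max_depth:
            # Clearing dirnames in place tells os.walk not to descend further.
            dirnames[:] = []
        for name in filenames:
            if name.endswith(suffix):
                yield Path(dirpath) / name


# Demo on a throwaway tree: a.conf, x/b.conf, x/y/c.conf
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "x" / "y").mkdir(parents=True)
    for rel in ("a.conf", "x/b.conf", "x/y/c.conf"):
        (Path(tmp) / rel).write_text("", encoding="utf-8")
    shallow = sorted(p.name for p in find_files(tmp, ".conf", max_depth=1))
    everything = sorted(p.name for p in find_files(tmp, ".conf", max_depth=5))

print(shallow)
```

With max_depth=1 the walk stops one level below the starting directory, so x/y/c.conf is never visited.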

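2. Environment variables

Environment variables are commonly used to pass in accounts, URLs, environment names, and feature switches.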

python

import os

env = os.environ.get("APP_ENV", "dev")
api_url = os.environ.get("API_URL")

print(f"env={env} api_url={api_url}")

Required environment variables can be validated in one place:

python

import os


def require_env(name):
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing environment variable: {name}")
    return value


token = require_env("API_TOKEN")

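Hardcoding sensitive values such as passwords and tokens in a script is a bad idea. Environment variables, permission-restricted configuration files, and secret management systems all make it easier to limit exposure than hardcoding does.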
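
When a script needs several required variables, it can be friendlier to report all missing ones in a single error instead of failing on the first. A minimal sketch of that idea follows; load_required_env and its environ parameter are names invented for this example, and the demo uses a plain dict rather than the real environment.

```python
import os


def load_required_env(names, environ=None):
    """Return {name: value} for all names, or raise once listing every missing one."""
    environ = os.environ if environ is None else environ
    values = {name: environ.get(name, "") for name in names}
    # An empty string counts as missing, matching the `if not value` check above.
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return values


# Demo with an explicit mapping instead of the real environment.
demo_env = {"API_URL": "https://example.com", "API_TOKEN": ""}
try:
    load_required_env(["API_URL", "API_TOKEN", "APP_ENV"], environ=demo_env)
except RuntimeError as exc:
    error_message = str(exc)
    print(error_message)
```

Accepting the mapping as a parameter is a design choice that keeps the function testable without modifying os.environ.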

3. Running system commands

subprocess.run() is the most commonly used entry point for running commands.

python

import subprocess

result = subprocess.run(
    ["systemctl", "is-active", "nginx"],
    text=True,
    capture_output=True,
    check=False,
)

print(result.returncode)
print(result.stdout.strip())
print(result.stderr.strip())

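Parameters:

Parameter	Meaning
`["cmd", "arg"]`	Pass the command and its arguments as a list, which avoids shell string-parsing issues
`text=True`	Treat output as str rather than bytes
`capture_output=True`	Capture stdout and stderr
`check=False`	Do not raise on a non-zero exit code; the script checks it itself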
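
For contrast, the sketch below shows what the opposite choice, check=True, looks like: a non-zero exit code raises subprocess.CalledProcessError, which sits alongside the other exceptions subprocess.run() can raise. The helper name run_checked and its (ok, message) return shape are assumptions for illustration, and the Python interpreter stands in for a real command so the demo runs anywhere.

```python
import subprocess
import sys


def run_checked(args, timeout=10):
    """Run a command with check=True and turn failures into (ok, message)."""
    try:
        result = subprocess.run(
            args, text=True, capture_output=True, timeout=timeout, check=True
        )
    except FileNotFoundError:
        return False, f"command not found: {args[0]}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    except subprocess.CalledProcessError as exc:
        return False, f"exit code {exc.returncode}: {exc.stderr.strip()}"
    return True, result.stdout.strip()


# The interpreter stands in for a real command to keep the demo portable.
ok_result = run_checked([sys.executable, "-c", "print('active')"])
failed_result = run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok_result, failed_result)
```

Which style fits depends on the script: check=False keeps the exit code as ordinary data, while check=True suits cases where any failure should abort the current step.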

Wrap it in a general-purpose function:

python

import subprocess


def run_command(args, timeout=30):
    """Run a command and return the CompletedProcess.

    Raises subprocess.TimeoutExpired if the command exceeds timeout seconds.
    """
    return subprocess.run(
        args,
        text=True,
        capture_output=True,
        timeout=timeout,
        check=False,
    )


result = run_command(["df", "-h"])
if result.returncode != 0:
    print(f"command failed: {result.stderr.strip()}")
else:
    print(result.stdout)

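When a task needs pipes, redirection, or wildcards, it can often be split up so that Python processes the output, instead of reaching for shell=True. shell=True hands the command string to the shell for interpretation, so if external input ends up in the arguments, the risk of command injection is high.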
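
As an illustration of splitting a pipeline, the sketch below replaces the `| grep keyword` half of a pipe with a Python filter. The function names are hypothetical, and the command is a stand-in: the Python interpreter prints fixed sample lines in place of something like `ps -eo pid,comm`, so the result does not depend on what is running locally.

```python
import subprocess
import sys


def grep_lines(text, keyword):
    """Python stand-in for `| grep keyword`: keep lines containing keyword."""
    return [line for line in text.splitlines() if keyword in line]


def run_and_filter(args, keyword, timeout=30):
    """Run args without a shell and filter its stdout in Python."""
    result = subprocess.run(
        args, text=True, capture_output=True, timeout=timeout, check=True
    )
    return grep_lines(result.stdout, keyword)


# Stand-in command with fixed output (sample data, not real process info).
fake_ps = [sys.executable, "-c", "print('101 nginx'); print('202 sshd'); print('303 nginx')"]
matches = run_and_filter(fake_ps, "nginx")
print(matches)
```

Because the keyword is only ever used as a Python string, it never reaches a shell, so there is nothing for an attacker-supplied value to inject into.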

4. Parsing command output

Command output is plain text, so parse it by relying on fields that are as stable as possible.

python

import subprocess


def get_filesystem_usage():
    result = subprocess.run(
        ["df", "-P"],
        text=True,
        capture_output=True,
        check=True,
    )

    rows = []
    for line in result.stdout.splitlines()[1:]:
        filesystem, blocks, used, available, use_percent, mountpoint = line.split(maxsplit=5)
        rows.append(
            {
                "filesystem": filesystem,
                "used_percent": int(use_percent.rstrip("%")),
                "mountpoint": mountpoint,
            }
        )

    return rows


for item in get_filesystem_usage():
    if item["used_percent"] >= 80:
        print(f"{item['mountpoint']} used={item['used_percent']}%")

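df -P uses the POSIX output format, which is better suited to script parsing than the human-oriented df -h. Whenever a command offers a stable option for machine-readable output, prefer it.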
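
Taking the same idea one step further, some values can be read through the standard library without parsing any text at all. The sketch below uses shutil.disk_usage() as an alternative to df for a single path; get_usage_percent is a name made up for this example. The figure may differ slightly from the percentage df prints, since df can account for reserved blocks differently.

```python
import shutil


def get_usage_percent(path="/"):
    """Return used space of the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return round(usage.used / usage.total * 100, 1)


percent = get_usage_percent("/")
print(f"/ used={percent}%")
```

This covers one path at a time; listing every mounted filesystem still needs df or a similar source.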

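5. Log output

print() is fine for very small scripts. For anything slightly more formal, use logging, which can record the level, the timestamp, and the module.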

python

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("script started")
logging.warning("disk usage high")
logging.error("service check failed")

Write to a file:

python

import logging
from pathlib import Path

log_path = Path("/tmp/ops-script.log")

logging.basicConfig(
    filename=log_path,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("write log to file")

Logs should state the target and the result clearly:

python

logging.info("check_service host=%s service=%s status=%s", "web01", "nginx", "active")

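This style suits logging better than building the string by hand: when the log level filters a message out, the formatting cost is skipped as well.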
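
Scripts often want the same messages on the console and in a file. One way to sketch that is a small helper that attaches two handlers to a named logger; build_logger and the logger name "ops.check" are assumptions for this example, and the log file goes into a temporary directory rather than a real log location.

```python
import logging
import tempfile
from pathlib import Path


def build_logger(name, log_file):
    """Return a logger that writes to both stderr and log_file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    if not logger.handlers:  # calling twice must not stack handlers
        handlers = (logging.StreamHandler(), logging.FileHandler(log_file, encoding="utf-8"))
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


# Demo: log into a temporary directory created just for this run.
demo_log = Path(tempfile.mkdtemp()) / "ops-script.log"
logger = build_logger("ops.check", demo_log)
logger.info("check_service host=%s service=%s status=%s", "web01", "nginx", "active")
logger.debug("below INFO, so this message is dropped before formatting")

for handler in logger.handlers:
    handler.flush()
log_text = demo_log.read_text(encoding="utf-8")
```

The debug call never reaches the file because the logger level is INFO, which is the filtering behaviour described above.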

6. Reading ini configuration

configparser works well for simple ini configuration:

ini

[api]
url = https://example.com/health
timeout = 5

[check]
threshold = 80

Read it:

python

import configparser
from pathlib import Path

config = configparser.ConfigParser()
config.read(Path("/etc/ops/check.ini"), encoding="utf-8")

api_url = config["api"]["url"]
timeout = config.getint("api", "timeout")
threshold = config.getint("check", "threshold")

print(api_url, timeout, threshold)

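When a section or option is missing, configparser raises an exception: KeyError for config[...] lookups, NoSectionError or NoOptionError for the get*() methods. Note that config.read() silently skips a file that does not exist, so a wrong path first shows up as a missing section. Reading and validating configuration in one place at startup makes problems easier to track down than discovering a missing field halfway through a run.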
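
A startup check along these lines could look like the sketch below. It is only one possible shape: the REQUIRED table and the load_settings function are invented for this example, and the demo parses ini text from a string via read_string() so that it does not depend on /etc/ops/check.ini existing.

```python
import configparser

# Hypothetical list of keys this script cannot run without.
REQUIRED = {"api": ["url", "timeout"], "check": ["threshold"]}


def load_settings(text):
    """Parse ini text and return typed settings, or raise listing every missing key."""
    config = configparser.ConfigParser()
    config.read_string(text)

    # has_option() returns False for a missing section as well as a missing key.
    missing = [
        f"{section}.{key}"
        for section, keys in REQUIRED.items()
        for key in keys
        if not config.has_option(section, key)
    ]
    if missing:
        raise ValueError(f"missing config keys: {', '.join(missing)}")

    return {
        "api_url": config.get("api", "url"),
        "timeout": config.getint("api", "timeout"),
        "threshold": config.getint("check", "threshold"),
    }


good = "[api]\nurl = https://example.com/health\ntimeout = 5\n\n[check]\nthreshold = 80\n"
settings = load_settings(good)
try:
    load_settings("[api]\nurl = https://example.com/health\n")
except ValueError as exc:
    config_error = str(exc)
    print(config_error)
```

Returning a plain dict of already-converted values means the rest of the script never has to touch configparser again.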

7. Reading YAML configuration

YAML support is not part of the standard library; install PyYAML (the pyyaml package):

bash

uv add pyyaml

Example configuration:

yaml

targets:
  - name: nginx
    host: 127.0.0.1
    port: 80
  - name: mysql
    host: 127.0.0.1
    port: 3306

Read it:

python

from pathlib import Path

import yaml

config_path = Path("targets.yaml")
config = yaml.safe_load(config_path.read_text(encoding="utf-8"))

for target in config["targets"]:
    print(target["name"], target["host"], target["port"])

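For ordinary configuration, safe_load() is a better choice than load(): it only builds plain data types, so a YAML file cannot trigger deserialization of arbitrary Python objects.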
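
safe_load() returns plain dicts and lists, and None for an empty file, so the structure is worth checking before the loop runs. The sketch below validates an already-loaded object and therefore does not import yaml itself; validate_targets is a hypothetical helper, and the demo dicts mimic what safe_load() would return for the example configuration.

```python
def validate_targets(config):
    """Check the structure returned by yaml.safe_load and return the target list."""
    if not isinstance(config, dict) or not isinstance(config.get("targets"), list):
        raise ValueError("config must be a mapping with a 'targets' list")

    errors = []
    for index, target in enumerate(config["targets"]):
        if not isinstance(target, dict):
            errors.append(f"targets[{index}]: expected a mapping")
            continue
        for key, expected in (("name", str), ("host", str), ("port", int)):
            if not isinstance(target.get(key), expected):
                errors.append(f"targets[{index}].{key}: expected {expected.__name__}")
    if errors:
        raise ValueError("; ".join(errors))
    return config["targets"]


# Sample data shaped like the safe_load() result for the YAML above.
loaded = {"targets": [{"name": "nginx", "host": "127.0.0.1", "port": 80}]}
targets = validate_targets(loaded)

try:
    validate_targets({"targets": [{"name": "mysql", "port": "3306"}]})
except ValueError as exc:
    yaml_error = str(exc)
    print(yaml_error)
```

The second call fails twice over: host is absent and port is a quoted string instead of an integer, and both problems are reported together.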

8. Temporary files and lock files

When a script needs to create a temporary file, use tempfile:

python

import tempfile

# delete=False keeps the file after the with block; remove it yourself when done
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as file:
    file.write("temporary data\n")
    print(file.name)

A simple lock file can keep the same scheduled script from running twice at once:

python

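from pathlib import Path
import os

lock_path = Path("/tmp/check-service.lock")

try:
    # Mode "x" creates the file atomically and fails if it already exists,
    # so two runs cannot both pass an exists() check at the same moment
    with lock_path.open("x", encoding="utf-8") as lock_file:
        # Write the PID so it is clear which process created the lock
        lock_file.write(str(os.getpid()))
except FileExistsError:
    raise SystemExit(f"lock exists: {lock_path}")

try:
    print("running job")
finally:
    lock_path.unlink(missing_ok=True)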

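This kind of lock file suits simple single-host scenarios. Cross-machine locking, long-running jobs, and automatic cleanup after an abnormal exit are more complex requirements that call for a database lock, a Redis lock, or a scheduling system.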
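
For the single-host case on Linux or other Unix systems, an advisory lock taken with fcntl.flock() is one option that avoids stale lock files, because the kernel drops the lock when the process exits, however it exits. The following is a sketch under that assumption, not a replacement for the cross-machine options above; acquire_lock and the demo path are names invented for this example.

```python
import fcntl
import os
import tempfile
from pathlib import Path


def acquire_lock(lock_path):
    """Try to take an exclusive advisory lock; return the open file or None."""
    lock_file = open(lock_path, "a+", encoding="utf-8")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    # Record the PID for troubleshooting; the lock itself lives in the kernel.
    lock_file.seek(0)
    lock_file.truncate()
    lock_file.write(str(os.getpid()))
    lock_file.flush()
    return lock_file


demo_lock = Path(tempfile.gettempdir()) / f"check-service-demo-{os.getpid()}.lock"
first = acquire_lock(demo_lock)
second = acquire_lock(demo_lock)  # separate open() of the same file: refused
got_first, got_second = first is not None, second is not None

first.close()  # closing the file (or exiting the process) releases the lock
third = acquire_lock(demo_lock)
got_third = third is not None
third.close()
demo_lock.unlink()
print(got_first, got_second, got_third)
```

The returned file object has to stay open for as long as the job runs, since closing it is what releases the lock.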

9. Script arguments

argparse parses command-line arguments:

python

import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="check service port")
    parser.add_argument("--host", required=True, help="target host")
    parser.add_argument("--port", required=True, type=int, help="target port")
    parser.add_argument("--timeout", type=int, default=3, help="connect timeout seconds")
    return parser.parse_args()


args = parse_args()
print(args.host, args.port, args.timeout)

Run it:

bash

uv run python check_port.py --host 127.0.0.1 --port 22 --timeout 2

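Arguments make a script more reusable than hardcoded values. Hosts, ports, thresholds, and configuration file paths change often, so they are good candidates for arguments or configuration.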
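
Arguments can also be validated at parse time. The sketch below is a variation on the parser above with two assumed additions: a custom type function, port_number, that rejects ports outside 1..65535, and an optional argv parameter so the parser can be exercised without a real command line.

```python
import argparse


def port_number(value):
    """argparse type function: accept only integers in 1..65535."""
    port = int(value)  # a ValueError here is reported by argparse as a usage error
    if not 1 <= port <= 65535:
        raise argparse.ArgumentTypeError(f"port out of range: {port}")
    return port


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="check service port")
    parser.add_argument("--host", required=True, help="target host")
    parser.add_argument("--port", required=True, type=port_number, help="target port")
    parser.add_argument("--timeout", type=int, default=3, help="connect timeout seconds")
    # argv=None makes argparse read sys.argv[1:]; passing a list allows testing.
    return parser.parse_args(argv)


args = parse_args(["--host", "127.0.0.1", "--port", "22"])
print(args.host, args.port, args.timeout)
```

With an out-of-range value such as --port 70000, argparse prints a usage message and exits with status 2 before any check logic runs.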

10. A systemd status check script

python

#!/usr/bin/env python3
"""Check the state of a systemd service."""

import argparse
import logging
import subprocess
import sys


def parse_args():
    parser = argparse.ArgumentParser(description="check systemd service state")
    parser.add_argument("service", help="systemd service name, for example nginx")
    return parser.parse_args()


def get_service_state(service):
    result = subprocess.run(
        ["systemctl", "is-active", service],
        text=True,
        capture_output=True,
        check=False,
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()


def main():
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    args = parse_args()

    returncode, stdout, stderr = get_service_state(args.service)
    if returncode == 0 and stdout == "active":
        logging.info("service=%s state=active", args.service)
        return 0

    logging.error("service=%s state=%s error=%s", args.service, stdout, stderr)
    return 1


if __name__ == "__main__":
    sys.exit(main())

Run it:

bash

uv run python check_systemd.py nginx
echo $?

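This script brings together arguments, command execution, logging, and exit codes, and is already close to the basic shape of an everyday inspection script.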
