Look What I Did to the Servers!

Deep Learning Env by CUDA and PyTorch

  • Using 3090Server as the example
  • 2024.06.26
  • 3090Server has since been upgraded to Ubuntu 20.04.6 LTS, but that has little effect on the notes below
  • Ubuntu 18.04.6 LTS
  • gcc version 7.5.0
  • CUDA 11.3
  • cuDNN 8.9.5

Server Basic Info

Welcome to Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-150-generic x86_64)
Model name: Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

Personal Dir

houjinliang@3090server:~$ ll
total 40
drwxr-xr-x 4 houjinliang houjinliang 4096 6月 26 10:18 ./
drwxrwxrwx 21 super super 4096 6月 26 10:17 ../
-rw-r--r-- 1 houjinliang houjinliang 220 4月 5 2018 .bash_logout
-rw-r--r-- 1 houjinliang houjinliang 3771 4月 5 2018 .bashrc
drwx------ 2 houjinliang houjinliang 4096 6月 26 10:18 .cache/
-rw-r--r-- 1 houjinliang houjinliang 8980 4月 16 2018 examples.desktop
drwx------ 3 houjinliang houjinliang 4096 6月 26 10:18 .gnupg/
-rw-r--r-- 1 houjinliang houjinliang 807 4月 5 2018 .profile

Miniconda

Download the Miniconda installer script, make it executable, then run it.

houjinliang@3090server:~/MyDownloadFiles$ wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
houjinliang@3090server:~/MyDownloadFiles$ chmod +x Miniconda3-latest-Linux-x86_64.sh
houjinliang@3090server:~/MyDownloadFiles$ ./Miniconda3-latest-Linux-x86_64.sh

The installer asks where to install; just accept the default path.

# Default install location
Miniconda3 will now be installed into this location:
/mnt/houjinliang/miniconda3

- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below

[/mnt/houjinliang/miniconda3] >>>

conda init

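  • Answer yes here and the installer configures ~/.bashrc automatically. Close the terminal and open a new one, and (base) appears in front of the prompt.

  • If you answer no, the conda command is not recognized in the terminal after installation. You then have to set up the environment yourself by adding the content below to ~/.bashrc by hand.

# Configure the Miniconda environment variables (adjust to your setup)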
(base) houjinliang@3090server:~$ vim ~/.bashrc

# If you are not comfortable with vim, open ~/.bashrc in VS Code instead, then append the Miniconda configuration at the end of the file

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/mnt/houjinliang/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/mnt/houjinliang/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/mnt/houjinliang/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/mnt/houjinliang/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

# Reload the shell configuration; (base) should now appear
houjinliang@3090server:~$ source ~/.bashrc
(base) houjinliang@3090server:~$

Check Miniconda's basic information.

# Show Miniconda's basic information
(base) houjinliang@3090server:~$ conda info

active environment : base
active env location : /mnt/houjinliang/miniconda3
shell level : 1
user config file : /mnt/houjinliang/.condarc
populated config files :
conda version : 24.4.0
conda-build version : not installed
python version : 3.12.3.final.0
solver : libmamba (default)
virtual packages : __archspec=1=broadwell
__conda=24.4.0=0
__cuda=11.7=0
__glibc=2.27=0
__linux=5.4.0=0
__unix=0=0
base environment : /mnt/houjinliang/miniconda3 (writable)
conda av data dir : /mnt/houjinliang/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /mnt/houjinliang/miniconda3/pkgs
/mnt/houjinliang/.conda/pkgs
envs directories : /mnt/houjinliang/miniconda3/envs
/mnt/houjinliang/.conda/envs
platform : linux-64
user-agent : conda/24.4.0 requests/2.31.0 CPython/3.12.3 Linux/5.4.0-150-generic ubuntu/18.04.6 glibc/2.27 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8 aau/0.4.4 c/. s/. e/.
UID:GID : 1035:1035
netrc file : None
offline mode : False

Optionally, switch conda or pip to a mirror.

(base) houjinliang@3090server:~$ conda config --show-sources
==> /mnt/houjinliang/.condarc <==
auto_activate_base: True
  • To switch the pip index I chose the Aliyun mirror. This can be done directly with a command, as follows.
(base) houjinliang@3090server:~$ pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
Writing to /mnt/houjinliang/.config/pip/pip.conf
  • Alternatively, edit ~/.config/pip/pip.conf (create it if it does not exist) so that it contains:
(base) houjinliang@3090server:~$ cat ~/.config/pip/pip.conf
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/
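
The same pip.conf edit can also be scripted. Below is a minimal sketch, assuming the file uses the INI layout shown above; the helper name `write_pip_index` is my own and not part of pip. It sets `index-url` under `[global]` and keeps any other settings already in the file.

```python
import configparser
from pathlib import Path


def write_pip_index(conf_path: Path, index_url: str) -> None:
    """Set [global] index-url in a pip config file, keeping other settings."""
    # interpolation=None so that '%' characters in URLs are taken literally
    parser = configparser.ConfigParser(interpolation=None)
    if conf_path.exists():
        parser.read(conf_path)
    if not parser.has_section("global"):
        parser.add_section("global")
    parser.set("global", "index-url", index_url)
    # Create ~/.config/pip (or whatever parent directory) if it is missing
    conf_path.parent.mkdir(parents=True, exist_ok=True)
    with conf_path.open("w") as fh:
        parser.write(fh)


# Example call (hypothetical usage, mirrors the command above):
# write_pip_index(Path.home() / ".config" / "pip" / "pip.conf",
#                 "https://mirrors.aliyun.com/pypi/simple/")
```

Running `pip config list` afterwards should show the new `global.index-url` value.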

NV Driver

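  • The driver here came preinstalled on the server, so when installing CUDA into your own directory you can leave the driver component unselected; it makes little difference.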
(base) houjinliang@3090server:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

CUDA 11.3.1 & cuDNN 8.9.5

# Download the CUDA runfile installer
(base) houjinliang@3090server:~/MyDownloadFiles$ wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run

# cuda_11.3.1_465.19.01_linux.run has no execute (x) permission yet, so add it
(base) houjinliang@3090server:~/MyDownloadFiles$ ll
total 3224920
drwxrwxr-x 2 houjinliang houjinliang 4096 6月 26 11:10 ./
drwxr-xr-x 10 houjinliang houjinliang 4096 6月 26 11:04 ../
-rw-rw-r-- 1 houjinliang houjinliang 3158494112 5月 14 2021 cuda_11.3.1_465.19.01_linux.run
-rwxrwxr-x 1 houjinliang houjinliang 143808873 5月 21 02:15 Miniconda3-latest-Linux-x86_64.sh*

(base) houjinliang@3090server:~/MyDownloadFiles$ chmod +x cuda_11.3.1_465.19.01_linux.run
(base) houjinliang@3090server:~/MyDownloadFiles$ ll
total 3224920
drwxrwxr-x 2 houjinliang houjinliang 4096 6月 26 11:10 ./
drwxr-xr-x 10 houjinliang houjinliang 4096 6月 26 11:04 ../
-rwxrwxr-x 1 houjinliang houjinliang 3158494112 5月 14 2021 cuda_11.3.1_465.19.01_linux.run*
-rwxrwxr-x 1 houjinliang houjinliang 143808873 5月 21 02:15 Miniconda3-latest-Linux-x86_64.sh*

# Run the installer
(base) houjinliang@3090server:~/MyDownloadFiles$ ./cuda_11.3.1_465.19.01_linux.run

[Screenshot: cuda run]

Don't be alarmed when this appears; just choose Continue, then follow the steps in the screenshots.

[Screenshots: CUDA installer steps, captured 2024-06-26]

Installation complete

(base) houjinliang@3090server:~/MyDownloadFiles$ ./cuda_11.3.1_465.19.01_linux.run
===========
= Summary =
===========

Driver: Not Selected
Toolkit: Installed in /mnt/houjinliang/cuda-11.3/
Samples: Not Selected

Please make sure that
- PATH includes /mnt/houjinliang/cuda-11.3/bin
- LD_LIBRARY_PATH includes /mnt/houjinliang/cuda-11.3/lib64, or, add /mnt/houjinliang/cuda-11.3/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /mnt/houjinliang/cuda-11.3/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 465.00 is required for CUDA 11.3 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver

Logfile is /tmp/cuda-installer.log

After installation it is best to delete /tmp/cuda-installer.log. If it is left behind, it interferes with later installs by other users, so remove it to stay out of their way.

rm /tmp/cuda-installer.log

Configure the CUDA Toolkit environment variables, using vim or VS Code.

# Configure the environment variables
(base) houjinliang@3090server:~$ vim ~/.bashrc
# >>> cuda environment variables >>>
# murphy insert
export CUDA_HOME=/mnt/houjinliang/cuda-11.3
export CUDA_PATH=$CUDA_HOME
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# <<< cuda environment variables <<<

# Reload the environment variables
(base) houjinliang@3090server:~$ source ~/.bashrc

# CUDA Check
(base) houjinliang@3090server:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

# CUDA Check
(base) houjinliang@3090server:~$ which nvcc
/mnt/houjinliang/cuda-11.3/bin/nvcc
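
The `nvcc -V` check can be automated when the same setup is repeated on several servers. A minimal sketch, assuming the "release X.Y" line format shown in the output above; the function names are my own.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' line of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_nvcc_release() -> Optional[Tuple[int, int]]:
    """Run the nvcc found on PATH; return None when nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "-V"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)
```

If `installed_nvcc_release()` returns `None` right after editing ~/.bashrc, the PATH export has probably not been loaded in the current shell yet.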

Installing cuDNN. Downloading cuDNN requires logging in with an account on NVIDIA's website, so here I simply reuse the archive I had already downloaded for an earlier install.

(base) houjinliang@3090server:~/MyDownloadFiles$ ll
total 4062292
drwxrwxr-x 2 houjinliang houjinliang 4096 6月 26 11:56 ./
drwxr-xr-x 11 houjinliang houjinliang 4096 6月 26 11:48 ../
-rwxrwxr-x 1 houjinliang houjinliang 3158494112 5月 14 2021 cuda_11.3.1_465.19.01_linux.run*
-rw-rw-r-- 1 houjinliang houjinliang 857460936 6月 26 11:57 cudnn-linux-x86_64-8.9.5.29_cuda11-archive.tar.xz
-rwxrwxr-x 1 houjinliang houjinliang 143808873 5月 21 02:15 Miniconda3-latest-Linux-x86_64.sh*

# Extract the cuDNN archive
(base) houjinliang@3090server:~/MyDownloadFiles$ tar xvJf cudnn-linux-x86_64-8.9.5.29_cuda11-archive.tar.xz

# Look at the extracted cuDNN files
(base) houjinliang@3090server:~/MyDownloadFiles$ cd cudnn-linux-x86_64-8.9.5.29_cuda11-archive/
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ ll
total 48
drwxr-xr-x 4 houjinliang houjinliang 4096 9月 7 2023 ./
drwxrwxr-x 3 houjinliang houjinliang 4096 6月 26 11:58 ../
drwxr-xr-x 2 houjinliang houjinliang 4096 9月 7 2023 include/
drwxr-xr-x 2 houjinliang houjinliang 4096 9月 7 2023 lib/
-rw-r--r-- 1 houjinliang houjinliang 29662 9月 7 2023 LICENSE

# Copy the cuDNN files into the CUDA directory
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cp lib/* ~/cuda-11.3/lib64/
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cp include/* ~/cuda-11.3/include
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ chmod +x ~/cuda-11.3/include/cudnn.h
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ chmod +x ~/cuda-11.3/lib64/libcudnn*

# Check the cuDNN version to confirm the copy succeeded
(base) houjinliang@3090server:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.5.29_cuda11-archive$ cat ~/cuda-11.3/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */
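
The grep above can be turned into a small check that reads the three `#define` lines. This is only a sketch, assuming the header keeps the layout shown in the output; `parse_cudnn_version` is a name I made up.

```python
import re
from typing import Tuple


def parse_cudnn_version(header_text: str) -> Tuple[int, int, int]:
    """Read CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL from cudnn_version.h text."""
    values = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Only match lines of the form "#define NAME <number>"
        match = re.search(rf"^#define\s+{name}\s+(\d+)\s*$", header_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in header text")
        values.append(int(match.group(1)))
    return values[0], values[1], values[2]
```

For example, `parse_cudnn_version(open(path).read())` with `path` pointing at `~/cuda-11.3/include/cudnn_version.h` (expanded to a full path) should give the same 8 / 9 / 5 triple that the grep prints.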

Git & Github

  • Ubuntu usually comes with Git preinstalled, so there is normally no need to install it yourself; it can be used as-is.
git config --global user.name "SSH keys Name"
git config --global user.email "SSH keys Email"
ssh-keygen -t rsa -C "Email of Github Account"

# Set the user configuration
(base) houjinliang@3080server:~$ git config --global user.name 'hjl_3080server'
(base) houjinliang@3080server:~$ git config --global user.email 'cosmicdustycn@outlook.com'

# Show the user configuration
(base) houjinliang@3090server:~$ git config user.name
hjl_3090server
(base) houjinliang@3090server:~$ git config user.email
cosmicdustycn@outlook.com

# Generate the SSH key pair used to connect local Git to GitHub
(base) houjinliang@3080server:~$ ssh-keygen -t rsa -C "cosmicdustycn@outlook.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/mnt/houjinliang/.ssh/id_rsa):
Created directory '/mnt/houjinliang/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /mnt/houjinliang/.ssh/id_rsa.
Your public key has been saved in /mnt/houjinliang/.ssh/id_rsa.pub.
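  • There is no need to worry that this Git user configuration affects other users on the server. The key pair generated by ssh-keygen is stored under your own account's home directory.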
(dlpy310pth113) houjinliang@3080server:~/.ssh$ pwd
/mnt/houjinliang/.ssh
(dlpy310pth113) houjinliang@3080server:~/.ssh$ ll
总用量 20
drwx------ 2 houjinliang houjinliang 4096 11月 1 10:19 ./
drwxr-xr-x 12 houjinliang houjinliang 4096 11月 1 10:17 ../
-rw------- 1 houjinliang houjinliang 1675 11月 1 10:17 id_rsa
-rw-r--r-- 1 houjinliang houjinliang 407 11月 1 10:17 id_rsa.pub
-rw-r--r-- 1 houjinliang houjinliang 444 11月 1 10:19 known_hosts
  • Copy the contents of id_rsa.pub into SSH Keys under GitHub Settings.

[Screenshots: GitHub SSH key settings]

ssh -T git@github.com
Hi murphyhoucn! You've successfully authenticated, but GitHub does not provide shell access.
(base) houjinliang@3080server:~/userdoc$ git clone git@github.com:murphyhoucn/XXX.git
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git status
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git add .
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git commit -m "add new file"
(base) houjinliang@3080server:~/userdoc/DeepLearningforCV$ git push

GPU Usage Tools

nvidia-smi

[Screenshot: nvidia-smi]

gpustat

(dlpy310pth113) houjinliang@3080server:~/userdoc$ pip install gpustat
(dlpy310pth113) houjinliang@3080server:~/userdoc$ gpustat

[Screenshot: gpustat]

nvitop

(dlpy310pth113) houjinliang@3080server:~$ pip install nvitop
Requirement already satisfied: nvitop in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (1.3.0)
Requirement already satisfied: nvidia-ml-py<12.536.0a0,>=11.450.51 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (12.535.108)
Requirement already satisfied: psutil>=5.6.6 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (5.9.5)
Requirement already satisfied: cachetools>=1.0.1 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (5.3.1)
Requirement already satisfied: termcolor>=1.0.0 in ./miniconda3/envs/dlpy310pth113/lib/python3.10/site-packages (from nvitop) (2.3.0)

[Screenshot: nvitop]

Server Comp

  • 2025.09.16

2080ti

OS Basic Info

(base) houjinliang@ubuntu-server:~$ uname -a
Linux ubuntu-server 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
(base) houjinliang@ubuntu-server:~$ uname -s
Linux
(base) houjinliang@ubuntu-server:~$ uname -n
ubuntu-server
(base) houjinliang@ubuntu-server:~$ uname -r
5.8.0-59-generic
(base) houjinliang@ubuntu-server:~$ uname -v
#66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021
(base) houjinliang@ubuntu-server:~$ uname -p
x86_64

(base) houjinliang@ubuntu-server:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal

Dev Info

(base) houjinliang@ubuntu-server:~$ pwd
/home/data/houjinliang

(base) houjinliang@ubuntu-server:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

(base) houjinliang@ubuntu-server:~$ which nvcc
/home/data/houjinliang/cuda-11.3/bin/nvcc

(base) houjinliang@ubuntu-server:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 460.80 Fri May 7 06:55:54 UTC 2021
GCC version:

(base) houjinliang@ubuntu-server:~$ gcc --version
gcc (Ubuntu 8.4.0-3ubuntu2) 8.4.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

3080Server

OS Info

(base) houjinliang@3080server:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Dev Info

(base) houjinliang@3080server:~$ which nvcc
/mnt/houjinliang/cuda-11.3/bin/nvcc
(base) houjinliang@3080server:~$ pwd
/mnt/houjinliang
(base) houjinliang@3080server:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
(base) houjinliang@3080server:~$ which nvcc
/mnt/houjinliang/cuda-11.3/bin/nvcc

3090Server

  • Ubuntu 20.04.6 LTS
  • cuda_11.3
  • cudnn_8.9.5

OS Info

(base) houjinliang@3090server:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal

Dev Info

(base) houjinliang@3090server:~$ which nvcc
/mnt/houjinliang/cuda-11.3/bin/nvcc

(base) houjinliang@3090server:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

(base) houjinliang@3090server:~$ cat ~/cuda-11.3/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */

3090Server2

  • Ubuntu 20.04.5 LTS
  • CUDA 11.3
  • cuDNN 8.9.5

OS Info

(base) houjinliang@3090server2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal

Dev Info

(base) houjinliang@3090server2:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)

(base) houjinliang@3090server2:~$ which nvcc
/data/houjinliang/cuda-11.3/bin/nvcc

(base) houjinliang@3090server2:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

(base) houjinliang@3090server2:~$ cat ~/cuda-11.3/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */

4090Server

  • Ubuntu 22.04.2 LTS
  • CUDA11.6 : cuda_11.6.2_510.47.03_linux.run
  • cuDNN 8.9.5: cudnn-linux-x86_64-8.9.5.29_cuda11-archive.tar.xz

OS Info

(base) houjinliang@4090server:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy

Dev Info

(sr_benchmark) houjinliang@4090server:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.183.06 Wed Jun 26 06:46:07 UTC 2024
GCC version: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)


(base) houjinliang@4090server:~$ which nvcc
/mnt/houjinliang/cuda-11.6/bin/nvcc

(base) houjinliang@4090server:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

(base) houjinliang@4090server:~$ cat ~/cuda-11.6/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 5
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */

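PyTorch installed earlier under CUDA 11.3, brought over directly with conda-pack

As for the conda env, I packed the environment from the previous server with conda-pack, copied it over with scp, and unpacked it into the matching directory. The old server had CUDA 11.3 and the torch build is cu113 as well, yet it also works on this CUDA 11.6 server (so I'll just keep using it for now?!).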

Problem: Failed to initialize NVML: Driver/library version mismatch

The environment ran fine for a long time, and then one day this error suddenly showed up when I ran a program!

ERROR: cuda is not available, try running on CPU

This error message is one I wrote in my own program. The system's CUDA is no longer available?! What is going on?!

(base) houjinliang@4090server:~$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.216

(base) houjinliang@4090server:~$ nvitop
NVML ERROR: RM has detected an NVML/RM version mismatch.

(base) houjinliang@4090server:~$ gpustat
Error on querying NVIDIA devices. Use --debug flag to see more details.
RM has detected an NVML/RM version mismatch.

(base) houjinliang@4090server:~$ gpustat --debug
Error on querying NVIDIA devices. Use --debug flag to see more details.
RM has detected an NVML/RM version mismatch.

Traceback (most recent call last):
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/gpustat/cli.py", line 58, in print_gpustat
gpu_stats = GPUStatCollection.new_query(debug=debug, id=id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/gpustat/core.py", line 402, in new_query
N.nvmlInit()
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/pynvml.py", line 1947, in nvmlInit
nvmlInitWithFlags(0)
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/pynvml.py", line 1937, in nvmlInitWithFlags
_nvmlCheckReturn(ret)
File "/mnt/houjinliang/miniconda3/lib/python3.12/site-packages/pynvml.py", line 899, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_LibRmVersionMismatch: RM has detected an NVML/RM version mismatch.

(sr_benchmark) houjinliang@4090server:~$ python
Python 3.8.19 (default, Mar 20 2024, 19:58:24)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/mnt/houjinliang/miniconda3/envs/sr_benchmark/lib/python3.8/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False
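
The logs above show two different version numbers: nvidia-smi reports NVML library version 535.216, while the earlier `cat /proc/driver/nvidia/version` output on this machine showed kernel module 535.183.06. As far as I know, one common cause of this error is that the user-space driver libraries were upgraded (for example by a package update) while the older kernel module is still loaded; I have not confirmed that this is what happened here. The sketch below only compares the two strings. The function names are my own, and comparing just the leading components is an assumption, because NVML prints a shorter version string than the kernel module does.

```python
import re
from typing import Optional


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Pull the driver version out of /proc/driver/nvidia/version text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def versions_match(kernel_version: str, nvml_version: str) -> bool:
    """Compare only as many dotted components as the shorter string has
    (assumption: '535.216' and '535.216.01' refer to the same driver)."""
    kernel_parts = kernel_version.split(".")
    nvml_parts = nvml_version.split(".")
    shared = min(len(kernel_parts), len(nvml_parts))
    return kernel_parts[:shared] == nvml_parts[:shared]
```

With the values from the logs, `versions_match("535.183.06", "535.216")` is false, which is consistent with the mismatch that nvidia-smi complains about.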

4090Server2

  • Ubuntu 22.04.3 LTS
  • gcc version 12.3.0

OS Info

(base) houjinliang@4090server2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy

Dev Info

(base) houjinliang@4090server2:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.107.02 Wed Jul 24 23:53:00 UTC 2024
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)

(base) houjinliang@4090server2:~$ which nvcc
/data/houjinliang/cuda-12.4/bin/nvcc

(base) houjinliang@4090server2:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

(base) houjinliang@4090server2:~$ cat ~/cuda-12.4/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 7
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */
  • CUDA 12.4.1

CUDA Toolkit 12.4 Update 1 Downloads | NVIDIA Developer

(base) houjinliang@4090server2:~/MyDownloadFiles$ wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
  • cuDNN: cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz

https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.7/local_installers/12.x/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz/

(base) houjinliang@4090server2:~/MyDownloadFiles$ tar xvJf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz

(base) houjinliang@4090server2:~/MyDownloadFiles$ cd cudnn-linux-x86_64-8.9.7.29_cuda12-archive/
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ ll
total 48
drwxr-xr-x 4 houjinliang houjinliang 4096 11月 30 2023 ./
drwxrwxr-x 3 houjinliang houjinliang 4096 10月 24 22:53 ../
drwxr-xr-x 2 houjinliang houjinliang 4096 11月 30 2023 include/
drwxr-xr-x 2 houjinliang houjinliang 4096 11月 30 2023 lib/
-rw-r--r-- 1 houjinliang houjinliang 29662 11月 30 2023 LICENSE

(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ cp lib/* ~/cuda-12.4/lib64/
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ cp include/* ~/cuda-12.4/include
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ chmod +x ~/cuda-12.4/include/cudnn.h
(base) houjinliang@4090server2:~/MyDownloadFiles/cudnn-linux-x86_64-8.9.7.29_cuda12-archive$ chmod +x ~/cuda-12.4/lib64/libcudnn*

git install

This server has no git, so install one from a .deb package.

(base) houjinliang@4090server2:~/MyDownloadFiles$ wget http://archive.ubuntu.com/ubuntu/pool/main/g/git/git_2.34.1-1ubuntu1.11_amd64.deb
(base) houjinliang@4090server2:~/MyDownloadFiles$ cd ~

# First create a directory, then install git into it
(base) houjinliang@4090server2:~$ mkdir git
(base) houjinliang@4090server2:~$ dpkg -x ./MyDownloadFiles/git_2.34.1-1ubuntu1.11_amd64.deb ./git

(base) houjinliang@4090server2:~$ cd git/
(base) houjinliang@4090server2:~/git$ ll
total 20
drwxr-xr-x 5 houjinliang houjinliang 4096 5月 20 20:14 ./
drwxr-x--- 14 houjinliang houjinliang 4096 10月 24 23:22 ../
drwxr-xr-x 3 houjinliang houjinliang 4096 5月 20 20:14 etc/
drwxr-xr-x 5 houjinliang houjinliang 4096 5月 20 20:14 usr/
drwxr-xr-x 3 houjinliang houjinliang 4096 5月 20 20:14 var/
(base) houjinliang@4090server2:~$ vim ~/.bashrc
# >>> git environment variables >>>
# murphy insert
export PATH=/data/houjinliang/git/usr/bin:$PATH
# <<< git environment variables <<<
(base) houjinliang@4090server2:~$ source ~/.bashrc
(base) houjinliang@4090server2:~$ which git
/data/houjinliang/git/usr/bin/git

(base) houjinliang@4090server2:~$ git --version
git version 2.34.1

(base) houjinliang@4090server2:~$ git config user.name
hjl_4090server2

conda env

Although the CUDA toolkit on 4090server2 is 12.4, I still used the sr_benchmark environment that was set up on the 3080 server.

(base) houjinliang@4090server2:~$ mkdir ~/miniconda3/envs/sr_benchmark
(base) houjinliang@4090server2:~$ tar -xzvf ./MyDownloadFiles/sr_benchmark.tar.gz -C ~/miniconda3/envs/sr_benchmark
(base) houjinliang@4090server2:~$ conda env list
# conda environments:
#
base * /data/houjinliang/miniconda3
sr_benchmark /data/houjinliang/miniconda3/envs/sr_benchmark

(base) houjinliang@4090server2:~$
(base) houjinliang@4090server2:~$ conda activate sr_benchmark
(sr_benchmark) houjinliang@4090server2:~$ python
Python 3.8.19 (default, Mar 20 2024, 19:58:24)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>>

# The torch version is still the same as before
torch 1.10.1+cu113
torchvision 0.11.2+cu113
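
Why does a cu113 build still work on a machine whose toolkit is 12.4? As far as I understand, PyTorch binaries tagged `+cuXYZ` ship their own CUDA runtime libraries, so at run time it is mostly the driver that matters rather than the toolkit under ~/cuda-12.4. The sketch below just decodes that tag; `cuda_from_torch_version` is a name I made up, and the digit-splitting rule is an assumption based on tags like cu113.

```python
import re
from typing import Optional, Tuple


def cuda_from_torch_version(version: str) -> Optional[Tuple[int, int]]:
    """Map a version string such as '1.10.1+cu113' to (11, 3).

    Returns None when there is no '+cuXYZ' suffix (for example CPU-only builds).
    """
    match = re.search(r"\+cu(\d{2,3})$", version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major
    return int(digits[:-1]), int(digits[-1])
```

Inside the environment, `cuda_from_torch_version(torch.__version__)` can be compared against what `torch.version.cuda` reports.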

3080Server - MMYOLO

Overview — MMYOLO 0.6.0 documentation

(base) houjinliang@3080server:~$ conda create -n py38mmyolo python=3.8

(base) houjinliang@3080server:~$ conda activate py38mmyolo
(py38mmyolo) houjinliang@3080server:~$ pip config list
global.index-url='https://mirrors.aliyun.com/pypi/simple'

(py38mmyolo) houjinliang@3080server:~$ conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch

(py38mmyolo) houjinliang@3080server:~$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
1.11.0
True
pip install -U openmim
mim install "mmengine>=0.6.0"
mim install "mmcv>=2.0.0rc4,<2.1.0"
mim install "mmdet>=3.0.0,<4.0.0"
mim install "mmyolo"
git clone https://github.com/open-mmlab/mmyolo.git
cd mmyolo
# Install albumentations
pip install -r requirements/albu.txt
# Install MMYOLO
mim install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.
(base) houjinliang@3080server:~/userdoc/offlinefile$ wget  http://images.cocodataset.org/zips/val2017.zip
--2024-01-10 16:17:46-- http://images.cocodataset.org/zips/val2017.zip
正在解析主机 images.cocodataset.org (images.cocodataset.org)... 3.5.7.141, 52.216.215.25, 52.216.185.83, ...
正在连接 images.cocodataset.org (images.cocodataset.org)|3.5.7.141|:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度: 815585330 (778M) [application/zip]
正在保存至: “val2017.zip”

val2017.zip 100%[===================================================================================================================>] 777.80M 3.89MB/s 用时 2m 22ss
2024-01-10 16:20:08 (5.48 MB/s) - 已保存 “val2017.zip” [815585330/815585330])

Ref Links

CUDA Toolkit and Corresponding Driver Versions

CUDA 12.6 Update 2 Release Notes


GCC and CUDA version compatibility

  • 3080Server - gcc 7.5.0 (Ubuntu 18.04.6 LTS)-> CUDA 11.3
  • 3090Server - gcc 7.5.0 (Ubuntu 18.04.6 LTS)-> CUDA 11.3
  • 3090Server2 - gcc 9.4.0 (Ubuntu 20.04.5 LTS)-> CUDA 11.3
  • 4090Server - gcc 11.4.0 (Ubuntu 22.04.2 LTS)-> CUDA 11.6
  • 4090Server2 - gcc 12.3.0 (Ubuntu 22.04.3 LTS)-> CUDA 12.4


https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

cuDNN docs


CUDA Toolkit Archive


CUDA Toolkit Archive | NVIDIA Developer

cuDNN Archive


Docker

Docker Install

Requires an administrator account!

  • Install with APT (see online tutorials for the detailed steps)
# Step 1
sudo apt update
sudo apt install \
apt-transport-https \
ca-certificates \
curl \
gnupg \
lsb-release

# Step 2
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# Step 3
echo \
"deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Step 4
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io

# Step 5
sudo systemctl enable docker
sudo systemctl start docker
  • To let non-administrator users run docker as well, create a user group and grant access to its members
# Create the docker group
sudo groupadd docker

# Add the current user to the docker group
sudo usermod -aG docker $USER
# Add user xxx to the docker group
sudo usermod -aG docker xxxxxxxx

# List the members of the docker group - method 1
getent group docker
# List the members of the docker group - method 2
grep '^docker:' /etc/group
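
The membership lookup can also be done from Python with the standard `grp` module, which reads the same group database as `getent`. A minimal sketch; the function names are my own.

```python
import grp
from typing import List


def group_members(group_name: str = "docker") -> List[str]:
    """Return the listed members of a group, or [] if the group does not exist."""
    try:
        return list(grp.getgrnam(group_name).gr_mem)
    except KeyError:
        # getgrnam raises KeyError when there is no such group
        return []


def is_member(user: str, group_name: str = "docker") -> bool:
    """True when the user is listed as a member of the group."""
    return user in group_members(group_name)
```

Note that a user added with `usermod -aG` only picks up the new group in a fresh login session, so being listed here does not guarantee that an already open shell can talk to docker.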

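Configuring Docker proxies

Configuring the Docker proxy requires an administrator account.

For the network proxy itself, see Look What I Did to the Servers! - MurphyHou (cosmicdusty.cc)

1. Configure registry mirrors (many mirrors no longer work)
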
sudo vim /etc/docker/daemon.json

# Put the following into the JSON configuration file
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com"
  ]
}

# Then restart the docker service
sudo systemctl daemon-reload
sudo systemctl restart docker
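
If daemon.json already holds other settings, pasting the snippet over it would lose them. Here is a small sketch of merging the mirror list into whatever is already there; `merge_registry_mirrors` is a hypothetical helper, and it only builds the new JSON text, so writing it back to /etc/docker/daemon.json still has to be done as an administrator.

```python
import json
from typing import Iterable


def merge_registry_mirrors(existing_json: str, mirrors: Iterable[str]) -> str:
    """Add mirrors to the "registry-mirrors" list, keeping all other keys."""
    config = json.loads(existing_json) if existing_json.strip() else {}
    current = list(config.get("registry-mirrors", []))
    for url in mirrors:
        if url not in current:  # avoid duplicate entries
            current.append(url)
    config["registry-mirrors"] = current
    return json.dumps(config, indent=2) + "\n"
```

A side benefit is that `json.loads` rejects a malformed file before docker gets a chance to trip over it at restart.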

2. Proxy for docker pull

sudo mkdir -p /etc/systemd/system/docker.service.d
sudo touch /etc/systemd/system/docker.service.d/proxy.conf

# Put the following into proxy.conf (port 7890 is used because that is the port Clash proxies on)
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:7890/"
Environment="HTTPS_PROXY=http://127.0.0.1:7890/"
Environment="NO_PROXY=localhost,127.0.0.1,.example.com"

# Then restart the docker service
sudo systemctl daemon-reload
sudo systemctl restart docker
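
The drop-in file is plain systemd unit syntax, not JSON, so it is easy to generate. A minimal sketch that renders the same three `Environment=` lines for a given proxy; the function name is my own, and the proxy URL and NO_PROXY list are simply the values used above.

```python
def render_docker_proxy_dropin(proxy_url: str, no_proxy: str) -> str:
    """Build the text of a systemd drop-in that sets proxy variables for dockerd."""
    lines = [
        "[Service]",
        f'Environment="HTTP_PROXY={proxy_url}"',
        f'Environment="HTTPS_PROXY={proxy_url}"',
        f'Environment="NO_PROXY={no_proxy}"',
    ]
    return "\n".join(lines) + "\n"


# Example: the configuration shown above
# print(render_docker_proxy_dropin("http://127.0.0.1:7890/", "localhost,127.0.0.1,.example.com"))
```

After the restart, `sudo systemctl show --property=Environment docker` should list the variables if the drop-in was picked up.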

3. Proxy for containers

# 1. User-level proxy (no administrator needed; log in as your own user)
vim ~/.docker/config.json

# Put the following into the JSON configuration file (port 7890 is used because that is the port Clash proxies on)
{
  "proxies": {
    "default": {
      "httpProxy": "http://127.0.0.1:7890",
      "httpsProxy": "http://127.0.0.1:7890",
      "noProxy": "localhost,127.0.0.1,.example.com"
    }
  }
}

Testing the Docker setup

Ubuntu | Docker — 从入门到实践 (gitbook.io)

docker run --rm hello-world


Overleaf in My Server

Once the Docker environment above is set up, Overleaf can be configured. The network in particular has to work, otherwise the Docker image cannot be pulled.

Config

# Step 1: download the source
git clone https://github.com/overleaf/toolkit.git ./overleaf-toolkit && cd overleaf-toolkit

# Step 2: initialize the configuration
bin/init

# Step 3: bring up the services
bin/up

# Start the services
bin/start

# Stop the services
bin/stop

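Because the service runs on a remote server, the listen address and port have to be changed before it can be reached directly from a local machine.

In ./config/overleaf.rc, change the following fields:

OVERLEAF_LISTEN_IP=xx.xx.xx.xx # IP address of the remote server
OVERLEAF_PORT=80 # default is 80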
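
If the same change has to be repeated (after re-running bin/init, say), the two fields can be set from a script. The sketch below is only an illustration: `set_rc_value` is a made-up helper, the starting values in the demo are what I assume the toolkit writes by default, and 192.0.2.10 and 8080 are placeholders. It relies on GNU sed for `-i`.

```shell
#!/bin/sh
# set_rc_value FILE KEY VALUE: replace the line "KEY=..." in FILE,
# or append "KEY=VALUE" if the key is not present yet.
# VALUE must not contain "|" or "&", which are special in the sed command.
set_rc_value() (
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
)

# Demo on a scratch file; the two starting values are assumed defaults.
rc=$(mktemp)
printf 'OVERLEAF_LISTEN_IP=127.0.0.1\nOVERLEAF_PORT=80\n' > "$rc"
set_rc_value "$rc" OVERLEAF_LISTEN_IP 192.0.2.10   # placeholder address
set_rc_value "$rc" OVERLEAF_PORT 8080              # placeholder port
cat "$rc"
rm -f "$rc"
```

On the real file the call would be `set_rc_value config/overleaf.rc OVERLEAF_PORT 8080` from inside the overleaf-toolkit directory, followed by running bin/up again so the container picks up the new settings.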

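Once the Overleaf container is running, open http://xx.xx.xx.xx:xx/launchpad to register the administrator account. After that, this account can be used to log in to the Overleaf platform.

Online tutorials also describe more elaborate configuration; I will set that up later if the need arises.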

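Written later

The official Overleaf site gives free users only 20 s of compile time, and anything that exceeds the limit cannot be compiled. The only way around that is to pay, and if I ran into it myself I would probably pay too. But I saw online that you can host your own Overleaf on a server, so I wanted to follow a tutorial and try it. I went through it step by step and it ended up working. Maybe I will never actually use this instance, but the tinkering never stops. What if it comes in handy?!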

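Server OS Update, Ubuntu 18.04 -> 20.04

3080server (2025.06.11)

Both 3080Server and 3090Server were running Ubuntu 18.04. Ubuntu 18.04 is near the end of its life and a lot of software has stopped supporting it, which has now caused a big problem: versions of VSCode after 1.85 can no longer connect to the server over Remote-SSH (the newer remote server needs glibc 2.28 or later, and 18.04 ships 2.27). Damn!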

image-20250611173312204

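There are only two ways out:

  • Upgrade the distribution to Ubuntu 20.04 LTS, Debian 10, or RHEL 8, depending on what the server runs
  • Downgrade to VS Code 1.85

VSCode downgrade: from the November 2023 (version 1.85) release page, install version 1.85.2, then turn off automatic updates. See VSCode怎么关闭自动更新 - 知乎

After downgrading VSCode I made do for a while, then decided I could not just give in like that: I had to go and upgrade the lab servers!

The upgrade itself is simple: a single command, then Y or Enter all the way through.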
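
The post does not record which command was used. On Ubuntu the standard tool for moving to the next LTS release is `do-release-upgrade`, so the sketch below assumes that one. It only reads the release number and prints the commands, it does not run them. The helper name `os_version` is made up here.

```shell
#!/bin/sh
# os_version [FILE]: print VERSION_ID from an os-release style file
# (default: /etc/os-release). The file is sourced in a subshell so its
# variables do not leak into the caller.
os_version() (
    . "${1:-/etc/os-release}"
    printf '%s\n' "${VERSION_ID:-}"
)

# Only suggest the upgrade on an 18.04 machine; nothing is installed here.
if [ -r /etc/os-release ] && [ "$(os_version)" = "18.04" ]; then
    echo "Ubuntu 18.04 detected; bring it fully up to date, then upgrade:"
    echo "  sudo apt update && sudo apt full-upgrade"
    echo "  sudo do-release-upgrade"
fi
```

If the upgrade is started over SSH, a dropped connection in the middle can leave the system half upgraded, so running it from the local console or inside screen/tmux is the safer choice.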

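After the upgrade I ran into a few problems:

  • No network connection (this server had the same problem before)
  • NVIDIA driver
  • SSH port

The network problem was the biggest. I won't record how I tracked it down; the tutorial that fixed it is in the screenshot below.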

image-20250611174028654

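I searched for a long time, and this method works!

However, if you change the setting back to false as the tutorial says, the machine is offline again after a reboot and the whole procedure has to be repeated. So in the end I left it at true, and the network now comes up automatically after a reboot!

The NVIDIA driver was the easy one: the server has a GUI desktop, so I just switched the driver in the software update tool.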

image-20250611174209655

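The third problem was the SSH port: after the upgrade it was back to the default port 22, and for the sake of server security it needs to be changed.

This one is actually simple. See 安全加固指南:如何更改 SSH 服务器的默认端口号 - 知乎

Just edit the configuration file.

sudo vim /etc/ssh/sshd_config

But! I spent ages editing ssh_config (the client configuration) instead of sshd_config (the server configuration), and kept wondering why nothing took effect. Wrong file!!! What a waste of effort!!!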
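
To avoid editing the wrong file again, the change can be scripted against an explicit path. A rough sketch, with a made-up helper name `set_sshd_port` and a placeholder port 2222; it rewrites only the first `Port` line (commented or not) and appends one if none exists.

```shell
#!/bin/sh
# set_sshd_port FILE PORT: set the Port directive in an sshd_config style file.
# The first "Port N" or "#Port N" line is rewritten; if there is none,
# "Port PORT" is appended at the end of the file.
set_sshd_port() (
    file=$1 port=$2
    tmp=$(mktemp) || exit 1
    awk -v port="$port" '
        !seen && /^[ \t]*#?[ \t]*Port[ \t]+[0-9]+/ { print "Port " port; seen = 1; next }
        { print }
        END { if (!seen) print "Port " port }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's owner and mode
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Demo on a scratch file, not on /etc/ssh/sshd_config.
cfg=$(mktemp)
printf '#Port 22\nPermitRootLogin no\n' > "$cfg"
set_sshd_port "$cfg" 2222   # 2222 is a placeholder port
cat "$cfg"
rm -f "$cfg"
```

On the server this would be run as root against /etc/ssh/sshd_config, followed by `sudo sshd -t` to check the syntax and `sudo systemctl restart ssh`. Keep the current SSH session open until a login on the new port has been confirmed.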

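3090server (2025.06.19)

This time things went fairly smoothly; the problems were again the network and the driver.

First, before upgrading, I looked at the current network configuration, and it was rather baffling: why was it a Docker bridge network? No idea, never mind!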

image-20250619135329540

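After the upgrade, the indicator light on the NIC the cable had been plugged into stayed off. I don't know whether it just hadn't come up yet or what, so I simply switched to another NIC and the light came on. Then I configured PPPoE with nm-connection-editor / nmtui. The IP is now different from before, but never mind, as long as it works.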

image-20250619140341789

The second problem came up while switching the NVIDIA driver!

image-20250619135639080

# Two commands fix it.
sudo dpkg --configure -a
sudo apt update && sudo apt upgrade

Disk Tools

  • Disk usage with df. df stands for "disk free". It gives the big picture: overall usage per file system (which can usually be read as per disk partition).
user@3080server:~$ df -h
# -h is short for --human-readable.

user@3080server:~$ df -T
# df -T: show the file system type, such as ext4, xfs, or ntfs.

user@3080server:~$ df -hT
# df -hT: both options combined, which is very convenient.
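  • Disk usage with du. du stands for "disk usage" and complements df: when df tells you a partition (say /) is almost full, du finds which directory or file is taking up the space.
du -h /path/to/dir
# Show the size of every subdirectory under the given directory in human-readable form. Deep trees produce a lot of output.

du -sh /path/to/dir
# The most commonly used combination!
# -s (--summarize): print only the total, not each subdirectory.
# -h: human-readable format.

sudo du -sh /mnt/houjinliang
# Example: computes and prints only the total size of the houjinliang directory.

sudo du --max-depth=1 -h /mnt/ | sort -rh
# Size of each first-level directory under /mnt, sorted from largest to smallest.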
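  • ncdu (an interactive disk usage analyzer). Highly recommended; think of it as a much more capable du. It is not installed by default but is well worth installing. It scans the directory you specify and then presents an interactive text interface where you can sort by size, descend into subdirectories, and even delete files.

    # Install
    # Debian/Ubuntu:
    sudo apt update && sudo apt install ncdu
    # CentOS/RHEL:
    sudo yum install ncdu

    # Usage: scan the /mnt directory
    sudo ncdu /mnt
    # The command scans first; once the scan finishes you can explore with the arrow keys.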
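  • Tools for viewing disk and partition layout (lsblk, fdisk). Sometimes capacity is not enough and you want the physical layout: how many disks does the server actually have, and how is each one partitioned?

    lsblk
    # lsblk (list block devices): shows all disks and partitions as a tree, very easy to read.

    sudo fdisk -l
    # sudo fdisk -l (needs sudo): a more traditional and more powerful partitioning tool. -l lists the detailed partition tables, at a lower level than lsblk.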
  • Viewing file and directory sizes

# ll
(py38mmyolo) houjinliang@3080server:~/userdoc/offlinefile$ ll
总用量 26251480
drwxrwxr-x 6 houjinliang houjinliang 4096 1月 10 21:36 ./
drwxrwxr-x 9 houjinliang houjinliang 4096 1月 10 15:59 ../
drwxr-xr-x 5 houjinliang houjinliang 4096 8月 26 2022 coco/
-rw-rw-r-- 1 houjinliang houjinliang 6983030 1月 10 17:00 coco128.zip
-rw-rw-r-- 1 houjinliang houjinliang 48639045 1月 10 16:21 coco2017labels.zip
-rw-rw-r-- 1 houjinliang houjinliang 4372979 1月 10 14:48 curl-8.5.0.tar.gz
-rw-rw-r-- 1 houjinliang houjinliang 12353723 1月 5 16:32 pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
…………

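总用量 26251480
总用量 is "total" in an English locale. It is the total number of disk blocks allocated to the files and directories listed here (subdirectory contents are not included).
26251480 is that number, counted in blocks (GNU ls uses 1024-byte blocks by default).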

# ll -hl
(py38mmyolo) houjinliang@3080server:~/userdoc/offlinefile$ ll -hl
总用量 26G
drwxrwxr-x 6 houjinliang houjinliang 4.0K 1月 10 21:36 ./
drwxrwxr-x 9 houjinliang houjinliang 4.0K 1月 10 15:59 ../
drwxr-xr-x 5 houjinliang houjinliang 4.0K 8月 26 2022 coco/
-rw-rw-r-- 1 houjinliang houjinliang 6.7M 1月 10 17:00 coco128.zip
-rw-rw-r-- 1 houjinliang houjinliang 47M 1月 10 16:21 coco2017labels.zip
-rw-rw-r-- 1 houjinliang houjinliang 4.2M 1月 10 14:48 curl-8.5.0.tar.gz
-rw-rw-r-- 1 houjinliang houjinliang 12M 1月 5 16:32 pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
…………

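The command

ll -hl

is equivalent to

ls -alF -h -l

because on Ubuntu ll is by default an alias for ls -alF (defined in ~/.bashrc). ls options can be combined and repeating -l changes nothing, so plain ll -h produces the same output.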

Options:

  • -l: long listing format, as described above.

  • -h: "human-readable". A very handy option that converts file sizes (and the 总用量 total here) from bytes or blocks into the easier to read K (kibibytes), M (mebibytes), G (gibibytes) and so on.

    • 总用量: same meaning as before, the total disk blocks allocated to the listed files and directories.

    • 26G: this is the work of -h, which turns the unwieldy earlier figure 26251480 (in blocks) into a readable one.

    • Unit conversion:

      • 26251480 blocks × 1024 bytes/block = 26,881,515,520 bytes
      • 26,881,515,520 bytes / 1024 = 26,251,480 KiB
      • 26,251,480 KiB / 1024 ≈ 25,636 MiB
      • 25,636 MiB / 1024 ≈ 25.04 GiB
    • The output shows 26G rather than 25G. The reasons:

      • Rounding: GNU ls rounds human-readable sizes up, never down, so 25.04 GiB is displayed as 26G.
      • Block size: on some systems a block might be taken as 4096 bytes (4 KiB). By that measure, 26251480 blocks × 4096 bytes/block = 107,526,062,080 bytes ≈ 100.14 GiB, which is nowhere near 26G. So the blocks counted here are 1024 bytes each.
      • In short, 26G here means a little over 25 GiB of allocated disk space.
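
The unit conversion is easy to check with integer shell arithmetic. This sketch assumes GNU ls behaviour: the total is counted in 1024-byte blocks, and human-readable sizes are rounded up. `ceil_div` is a made-up helper name.

```shell
#!/bin/sh
# ceil_div A B: integer ceiling of A / B (for positive integers).
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

blocks=26251480                 # "total" from plain ll, in 1024-byte blocks
kib_per_gib=$((1024 * 1024))    # 1 GiB = 1048576 KiB

echo "bytes:        $((blocks * 1024))"
echo "GiB, floor:   $((blocks / kib_per_gib))"
echo "GiB, ceiling: $(ceil_div "$blocks" "$kib_per_gib")"
```

The three lines print 26881515520, 25 and 26: the true value sits just above 25 GiB, and rounding up gives the 26G that ll -h shows.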
  • To see the total size of the current directory together with the size of each file or directory one level below it

(py38mmyolo) houjinliang@3080server:~$ du -h --max-depth=1
6.5M ./.config
8.0K ./.conda
1.1G ./.vscode-server
12G ./cuda-11.3
86G ./userdoc
8.0K ./.gnupg
16K ./.ssh
8.0K ./.nv
2.7G ./.cache
24G ./miniconda3
125G .
  • Size of a single directory
user@3080server:/mnt$ sudo du -sh /mnt/bailuqian/
16K /mnt/bailuqian/

user@3080server:/mnt$ sudo du -sh /mnt/chenzhengtao/
119G /mnt/chenzhengtao/
  • Directory sizes, sorted
sudo du --max-depth=1 -h /mnt/ | sort -rh
  • Script: print the size of every directory directly under a target directory
#!/bin/bash

# ==============================================================================
# Script name: calculate_dir_sizes.sh
# Description: compute the size of each first-level subdirectory of a directory.
# Author: Gemini
# Date: 2025-09-04
# ==============================================================================

# Target directory to scan
TARGET_DIR="/mnt"

# --- Main logic ---

# Check that the target directory exists
if [ ! -d "$TARGET_DIR" ]; then
    echo "Error: directory ${TARGET_DIR} does not exist."
    exit 1
fi

echo "Computing the size of every directory under [${TARGET_DIR}]..."
echo "--------------------------------------------------"

# Loop over every entry directly under the target directory.
# find selects directories only, so regular files are skipped;
# -mindepth 1 -maxdepth 1 restricts it to first-level subdirectories.
find "${TARGET_DIR}" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r dir; do
    # sudo du -sh computes the total size of each directory
    # -s (summarize) prints only the total
    # -h (human-readable) uses readable units (K, M, G)
    sudo du -sh "${dir}"
done

echo "--------------------------------------------------"
echo "Done."
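
The script above prints directories in whatever order find returns them. If a largest-first listing is wanted, one possible variant hands all directories to du at once and sorts the result. This is a sketch, not a replacement for the script: `dir_sizes_sorted` is a made-up name, and it assumes the GNU versions of find, xargs, du and sort.

```shell
#!/bin/sh
# dir_sizes_sorted DIR: print the size of each first-level subdirectory of DIR,
# largest first. -print0 / -0 keep names with spaces intact; -r makes xargs
# skip du entirely when there are no subdirectories.
dir_sizes_sorted() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -r du -sh 2>/dev/null |
        sort -rh
}

# Run only when a directory is given on the command line.
if [ "$#" -gt 0 ]; then
    dir_sizes_sorted "$1"
fi
```

Saved as a file, it would be run with something like `sudo sh dir_sizes_sorted.sh /mnt`; without root, directories that cannot be read are silently skipped or undercounted because the errors from du are discarded.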

瞧瞧我对服务器干了些什么!
https://blog.cosmicdusty.cc/post/Tools/WorkingWithGPUServer/
Author
Murphy
Published
November 1, 2023
License