Sunday, April 29, 2007

Multipath Problems with MD Devices on Red Hat

After creating a filesystem on an MD device, I/O errors would sometimes appear during I/O (mainly writes), after which the filesystem was remounted read-only. The problem occurred repeatedly during testing and at system go-live. Investigation showed that the filesystem had been created before the MD virtual device was. Below is a detailed record of how the problem appeared, how it was handled, and the underlying cause.
I. Reproducing the Error
The steps below follow the same process as the system migration of March 11: a new 5 GB LUN was mapped from the 6920 array to the host, then the sequence of creating the MD device, creating the filesystem, and copying files was repeated. The final dmesg output contained the same log entries as on migration day.
1. Partition the LUN (the sd devices had already been scanned in):
[root@gddb2 ~]# ./getwwn.sh

3600015d00004d30000000000000007e5 /dev/sda
3600015d00004d30000000000000007e5 /dev/sdd
3600015d00004d300000000000000081e /dev/sdb
3600015d00004d300000000000000081e /dev/sde
3600015d000055c000000000000000a18 /dev/sdc
3600015d000055c000000000000000a18 /dev/sdf
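getwwn.sh is a local helper that prints each sd device together with its LUN WWID, so that the two paths to each LUN can be paired up. Its source is not part of this post; the following is only a minimal sketch of what such a script might look like, assuming the scsi_id utility shipped with udev on RHEL 4:

#!/bin/bash
# Hypothetical reconstruction of getwwn.sh: print the WWID of every sd
# device; identical WWIDs mean different paths to the same LUN.
for dev in /sys/block/sd*; do
    name=$(basename $dev)
    wwid=$(/sbin/scsi_id -g -u -s /block/$name 2>/dev/null)
    [ -n "$wwid" ] && echo "$wwid /dev/$name"
done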
[root@gddb2 ~]# fdisk -l /dev/sdb

Disk /dev/sdb: 5368 MB, 5368709120 bytes
166 heads, 62 sectors/track, 1018 cylinders
Units = cylinders of 10292 * 512 = 5269504 bytes

Device Boot Start End Blocks Id System
/dev/sdb1 1 1018 5238597 83 Linux
[root@gddb2 ~]# fdisk -l /dev/sde

Disk /dev/sde: 5368 MB, 5368709120 bytes
166 heads, 62 sectors/track, 1018 cylinders
Units = cylinders of 10292 * 512 = 5269504 bytes

Device Boot Start End Blocks Id System
/dev/sde1 1 1018 5238597 83 Linux
Note: since sdb and sde are two paths to the same LUN, partitioning sdb alone is sufficient.
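One detail worth calling out: after partitioning sdb, the kernel must also re-read the partition table on the second path, or /dev/sde1 may not exist when mdadm asks for it. A sketch, assuming partprobe from the parted package (blockdev from util-linux works too):

# Make the new partition visible on the second path to the LUN:
partprobe /dev/sde
# alternatively:
blockdev --rereadpt /dev/sde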
2. Create an ext3 filesystem on sdb1
[root@gddb2 ~]# mkfs.ext3 /dev/sdb1
mke2fs 1.35 (28-Feb-2004)
max_blocks 1341080576, rsv_groups = 40927, rsv_gdb = 319
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
655360 inodes, 1309649 blocks
65482 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1342177280
40 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736

Writing inode tables: done
inode.i_blocks = 20424, i_size = 4243456
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 37 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

3. Create the MD virtual device:
[root@gddb2 ~]# mdadm -C /dev/md3 -l multipath -n 2 /dev/sdb1 /dev/sde1
mdadm: /dev/sdb1 appears to contain an ext2fs file system
size=5238596K mtime=Thu Jan 1 08:00:00 1970
mdadm: /dev/sde1 appears to contain an ext2fs file system
size=5238596K mtime=Thu Jan 1 08:00:00 1970
Continue creating array? y
mdadm: array /dev/md3 started.
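Before using the array it is worth confirming that both paths really are active; either of these standard views will do (output omitted here):

# Check the multipath array state:
cat /proc/mdstat
mdadm --detail /dev/md3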

4. Mount the filesystem and copy files onto it:
[root@gddb2 ~]# mount /dev/md3 /md3/
[root@gddb2 ~]# cp /data/tablespace/* /md3/
cp: writing `/md3/mtamail21.dbf': No space left on device
cp: writing `/md3/mtamail22.dbf': No space left on device
cp: writing `/md3/mtamail23.dbf': No space left on device
cp: writing `/md3/mtamail31.dbf': No space left on device
cp: writing `/md3/mtamail32.dbf': No space left on device
cp: writing `/md3/mtamail33.dbf': No space left on device
cp: writing `/md3/mtamanager01.dbf': No space left on device
5. The cp commands above reported only that the filesystem was full; no I/O error was returned to cp. The kernel log from dmesg tells a different story:
[root@gddb2 ~]# dmesg
... (irrelevant output omitted)
md: export_rdev(sdb1)
md: bind<sdb1>
md: bind<sde1>
multipath: array md3 active with 2 out of 2 IO paths
kjournald starting. Commit interval 5 seconds
EXT3 FS on md3, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: initialized (dev md3, type ext3), uses xattr
attempt to access beyond end of device
md3: rw=1, want=10477064, limit=10477056
printk: 7 messages suppressed.
Buffer I/O error on device md3, logical block 1309632
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477072, limit=10477056
Buffer I/O error on device md3, logical block 1309633
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477080, limit=10477056
Buffer I/O error on device md3, logical block 1309634
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477088, limit=10477056
Buffer I/O error on device md3, logical block 1309635
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477096, limit=10477056
Buffer I/O error on device md3, logical block 1309636
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477104, limit=10477056
Buffer I/O error on device md3, logical block 1309637
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477112, limit=10477056
Buffer I/O error on device md3, logical block 1309638
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477120, limit=10477056
Buffer I/O error on device md3, logical block 1309639
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477128, limit=10477056
Buffer I/O error on device md3, logical block 1309640
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477136, limit=10477056
Buffer I/O error on device md3, logical block 1309641
lost page write due to I/O error on md3
attempt to access beyond end of device
md3: rw=1, want=10477144, limit=10477056
attempt to access beyond end of device
md3: rw=1, want=10477152, limit=10477056
attempt to access beyond end of device
md3: rw=1, want=10477160, limit=10477056
attempt to access beyond end of device
md3: rw=1, want=10477168, limit=10477056
attempt to access beyond end of device
md3: rw=1, want=10477176, limit=10477056
attempt to access beyond end of device
md3: rw=1, want=10477184, limit=10477056
attempt to access beyond end of device
md3: rw=1, want=10477192, limit=10477056
This is exactly the same error signature as during the March 11 migration.
II. Root Cause Analysis
Start with the meaning of the key error messages (these are the originals from the March 11 incident, on device md1):
attempt to access beyond end of device
md1: rw=1, want=419424776, limit=419424768
Buffer I/O error on device md1, logical block 52428096
Roughly: a write (rw=1) attempted to access sectors beyond the end of the device, and the third line shows that the offending logical block was 52428096.
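Note that want and limit are counted in 512-byte sectors, while the logical block number is in 4 KB filesystem blocks (8 sectors per block), so the two scales line up exactly:

echo $(( 419424768 / 8 ))   # 52428096: the device really holds 52428096 4 KB blocks
echo $(( 52428096 * 8 ))    # 419424768 == limit: logical block 52428096 starts
                            # at the first sector past the end of the device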
Yet the filesystem on md1 believes it has 52428119 blocks, as dumpe2fs shows:
[root@gddb2 tablespace]# dumpe2fs /dev/md1
dumpe2fs 1.35 (28-Feb-2004)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 76ee30b6-aac9-484a-8b76-5b76a7c31253
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode filetype needs_recovery sparse_super large_file
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 26214400
Block count: 52428119
Reserved block count: 2621405
Free blocks: 38174314
Free inodes: 26214307
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16384
Inode blocks per group: 512
Filesystem created: Wed Mar 1 18:34:53 2006
Last mount time: Mon Mar 13 16:48:43 2006
Last write time: Mon Mar 13 16:48:43 2006
Mount count: 8
Maximum mount count: 32
Last checked: Wed Mar 1 18:34:53 2006
Check interval: 15552000 (6 months)
Next check after: Mon Aug 28 18:34:53 2006
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: f96d19fa-fb5c-43e4-a6d7-410470ecfc0d
Journal backup: inode blocks
So the problem lies in the twenty-odd logical blocks near the end of the filesystem, and those blocks do not actually exist on the md1 device. The cause: the ext3 filesystem was created first, with no space held back for the MD superblock (an MD device stores its multipath metadata in a small area at the end of the physical device; see the appendix). When the MD device was then created, its superblock claimed anywhere from a few KB to a few MB (typically 64 KB) at the tail of the physical device, but the filesystem knew nothing of this and kept allocating blocks according to its original, pre-MD view of the space. The block pointers near the end therefore run past the device, leaving the last few logical blocks dangling, and any write to them produces exactly the "attempt to access beyond end of device" errors shown above.
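A device in this state can be caught before any write fails by comparing the filesystem's recorded size with the device's actual size. A minimal sketch, assuming a 4 KB filesystem block size (blockdev --getsize reports 512-byte sectors):

# Compare the filesystem's idea of its size with the real device size.
FS_BLOCKS=$(dumpe2fs -h /dev/md1 2>/dev/null | awk -F: '/^Block count/ {gsub(/ /, "", $2); print $2}')
DEV_BLOCKS=$(( $(blockdev --getsize /dev/md1) / 8 ))   # sectors -> 4 KB blocks
echo "filesystem: $FS_BLOCKS blocks, device: $DEV_BLOCKS blocks"
[ "$FS_BLOCKS" -gt "$DEV_BLOCKS" ] && echo "WARNING: filesystem extends past the device"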
Take the 5 GB LUN from Part I as an example: if the ext3 filesystem is created after the MD device, it ends up with slightly fewer logical blocks (17 in this case) than with the original order (filesystem first, MD device second):
[root@gddb2 ~]# umount /md3/
[root@gddb2 ~]# dumpe2fs /dev/sdb1>sdb1
[root@gddb2 ~]# mkfs.ext3 /dev/md3
mke2fs 1.35 (28-Feb-2004)
max_blocks 1341063168, rsv_groups = 40926, rsv_gdb = 319
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
655360 inodes, 1309632 blocks
65481 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1342177280
40 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736

Writing inode tables: done
inode.i_blocks = 20424, i_size = 4243456
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
[root@gddb2 ~]# dumpe2fs >md3
dumpe2fs 1.35 (28-Feb-2004)
Usage: dumpe2fs [-bfhixV] [-ob superblock] [-oB blocksize] device
[root@gddb2 ~]# dumpe2fs /dev/md3>md3
dumpe2fs 1.35 (28-Feb-2004)
[root@gddb2 ~]# diff md3 sdb1 | more
3c3
> Filesystem UUID: 48a10ffb-d7ef-41d5-9ef6-a277f0c191f3
12,14c12,14
> Block count: 1309649
> Reserved block count: 65482
> Free blocks: 1278313
23c23
> Filesystem created: Mon Mar 13 18:07:59 2006
25c25
> Last write time: Mon Mar 13 18:07:59 2006
27,28c27,28
> Maximum mount count: 37
> Last checked: Mon Mar 13 18:07:59 2006
30c30
> Next check after: Sat Sep 9 18:07:59 2006
37c37
> Directory Hash Seed: 51a94659-8e0d-468d-955c-b239aef0abff
283c283
< Group 39: (Blocks 1277952-1309631)
Comparing the two dumps (the ">" lines in the diff carry sdb1's values; the one surviving "<" line is md3's), the filesystems created the two ways differ by exactly 1309649 - 1309632 = 17 logical blocks. Since MD reserves the last 64 KB of the device for its superblock, and md3's block size is 4096 bytes (4 KB), the superblock accounts for 64 KB / 4 KB = 16 logical blocks, close to the observed difference; the version-0.90 MD superblock is in fact written at a 64 KB-aligned offset at least 64 KB from the end of the device, so the reserved tail ranges from 64 KB to just under 128 KB, which accounts for the 17th block. And in the Part I test, the first failing logical block was exactly 1309632 (Buffer I/O error on device md3, logical block 1309632), confirming the actual size of the block device once the MD device has been created: 1309632 logical blocks.
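The same rule can be checked against the fdisk numbers from Part I (sdb1 is 5238597 blocks of 1 KB). Under the version-0.90 placement rule, usable size = (device size rounded down to a 64 KB multiple) - 64 KB; the arithmetic, as a quick sketch:

echo $(( 5238597 / 64 * 64 - 64 ))   # 5238528 KB usable on the MD device
echo $(( 5238528 / 4 ))              # 1309632 4 KB blocks: matches mkfs on /dev/md3
echo $(( 5238597 / 4 ))              # 1309649 4 KB blocks when ext3 sits on raw sdb1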

III. Solution and Recommendation:
Create the ext3 filesystem only after partitioning the LUN and creating the MD virtual device, as in the sketch below.
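For the 5 GB LUN from Part I, the safe sequence looks like this (a sketch reusing the same device names):

# 1. Partition one path to the LUN, then refresh the other path:
fdisk /dev/sdb                # create sdb1
partprobe /dev/sde            # make sde1 visible on the second path
# 2. Build the multipath MD device first:
mdadm -C /dev/md3 -l multipath -n 2 /dev/sdb1 /dev/sde1
# 3. Only now create the filesystem, so it is sized to the MD device:
mkfs.ext3 /dev/md3
mount /dev/md3 /md3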

1 comment:

yin said...

Great hands-on experience; recommended reading for administrators running similar setups.