Wednesday, September 12, 2007

(Repost) Solaris Devfs

http://blogs.sun.com/szhou/entry/solaris_devfs

Solaris originated from BSD and SVR4 UNIX. Over the years, many enhancements have been made to address business needs. One area of big change is the I/O framework and device name management.
A traditional UNIX kernel configures all devices at boot time. Device access is supported via two indexed arrays, bdevsw[] and cdevsw[], for block and character devices, respectively. The array elements contain references to driver entry points compiled into the kernel. Applications access a device by opening a device special file, created via the mknod(2) syscall. A device special file has a type (block or char) and a device number (dev_t). The type tells the kernel whether to use bdevsw[] or cdevsw[]. The device number has two parts, major and minor. The major number is used to index into the arrays, and the minor number is used only by the driver, typically to determine which device instance to access.
Solaris modified and extended the model in many ways.

  • bdevsw[] and cdevsw[] are merged into a single array, devopsp[], indexed by a major number common to both block and character drivers. Each element of devopsp[] references a driver's dev_ops structure, which contains the driver entry points for device autoconfiguration (probe, attach, detach), bus nexus oriented operations (bus_ops), and block/char operations (cb_ops).
  • All drivers are loadable kernel modules. The modules are loaded on-demand. During system startup, the kernel only loads those driver modules required to boot the system. As a result, a normal system boot may not initialize all hardware attached to the system. To initialize all hardware at boot time, a reconfiguration boot (boot -r) is required.
  • Devices are represented in the kernel by a tree of device information (struct dev_info) nodes. Inner (nexus) nodes represent bus controllers and adaptors, while leaf nodes represent devices. Leaf nodes bind to "leaf" drivers, which implement cb_ops to handle I/O requests. Inner nodes bind to bus nexus drivers, which implement bus_ops to satisfy leaf driver requests or pass the requests on to the parent bus or the hardware platform. This design allows generic leaf drivers to be written without knowing the details of the transport. For example, the SCSI disk driver (sd) can be used to control many types of disks, such as SCSI-2, USB storage, Fibre Channel, and ATAPI CD-ROM.
  • A private namespace, /devices, is introduced to mirror device names in the Open Boot PROM (OBP) defined in the IEEE 1275 standard. The namespace reflects the physical topology of I/O devices and bus interconnects. This namespace is controlled by a filesystem named "devfs", first introduced in Solaris 10. A key feature of devfs is that a filesystem lookup operation actually drives configuration of the specific device instance corresponding to the pathname. For example, # ls /devices/pci@0,0/pci-ide@11,1/ide@1/sd@0,0:a would cause the ATAPI CD-ROM drive to attach even if it is not currently configured in the kernel.
  • The public names in /dev are symbolic links to a pathname in /devices. The /dev names are created at Solaris Install time by devfsadm(1M). When new devices are added, the /dev name space is updated via devfsadmd(1M), the daemon version of devfsadm.
The current Solaris I/O framework is flexible and scales well from a single CPU system to high-end servers with 100+ CPUs and 1000+ devices. In addition, I/O devices can be reconfigured dynamically without rebooting the system. This functionality is also referred to as Dynamic Reconfiguration or hotplugging. In a future blog entry, I hope to explain in more detail the inner workings of devfs and the kernel device tree.

Friday, August 24, 2007

Notable magic numbers


Many computer processors, operating systems, and debuggers make use of magic numbers, especially as a magic debug value.

  • 0xABADBABE ("a bad babe") is used by Apple as the "Boot Zero Block" magic number.
  • 0xBAADF00D ("bad food") is used by Microsoft's LocalAlloc(LMEM_FIXED) to indicate uninitialised allocated heap memory.
  • 0xBADDCAFE ("bad cafe") is used by 'watchmalloc' in OpenSolaris to mark allocated but uninitialized memory.
  • 0xCAFEBABE ("cafe babe") is used by Mach-O ("Fat binary" on both 68k and PowerPC) to identify object files, and by the Java programming language to identify Java bytecode class files.
  • 0xDEADBEEF ("dead beef") is used by IBM RS/6000 systems and by Mac OS on 32-bit PowerPC processors as a magic debug value. On Sun Microsystems' Solaris, it marks freed kernel memory.
  • 0xDEFEC8ED ("defecated") is the magic number for OpenSolaris core dumps.
  • 0xFEEDFACE ("feed face") is used as a header for Mach-O binaries, and as an invalid pointer value for 'watchmalloc' in OpenSolaris.
kmem_flags
The setting of kmem_flags (a kernel global variable) can be very useful in
debugging problems. Here are some of the interesting values relating to
kmem_flags (from kmem_impl.h):
/*
* kernel memory allocator: implementation-private data structures
*/
#define KMF_AUDIT 0x00000001 /* transaction auditing */
#define KMF_DEADBEEF 0x00000002 /* deadbeef checking */
#define KMF_REDZONE 0x00000004 /* redzone checking */
#define KMF_CONTENTS 0x00000008 /* freed-buffer content logging */
#define KMF_STICKY 0x00000010 /* if set, override /etc/system */
#define KMF_NOMAGAZINE 0x00000020 /* disable per-cpu magazines */
#define KMF_FIREWALL 0x00000040 /* put all bufs before unmapped pages */
#define KMF_LITE 0x00000100 /* lightweight debugging */
#define KMF_HASH 0x00000200 /* cache has hash table */
#define KMF_RANDOMIZE 0x00000400 /* randomize other kmem_flags */
#define KMF_BUFTAG (KMF_DEADBEEF | KMF_REDZONE)
#define KMF_TOUCH (KMF_BUFTAG | KMF_LITE | KMF_CONTENTS)
#define KMF_RANDOM (KMF_TOUCH | KMF_AUDIT | KMF_NOMAGAZINE)
#define KMF_DEBUG (KMF_RANDOM | KMF_FIREWALL)
#define KMEM_STACK_DEPTH 15
#define KMEM_FREE_PATTERN 0xdeadbeefdeadbeefULL
#define KMEM_UNINITIALIZED_PATTERN 0xbaddcafebaddcafeULL
#define KMEM_REDZONE_PATTERN 0xfeedfacefeedfaceULL
#define KMEM_REDZONE_BYTE 0xbb
Setting kmem_flags to 0xf (KMF_AUDIT | KMF_DEADBEEF | KMF_REDZONE | KMF_CONTENTS) will provide the necessary information

Thursday, August 16, 2007

Virtualization

Xen Beginner's Guide

http://www.linuxsir.org/main/?q=node/188#1 

 

LDOM (Repost)

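      Most readers will already be familiar with virtualization and partitioning technology. Hardware partitioning and System Domains have been available since the Sun Fire[TM] 3800 server; at that level the partitioning granularity is coarse, since each system domain needs at least one CPU/MEM board and one I/O board. Solaris 10 introduced Zones (also called Containers), which let you create multiple logically independent operating system instances on one system. Each instance runs its own set of programs without interfering with the others. Even a single-CPU system can host multiple zones, as long as it has enough resources to create and run that many instances.

      With the release of Solaris 10 11/06, a new virtualization and partitioning technology arrived: Logical Domains (hereafter LDOMs). How do System Domains, LDOMs, and Zones relate to one another? Here is a simple diagram: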

 Zone (Container)    Operating System
 LDOMs               Firmware Level
 System Domains      Hardware Platform




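      The diagram above leaves out many details; its main purpose is to highlight how LDOMs relate to Zones and System Domains. It shows that LDOMs are built on top of the firmware, which means LDOMs need support not only from the operating system but also from the firmware. What kind of firmware support is required?

      LDOMs implement virtual partitioning by adding a piece of software called the hypervisor to the firmware (flash PROM) that sits between the operating system and the hardware. At present, the only platforms that support this hypervisor software are the Sun Fire[TM] T1000 and Sun Fire[TM] T2000 systems (that is, servers of the sun4v platform architecture). This is why LDOMs can currently be used only on Sun Fire[TM] T1000 and Sun Fire[TM] T2000 systems.

      To communicate correctly with the hypervisor, the operating system must provide the corresponding support. At present only Solaris 10 11/06 supports the hypervisor (related patches are also required). There are no plans to support sun4v in Solaris 8 or Solaris 9 (with Solaris 10 being so capable, why cling to Solaris 8 and Solaris 9?), so the guest OS installed in an LDOMs virtual partition must also be Solaris 10 11/06 (many readers may have assumed that LDOMs support different versions of Solaris; it now appears that they do not).

      The Sun Fire[TM] T1000 and Sun Fire[TM] T2000 servers have 8 cores with 4 threads per core, and LDOMs can assign each thread to its own virtual partition.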

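      To implement LDOMs, the following conditions must be met:

      Once the system meets the conditions above, LDOMs can be configured. To manage LDOMs, a control domain (also called the primary domain) must be created first; it is somewhat like the controller on some servers. Only after the control domain has been set up can you start creating other logical domains.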

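      LDOMs can be divided into the following categories according to their roles:

  1. Control domain -- as mentioned above, it creates and manages the other logical domains and services, and communicates with the hypervisor.
  2. Service domain -- a logical domain that provides services such as virtual network switching and virtual disk service to other logical domains.
  3. I/O domain -- has a direct physical connection to input/output devices, such as PCI-E cards or network devices.
  4. Guest domain -- uses the services provided by the service domain and the I/O domain, and is managed by the control domain.

      LDOMs support Dynamic Reconfiguration of CPUs. For memory and other components, LDOMs provide Delayed Reconfiguration, meaning that a change takes effect only after the next reboot.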

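      A few notes:

  1. The operating system of every LDOM must be Solaris 10 11/06, with the related patches installed.
  2. If the control domain fails, all other LDOMs are affected. The control domain is a single point of failure (SPOF).
  3. If an LDOM that provides services fails, every LDOM that uses its services is affected. The service LDOM is a SPOF.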

 

For more information, see "What's New in the Solaris 10 11/06 Release" and the Sun BluePrints document "Beginners Guide to LDoms: Understanding and Deploying Logical Domains".

 

Tuesday, July 24, 2007

Register information is for SPARC systems

 

================================
The following register information is for SPARC systems.
================================
    Table 15-2 Sun-4 Registers
    $g0-$g7    Global registers
    $o0-$o7    "out" registers
    $i0-$i7    "in" registers
    $l0-$l7    "local" registers
    $fp        Frame pointer, equivalent to register $i6
    $sp        Stack pointer, equivalent to register $o6
    $y         Y register
    $psr       Processor state register
    $wim       Window invalid mask register
    $tbr       Trap base register
    $pc        Program counter
    $npc       Next program counter
    $f0-$f31   FPU "f" registers
    $fsr       FPU status register
    $fq        FPU queue