Wednesday October 04, 2006 Bob Sneed
The problem is probably due to what such problems are usually due to.
For the most part...believes root cause analysis is a waste of time and money.
Recommends Patterns and Anti patterns books.
AntiPatterns (Stuff we know doesn't work)
Description
Symptom
Consequences
How to fix it
Collecting Data
No data
Configuration
Appdata
Irrelevant
Mountains of Data
"Climb Every Mountain" -- Irrational Change Control Enforcement
"Monkey in the Middle" -- Uncooperative party between customer and 'right answer'
"Dueling Engagement Models"
More than I can succinctly and usefully indicate
The customer is not always right! (but we knew that)
The customer should know what is going on but the customer dictating what they think the problem
Translations
Customer will not give us requested data
"We do not have a good relationship"
"The customer will not follow our advice"
no good trust relationship
"customer will not cooperate, we've done all we can do!"
work on soft skills
Summary: work on the relationship
Technical Antipatterns
Problems on EOL
(more I can't condense quickly)
Data Collection
Start with business problem
Check for bogus problem
Check config
Check resource
Who the Heck was William Dawes?
See Harvard Business Review 12/2005
Bob made a slide using his kids to discuss bug processing/submission.
Pass the Ball or Carry it?
Referral
when, how, to whom
Collab
when, how, with whom
Experts look for two things:
patterns: things you have seen before
can be a slide or a white-paper
things you have not seen before
no substitute for expertise
Bob's take:
Lowest cost solution -- apply expert early (and often, but only for new issues)
Define solution, pass to assistant
Think Doctor and Physician's Assistant
A problem that can be understood can be addressed by the PA
Get an expert negotiator to convince the customer that "thing" has to happen: $400K in lab work is not required to debug a simple config problem on a $1M system.
Resources
opensolaris.org
blogs
"Communities"
Computer Measurement Group
www.cmg.org
Oracle:
oaktable.net
hotsos.com
Cool Tools: cooltools.sunsource.net
GCCFSS gcc for sparc
Cool Stack Pre-compiled and SPARC-Optimized binaries
And More!
Books:
Solaris Internals
Solaris Performance and Tools
"Soft Skills"
"Getting to Yes"
"Optimizing Oracle Performance" Chapters 1-3
More books, not writing them all down ... Work on relationship skills (just do it)
Final Thoughts
Play position, pass the ball
Improve Skills
Work with Communities
Stay informed
Participate
Know your limits
Technorati Tags: cec2006
A discussion of Honeycomb and Thumper
It's all about the DATA.
Can't do anything without data.
Building on standardized components.
Data management
ID, Virtualize, Secure, Integrate
EVERYTHING
ID -- Build Trust
Virtualize -- Mask Complexity, Drive up Utilization and Performance
Sun has end-to-end virtualization available
Secure -- Security breach...tape lost, laptop stolen
Security everywhere...really everywhere...new T10000
Automatic encryption
Integrate -- Too many choices, Lack of Standards, Complex Interdependencies
We provide customer ready systems --- pre-built, racked, cabled, integrated and shipped
SAM-FS and QFS
Enterprise-Class NAS
Standard components and software...integrated!
Honeycomb...
Long term storage repositories (years)
Not OLTP, ERP or live database
Scales to billions of objects
Standard components and software...integrated!
Storage API
Programmable storage
Extensible
Horizontal scaling
Reliability through self healing
large scale repository applications
Deferred service model (fail in place...like Sun Grid, schedule work when convenient)
Expect to deploy multi-petabyte systems
RAIN -- Redundant Array of Inexpensive Nodes
Backup/NDMP
Thumper...(x4500)
High performance 4 way server, 16GB ram
Is it a server or storage? Both, it depends: if you install an application, it is a server.
ZFS and all related capabilities
HPC, D2D2T, Infiniband Server, iSCSI Target, SAM-FS, NFS/CIFS
information available online
Richard Elling, PAE
Performance, Availability & Architecture Engineering
He remembered to give the session recording start information. So far this is only the second session I have attended where the presenter did so.
(We forgot to do it, bad presenters)
Internal changes to the process in the past few years have made a big difference in both pre-release quality and post-release support.
MTBF, Heat, Moving Parts
Starting with some
MTBSI == Mean Time Between Systems Interruption
Similar to the previous usage of MTBF, but a Failure no longer has to mean a System/Service Interruption
More Parts --> lower MTBF (Bad)
[1] Integration == fewer parts == Higher MTBF
Most common failures: Fans, Power Supplies, Disks, Memory
(glad he said that, I say it too...woot validation)
Power Supply -- Built in surge suppressor
Designed for 1M hours MTBF (Niagara and Galaxy vs PC at ~83K hours)
Use fewer, better, more expensive power supplies
Fans -- Fail too often (moving parts)
Hard to implement fans in small systems
System policies can affect MTBSI...e.g. T1000 firmware starts a shutdown on fan failure, not on system temperature
This may not be the preferred method but it is in the spec.
Heat Kills, run cool, run for a long time
Disk numbers from Seagate, rule of thumb
+15deg C == MTBF / 2
Smaller disks run cooler and have faster seek (Seagate website)
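The +15deg C rule of thumb above can be written as a one-line derating formula. This is only a sketch: the function name and the reference MTBF are made up for illustration, and treating the rule as a smooth exponential between 15 degree steps is my assumption, not something from the slides.

```python
def derated_mtbf(mtbf_ref_hours: float, temp_rise_c: float,
                 halving_step_c: float = 15.0) -> float:
    """Rule of thumb from the talk: each +15 deg C halves the MTBF.

    Assumption: the halving is applied continuously (exponentially)
    for temperature rises that are not a multiple of 15 deg C.
    """
    return mtbf_ref_hours / 2 ** (temp_rise_c / halving_step_c)


# Hypothetical reference MTBF in hours, not a vendor figure.
reference_mtbf = 1_000_000.0
for rise in (0, 15, 30):
    print(f"+{rise} deg C -> {derated_mtbf(reference_mtbf, rise):,.0f} hours")
```

A disk running 30 degrees hotter than its reference point would, by this rule, be expected to last a quarter as long.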
Boot Disk Trends
Nearly 100% of Sun's customers mirror boot disks in the DC
Software RAID is popular, but built-in HW RAID is becoming more popular; ZFS boot disks are coming
Long term: solid state, CF slot in CP3010 and CP3020 also Moore's law
SAS vs Parallel SCSI...system now knows if the disk disappears before it tries to access it.
In case you weren't sure: More parts == lower MTBF
See 1
Memory
Memory Page Retirement -- FMA
What about mirrored memory?
Will anyone use it? Hey double memory sales!
(Probably only in the highest RAS demand environments)
See 1
Processors
More transistors -- more self test circuits -- More redundancy
See 1
OPL
Mainframe class reliability in a Solaris system
Software
Solaris FMA (Fault Management Architecture)
(Really cool: useful messages, diagnosis codes. Get a system and break things to see it. You could pull a disk to see an example, however this is not Richard's top choice in terms of display)
SMF
Before: Panic
SMF: Restart
Ability to define relationships
(See my previous posts and/or Liane Praza's Weblog)
ZFS
Data reliability...we know if we get corrupted data before we try to use it.
(This is also cool, I have been running ZFS at home for at least 9 months, although I don't own any huge storage hardware for the complex case)
Data recovery time based on data space used, not size of FS.
Not yet fully integrated with FMA, depends on driver, disks and system implementation.
Virtualization
Good
Really fast boot! == lower MTTR
Guest OS portability
Hides faults from OS
Bad
Hides faults from OS (If the OS knows there is a fault it can respond appropriately)
Scheduling...Real Time app or time-sensitive devices
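Why a fast boot matters: the usual steady-state availability ratio is uptime / (uptime + repair time), so cutting MTTR raises availability even if the interruption rate stays the same. A minimal sketch; the function name and the hour figures are hypothetical and not from the talk.

```python
def availability(mtbsi_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up.

    mtbsi_hours: mean time between system interruptions
    mttr_hours:  mean time to restore service after an interruption
    """
    return mtbsi_hours / (mtbsi_hours + mttr_hours)


# Hypothetical numbers: same interruption rate, slow boot vs fast boot.
slow_boot = availability(mtbsi_hours=10_000.0, mttr_hours=0.5)
fast_boot = availability(mtbsi_hours=10_000.0, mttr_hours=0.05)
print(f"slow boot: {slow_boot:.6f}  fast boot: {fast_boot:.6f}")
```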
Fail in Place
Possibly a better term "Deferred Repair Strategy"
(We use this in Grid...with 1000s of nodes do you really need a service call for a single failure? We say no)
From here we go from MTBF to MTBSI, non-service interrupting failure
Scrubbing
We scrub just about everything, looking for faults and corruption before we try to use bad data.
Reduction in Parts
X4100 Rack
78 Power supplies
E25K
6 Power supplies
Blade 8000
6 Power Supplies
(Similar for fans)
Why then would you deploy a rack of X4100s instead of Blades, unless you have a specific need?
Get More Information
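A rough sketch of why the part count matters. It assumes independent supplies with a constant failure rate, so the expected time between supply failures anywhere in the configuration is the per-unit MTBF divided by the number of units. The 1M hour figure is the design MTBF mentioned earlier in these notes; the function name is made up for illustration, and a failed redundant supply is a service event, not necessarily a system interruption.

```python
def hours_between_failures(unit_mtbf_hours: float, unit_count: int) -> float:
    """Expected hours between failures of any one of unit_count parts.

    Assumption: parts fail independently at a constant rate, so the
    failure rates simply add up across the configuration.
    """
    if unit_count < 1:
        raise ValueError("unit_count must be at least 1")
    return unit_mtbf_hours / unit_count


PSU_DESIGN_MTBF_HOURS = 1_000_000.0  # design figure quoted in the talk

# Power supply counts as listed in the notes above.
configs = {"X4100 rack": 78, "E25K": 6, "Blade 8000": 6}
for name, supplies in configs.items():
    hours = hours_between_failures(PSU_DESIGN_MTBF_HOURS, supplies)
    print(f"{name}: {supplies} supplies, one failure every ~{hours:,.0f} hours")
```

Under these assumptions the rack of X4100s sees a power supply failure 13 times as often as either of the other two configurations.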
OpenSolaris Forums...Participate
b.s.c/relling