YakShaving: Shawn Ferry's Weblog
v. intr. [MIT AI Lab, after 2000: orig. probably from a Ren & Stimpy episode.] Any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you're working on.
20061004 Wednesday October 04, 2006
CEC: Avoiding Expensive Performance Disappointments

Bob Sneed

The problem is probably due to what such problems are usually due to.

For the most part, Bob believes root cause analysis is a waste of time and money.

Recommends the Patterns and AntiPatterns books.

AntiPatterns (Stuff we know doesn't work)

Description
Symptom
Consequences
How to fix it

Collecting Data

No data

Configuration
Appdata
Irrelevant

Mountains of Data

"Climb Every Mountain" -- Irrational Change Control Enforcement

"Monkey in the Middle" -- Uncooperative party between customer and 'right answer'

"Dueling Engagement Models"

More than I can succinctly and usefully indicate

The customer is not always right! (but we knew that)

The customer should know what is going on, but the customer should not be dictating what they think the problem is.

Translations

Customer will not give us requested data

"We do not have a good relationship"

"The customer will not follow our advice"

no good trust relationship

"customer will not cooperate, we've done all we can do!"

work on soft skills

Summary: work on the relationship

Technical Antipatterns

Problems on EOL
(more I can't condense quickly)



Data Collection

Start with business problem
Check for bogus problem
Check config
Check resource

Who the Heck was William Dawes?

See Harvard Business Review 12/2005

Bob made a slide using his kids to discuss bug processing/submission.

Pass the Ball or Carry it?

Referral

when, how, to whom

Collab

when, how, with whom

Experts look for two things:

patterns: things you have seen before

can be a slide or a white-paper

things you have not seen before

no substitute for expertise

Bob's take:

Lowest cost solution -- apply expert early (and often, but only for new issues)

Define solution, pass to assistant

Think Doctor and Physicians Assistant

A problem that can be understood can be addressed by the PA

Get an expert negotiator to convince a customer that "thing" has to happen; the $400K in lab work is not required to debug a simple config problem on a $1M system.

Resources
opensolaris.org
blogs
"Communities"

Computer Measurement Group
www.cmg.org

Oracle:
oaktable.net
hotsos.com

Cool Tools: cooltools.sunsource.net

GCCFSS gcc for sparc
Cool Stack Pre-compiled and SPARC-Optimized binaries
And More!

Books:
Solaris Internals
Solaris Performance and Tools

"Soft Skills"

"Getting to Yes"
"Optimizing Oracle Performance" Chapters 1-3
More books, not writing them all down ... Work on relationship skills (just do it)

Final Thoughts

Play position, pass the ball
Improve Skills
Work with Communities
Stay informed
Participate
Know your limits

Oct 04 2006, 10:28:10 PM EST Permalink

CEC: plane hijinks
Paul Hughes just had a book dropped on his head from the overhead bin.

We have another really packed flight.

Oct 04 2006, 05:59:00 PM EST Permalink

CEC: On the way home
Unfortunately it appears that my luggage is a bit the worse for wear.

The main zipper, which had already lost a track for one pull, has gotten mangled a bit since this morning.
Now I can't open it since the track itself is out of joint.
It will hopefully make the journey home.

Oct 04 2006, 02:31:00 PM EST Permalink

CEC: Final Presentation
Hal is going to wax his back, and audio only podcast it.
This is not part of our strategic vision.

Jonathan
Why do we rock...we innovate
Our strategy is a Venn diagram...or...not silos but solutions...we are the intersection.
We must continue to sell from the center.

Oct 04 2006, 02:15:00 PM EST Permalink

CEC: Storage: Hot Technology Cool Products

A discussion of Honeycomb and Thumper

It's all about the DATA.

Can't do anything without data.

Building on standardized components.

Data management

ID, Virtualize, Secure, Integrate

EVERYTHING

ID -- Build Trust

Virtualize -- Mask Complexity, Drive up Utilization and Performance

Sun has end-to-end virtualization available

Secure -- Security breach...tape lost, laptop stolen

Security everywhere...really everywhere...new T10000
Automatic encryption

Integrate -- Too many choices, Lack of Standards, Complex Interdependencies

We provide customer ready systems --- pre-built, racked, cabled, integrated and shipped
SAM-FS and QFS

Enterprise-Class NAS

Standard components and software...integrated!

Honeycomb...

Long term storage repositories (years)
Not OLTP, ERP or live database
Scales Billions of Objects

Standard components and software...integrated!
Storage API

Programmable storage
Extensible
Horizontal scaling
Reliability through self healing
large scale repository applications
Deferred service model (fail in place...like Sun Grid, schedule work when convenient)
Expect to deploy multi-petabyte systems
RAIN -- Redundant Array of Inexpensive Nodes
Backup/NDMP

Thumper...(x4500)

High performance 4 way server, 16GB ram
Is it a server or storage?...Both, it depends...if you install an application, it is a server.
ZFS and all related capabilities

HPC, D2D2T, Infiniband Server, iSCSI Target, SAM-FS, NFS/CIFS
information available online

Oct 04 2006, 10:00:37 AM EST Permalink

CEC: RAS Trends From Andromeda to ZFS

Richard Elling, PAE
Performance, Availability & Architecture Engineering

He remembered to do the session recording start information...So far the second session I have attended to do so.
(We forgot to do it, bad presenters)

Internal changes to the process in the past few years have made a big difference in both pre-release quality and post-release support.

MTBF, Heat, Moving Parts

Starting with some
MTBSI == Mean Time Between Systems Interruption
Similar to previous usage of MTBF, now Failure does not have to mean System/Service Interruption

More Parts --> lower MTBF (Bad)

[1] Integration == fewer parts == Higher MTBF
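
A rough way to see why more parts means lower MTBF: a minimal sketch, assuming every part has a constant failure rate, parts fail independently, and any single part failure takes the system down (a series model). The part MTBF values are made up for illustration and do not come from the talk.

```python
def series_mtbf(part_mtbfs_hours):
    """MTBF of a system that fails as soon as any one part fails."""
    # With independent parts and constant failure rates,
    # the individual failure rates (1 / MTBF) simply add up.
    total_failure_rate = sum(1.0 / mtbf for mtbf in part_mtbfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical part MTBFs in hours, for illustration only.
few_parts = [1_000_000, 500_000]
many_parts = few_parts + [200_000, 200_000, 100_000]

print(round(series_mtbf(few_parts)))   # integrated design, fewer parts
print(round(series_mtbf(many_parts)))  # same parts plus three more
```

Every part added contributes another failure rate to the sum, so the system MTBF can only go down. Redundancy changes the picture, which is where MTBSI comes in.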

Most common failures Fans, Power Supplies, Disks, Memory
(glad he said that, I say it too...woot validation)

Power Supply -- Built-in surge suppressor
Designed for 1M hours MTBF (Niagara and Galaxy vs PC at ~83K hours)
Use fewer, better, more expensive power supplies

Fans -- Fail too often (moving parts)
Hard to implement fans in small systems

System policies can affect MTBSI...e.g. T1000 firmware starts a shutdown on fan failure, not on system temperature
This may not be the preferred method but it is in the spec.

Heat Kills, run cool, run for a long time

Disk numbers from Seagate, rule of thumb

+15deg C == MTBF / 2
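
The rule of thumb above can be sketched as a small function. This assumes the halving compounds (each further 15 deg C halves the MTBF again), which is my extrapolation of the rule and not something stated in the talk; the 1M hour baseline is a made-up number.

```python
def mtbf_at_temperature_rise(base_mtbf_hours, delta_c, halving_step_c=15.0):
    """Apply the '+15 deg C == MTBF / 2' rule of thumb.

    Assumes the halving compounds for every additional step of
    halving_step_c degrees; this is an extrapolation, not vendor data.
    """
    return base_mtbf_hours / (2.0 ** (delta_c / halving_step_c))


base = 1_000_000  # hypothetical baseline MTBF in hours
for rise in (0, 15, 30):
    print(rise, mtbf_at_temperature_rise(base, rise))
```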

Smaller disks run cooler and have faster seek (Seagate website)

Boot Disk Trends
Nearly 100% of Sun's customers mirror boot disks in the DC

Software RAID is popular, but built-in HW RAID is becoming more popular; ZFS boot disks coming

Long term: solid state, CF slot in CP3010 and CP3020 also Moore's law

SAS vs Parallel SCSI...system now knows if the disk disappears before it tries to access it.

In case you weren't sure More parts == lower MTBF
See [1]

Memory
Memory Page Retirement -- FMA

What about mirrored memory?
Will anyone use it? Hey double memory sales!
(Probably only in the highest RAS demand environments)

See [1]

Processors
More transistors -- more self test circuits -- More redundancy

See [1]

OPL
Mainframe class reliability in a Solaris system

Software
Solaris FMA (Fault Management Architecture)
(Really cool: useful messages and diagnosis codes. Get a system and break things to see it.
You could pull a disk to see an example; however, this is not Richard's top choice in terms of display.)

SMF
Before: Panic
SMF: Restart
Ability to define relationships
(See my previous posts and/or Liane Praza's Weblog)

ZFS
Data reliability...we know if we get corrupted data before we try to use it.
(This is also cool, I have been running ZFS at home for at least 9 months, although I don't own any huge storage hardware for the complex case)

Data recovery time based on data space used not size of FS.
Not yet fully integrated with FMA, depends on driver, disks and system implementation.
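
The "know it is corrupt before we use it" idea can be shown with a toy example. This is only an illustration of checksum-on-read, assuming a checksum stored alongside each block in a Python dict; it is not ZFS's on-disk format or API, and the function names are hypothetical.

```python
import hashlib


def write_block(data: bytes) -> dict:
    """Store data together with a checksum computed at write time."""
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}


def read_block(block: dict) -> bytes:
    """Verify the checksum before handing data back to the caller."""
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise IOError("checksum mismatch: block is corrupt")
    return block["data"]


block = write_block(b"important data")
print(read_block(block))  # checksum matches, data is returned

block["data"] = b"important dat4"  # simulate silent corruption
try:
    read_block(block)
except IOError as err:
    print(err)
```

With a mirror or other redundancy, a failed check like this is the point where a good copy could be read instead.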

Virtualization

Good

Really fast boot! == lower MTTR
Guest OS portability
Hide faults from OS

Bad
Hides faults from OS (If the OS knows there is a fault it can respond appropriately)
Scheduling...Real Time app or time-sensitive devices
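
Why "really fast boot == lower MTTR" matters: a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The hours below are hypothetical and only show the direction of the effect.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


mtbf = 10_000.0  # hypothetical hours between service interruptions

slow_boot = availability(mtbf, mttr_hours=0.5)   # 30 minute recovery
fast_boot = availability(mtbf, mttr_hours=0.05)  # 3 minute recovery

print(f"slow boot: {slow_boot:.6f}")
print(f"fast boot: {fast_boot:.6f}")
```

For the same failure rate, cutting recovery time raises availability without touching the hardware at all.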

Fail in Place
Possibly a better term "Deferred Repair Strategy"
(We use this in Grid...with 1000s of nodes do you really need a service call for a single failure? We say no)
From here we go from MTBF to MTBSI, non-service interrupting failure

Scrubbing
We scrub just about everything, looking for faults and corruption before we try to use bad data.

Reduction in Parts
X4100 Rack

78 Power supplies

E25K

6 Power supplies

Blade 8000

6 Power Supplies

(Similar for fans)

Why then would you deploy a rack of X4100s instead of Blades, unless you have a specific need?
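
To put numbers on the power supply counts above: a rough sketch of expected failures per year, assuming every supply has the 1M hour design MTBF mentioned earlier and a constant failure rate. Applying the same MTBF to all three configurations is my assumption, not something from the talk.

```python
HOURS_PER_YEAR = 24 * 365


def expected_failures_per_year(unit_count, unit_mtbf_hours):
    """Expected number of unit failures per year at a constant failure rate."""
    return unit_count * HOURS_PER_YEAR / unit_mtbf_hours


PSU_MTBF_HOURS = 1_000_000  # design figure from the notes, assumed for all

configs = {"X4100 rack": 78, "E25K": 6, "Blade 8000": 6}
for name, psu_count in configs.items():
    failures = expected_failures_per_year(psu_count, PSU_MTBF_HOURS)
    print(f"{name}: {failures:.3f} expected PSU failures per year")
```

Under these assumptions the rack of X4100s sees 13 times as many power supply failures as either of the other two, simply because it has 13 times as many supplies.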

Get More Information
OpenSolaris Forums...Participate

b.s.c/relling

Oct 04 2006, 06:14:13 AM EST Permalink
