Data Processing
Valdis's Weblog
Tuesday Aug 12, 2008
Open source and free storage performance tool

After writing about storage performance and various bottlenecks, I can report that the tools we have been using at Sun, vdbench and SWAT (Storage Workload Analysis Tool), are now open source and generally available to the public. If you really want to know what is going on with your disk and tape storage, this is what you need. Many tools give you nice pictures, but not many that I have used over 20yrs give this granularity, e.g. you can see how different disk array designs behave depending on their cache algorithms. You can also see what happens when you want IOPS (small blocksizes but many I/O's) or bandwidth (fewer I/O's with larger blocksizes).

You can take your application, find out what blocksize it uses and the proportion of reads vs writes, and then tell vdbench to issue as many I/O's as possible to see where the performance of your storage device reaches saturation. The product is very feature rich, as are the outputs, and you can view the results using SWAT.

This has been received extremely positively by all the customers that we worked with and used this together. Top tip: on a large screen we had the app in one window, vdbench in another, the array performance monitor in another, SWAT in the next and the SAN performance monitor in another. As vdbench emulated the application I/O we could see how the storage, server etc behaved.

You can get them from the following links:

Swat 3.00

Vdbench 4.07

This runs on Solaris, Linux, Windows and other systems.

Please use them, many have asked where to get them before.

Well done Henk Vandenbergh.

Posted at 11:14PM Aug 12, 2008 by Valdis Filks in Technical  |  Comments[0]

Thursday May 22, 2008
Tuning for filesystem performance, specifically QFS

The holy grail of storage performance, here goes (this question comes up every week).

To make I/O performance perfect, a block of data needs to be transferred, unhindered and unaltered, with as few disassemblies and assemblies as possible as it travels from the CPU to the physical disks. I have explained this many times and tuned this for over 20yrs and the basic rules do not change, strange thing. Neither do Moore's Law or Amdahl's law, but they do get misquoted.

So if your application writes in 16K blocks, make sure that all components in the I/O path for this application work in 16K units or larger. But not too much larger, as you will be wasting resources.

-- Excerpt from a discussion a couple of weeks ago, when an app was writing data in 128KB blocks and we were using a shared HPC SAN filesystem called QFS, may be useful to someone ---

Suppliers (array manufacturers), industry etc mix up segment size and stripe width. This is what I do:

Understand your disk arrays and how they transfer data.

Segment size is the size of the block written to an individual disk (in your case 128KB).
Stripe width is the amount that the array controller writes to a RAID vol/grp/unit, i.e. number of disks x segment size (in your case 128KB x 4 = 512KB, as the person was using 4 disks).

Now the DAU (Disk Allocation Unit) that QFS uses to write a block of data should, by most best practices, match this stripe width to avoid write/read misses. What we want is that one QFS read or write results in only one "RAID group" read/write. But you can specify the DAU to be whatever you want, within reason.

Your application is writing data in blocks of 0.5MB, so yes, your DAU should be 512KB.

So you can have 4 disks of 128KB seg size, or 8 disks of 64KB seg size etc. 8 disks will give more performance than 4, and if you have an 8D+1P RAID 5 group this just happens to fit nicely. NB 1 disk is for RAID parity so you need to add this to the 8 disks for data.
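The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the list of segment sizes is an assumption based on the sizes mentioned later in this post, so check what your own array supports.

```python
# Hypothetical helper names; segment sizes below are an assumption, not an array spec.
SEGMENT_SIZES_KB = (8, 16, 32, 64, 128, 256)


def stripe_width_kb(data_disks, segment_kb):
    """Stripe width = number of data disks x segment size (parity disk not counted)."""
    return data_disks * segment_kb


def layouts_for(dau_kb, segment_sizes_kb=SEGMENT_SIZES_KB):
    """Return (data_disks, segment_kb) pairs whose stripe width equals the DAU."""
    return [(dau_kb // seg, seg) for seg in segment_sizes_kb if dau_kb % seg == 0]


if __name__ == "__main__":
    print(stripe_width_kb(4, 128))  # the 4 disk x 128KB case from the text
    for disks, seg in layouts_for(512):
        print(f"{disks} data disks x {seg}KB segment = 512KB stripe")
```

For a 512KB DAU this lists both layouts mentioned above (4 x 128KB and 8 x 64KB) along with the other combinations that multiply out to the same stripe width.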

Remember no matter how good a disk array's cache system is, with the sizes of databases etc that we have nowadays the cache can get overwhelmed very quickly if you do not have enough spindles, or disks as we call them. In the end performance is determined by the number of IOPS (I/O's per second) of the backend disks. Try a database load or an import/export of a table and watch your disk array performance deteriorate as the cache just cannot keep up.

Now IOPS. A very approximate rule is that the faster the disk spins the higher its performance will be. However, if you can get the average seek time and rotational latency from a disk manufacturer's data sheet then you can work out IOPS. IOPS can be calculated by using the following formula:

IOPS = 1000ms/(average seek time + rotational latency)
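As a worked example of the formula, here is a small Python sketch. The seek time used is a made-up illustrative figure, not taken from any real data sheet. The average rotational latency is derived as half a revolution, which is the usual definition.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: time for half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def disk_iops(avg_seek_ms, rotational_latency_ms):
    """IOPS = 1000ms / (average seek time + rotational latency), both in ms."""
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


if __name__ == "__main__":
    # Assumed example values: 15,000 rpm disk with a 3.5 ms average seek time.
    latency = avg_rotational_latency_ms(15_000)
    print(f"rotational latency: {latency:.1f} ms")
    print(f"approx IOPS per disk: {disk_iops(3.5, latency):.0f}")
```

Multiply the per-disk figure by the number of data spindles to get a rough ceiling for what the backend can sustain once the cache is saturated.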

Now QFS stripe options can also help here, but that is an even bigger story. QFS can do round robin writes and stripe across many disk array RAID groups/sets.

The trick is that the DAU is (most of the time, I am sure there will be exceptions) the same blocksize (currency) that the app uses, e.g. if the app writes 950KB the DAU should be 1024KB. Most apps behave in the normal powers of 2 KB (8,16,32,64,128,256,512,1024,2048), so you should have a close match as DAUs can be the same size.

What we try to do most of the time is to configure the system so that all the "gates" from the app to the disk RAID group are the same size. The "truck", i.e. the block of data, fits all the way from the app to the QFS filesystem to the RAID group without having to do 2 or more writes/reads for one block requested by the application. The nightmare scenario is that for an app writing one block the array does many writes, e.g. the app writes 128KB and the stripe width = 32KB, thus every time the app does a single read or write the controller has to ask (read/write) 4 times. This is serious I/O performance overhead, and what I can make lots of money fixing.
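The nightmare scenario is easy to put into numbers. The sketch below assumes every application block starts on a stripe boundary, which is a simplification; misaligned blocks can cost one more array I/O than shown. The function name is made up for this example.

```python
import math


def array_ios_per_app_io(app_block_kb, stripe_width_kb):
    """Number of stripe-sized array I/Os needed for one aligned application I/O."""
    return math.ceil(app_block_kb / stripe_width_kb)


if __name__ == "__main__":
    # The example from the text: 128KB app block, 32KB stripe width.
    print(array_ios_per_app_io(128, 32))
    # The well-tuned case: the block fits inside one 512KB stripe.
    print(array_ios_per_app_io(128, 512))
```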

Make sure that your block is not disassembled or assembled on its journey from the app to the disk. OK, the PCI bridges and HBA's may do this but we cannot change that. PCI lanes are getting into deep heavy tech stuff.

So I normally work this way. Find the app blocksize, then make the DAU the same, then make the stripe width the same as the DAU, then decide how many disks we want to use to get the IOPS, and then divide the stripe width derived from the above calculation by the number of spindles in the physical array's RAID group to get the segment size. Now the segment sizes are mainly fixed on the arrays that we use, from 8,16,32,64,128,256KB. So we sometimes do not get a round number to match the app blksize, DAU etc. However, I always make this "magic gate number for the blocks/trucks" larger than the DAU so as to avoid 2 physical reads/writes per application write/read, which is the crux of all application and I/O tuning.
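The working method above can be sketched as a small Python function. Treat it as a rough illustration under stated assumptions: the DAU is rounded up to the next power of 2 KB (as in the 950KB to 1024KB example), the supported segment sizes are the ones listed in the text, and when the division does not come out round the segment size is rounded up so the stripe width is never smaller than the DAU. The names `plan_layout` and `SEGMENT_SIZES_KB` are invented for this example.

```python
# Assumed list of segment sizes an array supports, in KB.
SEGMENT_SIZES_KB = (8, 16, 32, 64, 128, 256)


def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p


def plan_layout(app_block_kb, data_disks):
    """App blocksize -> DAU -> stripe width -> segment size, rounding up."""
    dau_kb = next_power_of_two(app_block_kb)
    ideal_segment_kb = dau_kb / data_disks
    # Pick the smallest supported segment size that is not below the ideal one.
    candidates = [s for s in SEGMENT_SIZES_KB if s >= ideal_segment_kb]
    if not candidates:
        raise ValueError("not enough data disks for this DAU with these segment sizes")
    segment_kb = candidates[0]
    return {
        "dau_kb": dau_kb,
        "segment_kb": segment_kb,
        "stripe_width_kb": segment_kb * data_disks,
    }


if __name__ == "__main__":
    print(plan_layout(950, 8))  # app writes 950KB to an 8 data disk group
    print(plan_layout(512, 3))  # awkward disk count: stripe ends up larger than the DAU
```

With an awkward disk count the stripe width comes out larger than the DAU, which wastes a little but keeps one application write inside one stripe.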

Storage heaven is where we have full stripe writes and reads. Which is implied by the application block size, DAU fitting the stripe width accordingly.

You can check this with various tools, e.g. by using vdbench (storage performance saviour) to do 10 writes or reads of a specific blocksize, and if the array does not do the same amount of I/O's (e.g. 10 writes/reads) then you are not hitting the G spot (array Group Size). So if you do 10 writes and the array did 20, your seg size quite likely is half of your DAU or app blocksize. Remember filesystems do strange things to application writes and can mutilate them in more ways than we can dream up, so we have to know and understand filesystems. A good old Unix test is the "dd" command: if you have an array with a certain number of disks in a RAID group, run dd to the actual raw LUN to see what it can do. Your filesystem layout which you use later, if correctly tuned, should get close to this number. If you get more then you are a candidate for a Nobel prize. If widely different then something between the app and the disk is messing things up. No chance of a Nobel prize, maybe a Darwin prize.
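If dd is not to hand, a rough Python analogue of a fixed-blocksize sequential read looks like the sketch below. It is only an approximation of the dd test: it goes through the operating system's buffered I/O and page cache, so numbers from a small file will be flattering, and the path and sizes in the demo are placeholders. The function name is made up for this example.

```python
import os
import tempfile
import time


def sequential_read_mb_s(path, block_size, count):
    """Read up to `count` blocks of `block_size` bytes; return (bytes_read, MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(count):
            chunk = f.read(block_size)
            if not chunk:  # end of file or device
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = (total / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return total, rate


if __name__ == "__main__":
    # Demo against a throwaway 4 MiB temp file, not a real LUN.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (4 * 1024 * 1024))
    try:
        nbytes, rate = sequential_read_mb_s(tmp.name, 512 * 1024, 8)
        print(f"read {nbytes} bytes at about {rate:.0f} MB/s")
    finally:
        os.remove(tmp.name)
```

Run it with the application blocksize first against the raw device and then against a file on the tuned filesystem, and compare the two figures as described above.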

Think of a truck going down a road where all the tunnels and lanes are the exact size or bigger, thus the truck's journey is never hindered and the driver does not have to unload/disassemble, load/assemble the truck (block of data) to get it through the tunnels, lanes, toll gates.

Now can the QFS community guys check this, as I have been known to write faster than I think. But I have got close to max specified speeds on 6140 and 6540 using this technique. Plus some old heritage and legacy arrays.

Now have I put the whole storage consultancy business out of a job? Not really, take this example. A woman calls a mechanic (call him Jerry) to fix her car as the engine does not work. Jerry takes a look at the engine and gets a hammer, he hits a specific part of the engine and the engine starts to work. Jerry says, that will cost you $500. The woman says, you must be joking, you just hit it with a $10 hammer. Well, says Jerry, the bill is $10 for the hammer and $490 for the knowledge of where to hit the engine. You pay for the knowledge, not the muscle.

Posted at 07:05PM May 22, 2008 by Valdis Filks in Technical  |  Comments[2]

Tuesday Apr 08, 2008
Water and electricity do not mix - water cooling CPU's is bad.

We have been trying for years to use less resources in the datacenter and use air cooling, now we are putting more water cooling into our server designs. I cannot see this as a technological improvement. Again we are curing a problem that should not have happened in the first place. We need to solve the problem in the first place, that is do not make CPU's/chips so hot that they need to be water cooled. We are just making computers so much more difficult to manage by adding water cooling within servers.

Problems with water cooling within computers.

1) Complexity (need water in addition to all other cabling within datacenter)
2) Extra costs (in addition to electricity, water cooling requires additional power in a datacenter, which adds costs)
3) Safety (water and electricity are a dangerous mixture, always a risk of water leaks)
4) Increased management (need a whole extra water cooling infrastructure and pipework)

I know that history has a habit of repeating itself, however can we not improve upon designs and techniques from the last century and make cooler chips?

Prevent hot chips, hot CPU's are a design flaw. A cure such as water cooling always costs an order of magnitude more than prevention of the problem in the first place.

A green datacenter should not have water cooled computers. Anyone using this water cooling to heat the building is adding immense complexity. What happens when we kick out the water cooled computer, do people in the building freeze? If we have a water pipe leakage, do we have to switch off the water cooled computer while the rest of the air cooled computers continue running? Water cooled computers just have too many downsides and add so much more complexity. We can make life simpler by avoiding water cooled servers and using air cooled systems.

I started work with old IBM mainframes, Amdahl, Hitachi etc. There were always extra problems with water cooling, much more to go wrong. Now do we really want to have water pipes all over the datacenter? Not really, ideally we try to have less cabling, pipes etc. That is why it is better that we have protocols and cabling improvements like FC, SAS, 10GigE with increased bandwidth and hopefully less to manage. Now we add water pipes in the computer room, it was problematic and costly 20yrs ago and still is.

As all electricians know, water and electricity are a lethal mixture. Avoid water cooled computers.

Posted at 07:54PM Apr 08, 2008 by Valdis Filks in Technical  |  Comments[0]

Tuesday Nov 06, 2007
Virtualisation, comfortable or tight trousers

Still lots of confusion about why we should virtualise and how to do it right. Just to be sure, I am talking about server virtualisation, not storage virtualisation. That I may cover later.

Virtualisation for the wrong reasons:

Easiest way to describe this: it is fashionable so we buy the latest stuff. Say we need a pair of trousers with a 36inch waist, but 32inch slim fit trousers are fashionable, so we buy them. We consolidate ourselves into a badly fitting pair of trousers. Works OK for a while, however when we demand more from the trousers and sit down, the seams split open. Trouser investment wasted, embarrassment we cannot hide, and we need to spend more time rightsizing, quickly get changed and put the correctly fitting pair of trousers on. Basically we blew a lot of money for no gain, maybe satisfied our vanity for a short time.

Quick list of what to look out for. If you or your staff are doing this, beware.

You have no historic capacity planning metrics or empirical evidence, including peak usage periods, that can show or prove that virtualisation/consolidation will make money or save money.
Other people are doing it so it must be right.
Techies are bored and need something new to do.
A virtualisation project has to be on my CV.
IT department has no answers to the company's problems and upper management's questions, so we need a new idea quick.
The solution is server virtualisation, but we do not know what the problem is.
Computer software costs are hidden from the IT budget.

Virtualisation for the right reasons:

Easiest way I can describe this: we have too many correctly fitting trousers in the cupboard, some are getting old and we are not using them (utilisation is low). The old trousers/style are an embarrassment to your wife and you are not allowed to show yourself in public in them. So you take your old rarely used trousers, throw them out, make space in your cupboard and instead use the newer, less worn trousers that do the job just as well.

Quick list of what to do. If your staff are doing this, leave them alone.

We have capacity planning tools which can scientifically and empirically prove that virtualisation will help the company grow and become more profitable, that IT will give the company a competitive edge or reduce IT costs.
I have bought too many small servers and they are not being used.
Too many systems to manage.
Not enough space in the datacenter.
Peak utilisation of a large number of the servers is very low.
I have a power and cooling problem.

Now as discussed earlier, virtualisation has been available for a very long time, you do not have to buy software to do this. Unix systems like Solaris can safely run many applications on one server or Operating System image, and have a 20yr track record of doing so. Say you have app A on OS A, app B on OS B, and app C on OS C. Now you can run them all on one server, e.g. app A, B and C on OS Solaris. Crazy, but nothing new.

Sun has been doing virtualisation, consolidation and running many apps on one server, in a highly secure, highly available manner, for 20yrs. It just happened that tight trousers came back into fashion. Sun has the whole range of well fitting and tight fitting trousers too, but only use them if you can get into them. I prefer an elasticated waist, which expands and contracts automatically with capacity demands.

Posted at 12:02PM Nov 06, 2007 by Valdis Filks in Technical  |  Comments[4]

Tuesday Oct 30, 2007
Virtualization - deadly when fashionable, saviour when qualified.

Why are we virtualising? Is it to have fewer servers? Then we are consolidating. Is it to improve server utilisation and efficiency?

We have systems that can run multiple applications, we have been doing this since the 1980's. Ever heard of an OS scheduler?

Have we made the mistake of putting each application on its own server? Who bought all those small servers, and why? Are we fixing the server sprawl problem?

So in the late 90's and early 2000's we consolidated to large servers. Then we thought that many smaller servers were cheaper and faster. Now we have figured out that they were not. But we do not want to admit that big consolidated servers were the right idea. So we can put a virtualisation spin on it and consolidate again.

As I have always said, rightsizing, downsizing and funsizing.

Now a couple questions that we should ask ourselves before we go down the virtualisation route.

Can we meet the peak workload requirements if all applications need all resources simultaneously? When we consolidate we need to be careful. The finance types will probably look at average utilisation. But do not use average utilisation as a measure, you need to be there for the peaks. Think about this: most metropolitan underground train systems are built to cope with peak loads, not average passenger traffic.
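To see why the average is misleading, here is a minimal Python sketch of the check. Everything in it is an assumption for illustration: the workload figures are invented, demand is expressed in arbitrary units, the 20% headroom is a placeholder, and it takes the worst case that all peaks coincide.

```python
def consolidation_check(workloads, host_capacity, headroom=0.2):
    """Compare summed average and summed peak demand against usable host capacity.

    workloads: list of dicts with 'avg' and 'peak' demand, in the same units as
    host_capacity. headroom is the fraction of the host kept in reserve.
    """
    usable = host_capacity * (1 - headroom)
    total_avg = sum(w["avg"] for w in workloads)
    total_peak = sum(w["peak"] for w in workloads)  # worst case: peaks coincide
    return {
        "usable": usable,
        "total_avg": total_avg,
        "total_peak": total_peak,
        "avg_ok": total_avg <= usable,
        "peak_ok": total_peak <= usable,
    }


if __name__ == "__main__":
    # Three made-up apps, each idling at 10 units but peaking at 40.
    apps = [{"avg": 10, "peak": 40} for _ in range(3)]
    print(consolidation_check(apps, host_capacity=100))
```

On averages the three apps look like an easy fit, yet the summed peaks exceed the host. Real capacity planning would use measured history to see whether the peaks actually overlap.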

Therefore virtualisation is all about capacity planning and understanding your application workload.

Now, do we also map all of our critical applications/services and make sure that they do not run on the same server, with many virtualized partitions? Let's not put all of our eggs in one basket, we have to spread our business critical apps across servers, computer rooms, sites etc.

Strangely enough we have been able to do this for the last 30 years on single servers running one operating system, where we use priority groups and memory fencing to increase utilisation. What has changed?

So if you do not want to spend lots of money to try this and if you want to try some new high performance, multiple core, low power servers. You can get virtualisation software here for free.

Virtualisation is not new, reduce your risks and costs.

http://www.sun.com/servers/coolthreads/ldoms/index.xml

Posted at 08:23PM Oct 30, 2007 by Valdis Filks in Technical  |  Comments[2]

Tuesday Oct 16, 2007
A backup is not an archive - Road to nowhere

Why is everyone getting backup and archive mixed up? We are using the wrong solution for the problem.

People are using backups to archive email, this is just a disaster waiting to happen. We need to use an archival system to do this. A backup of an email system, or any other application, is not an efficient methodology to retrieve emails or any other type of data when it is required to be stored for long time periods for legal, medical, historical integrity or compliance reasons.

Just to get things straight, when to use what.

Backups: Use this to be able to restore an application or dataset (logical or related grouping of data) to a specific point in time.

Snapshots: This is a copy of specific data files that is kept for short periods of time, e.g. hours, in case we need it back immediately.

Archive: For this we use dedicated archive tools and software which is integrated into the application, be it files, a database, email system, audio or visual (video/pictures) archive. An archive has a metadata and indexing system so that you can search for exactly the data that you want to retrieve. Then you go direct to the media to get it. You DO NOT restore from numerous backups to hopefully find the correct data (or email) from one of those restored backups.

For example, I once worked for a Bank which would take continuous snapshots of its transactions, however when those transactions were grouped into statements, those statements were archived to disk for a couple of months and then to tape for several years. If a customer needs to get a copy of a statement, recent statements, e.g. the last 6 months, are on a disk archive but long term they are on tape. Log on to your internet banking and have a look how far back you can get online statements or account history.

The application which runs the financial systems is backed up from the snapshot copy; normally on-site and off-site disk to disk to tape copies are taken in case of a disaster where the application has to be re-installed. The application is the logical grouping of related files that are required to run the application. Depending on the application we would take a snapshot and journal the transactions. If an application failed, a snapshot plus journal updates to the failure time would only enable a restore back to a Recovery Point Objective in the last 24hrs.

Time periods:

Backups: This is to take a copy of an application where the age of the data is from 1 day to several months. The point to which you restore to is the Recovery Point Objective.
Backup media: For recent data, 1 to several days old, that you may want to restore quickly, use disk for the backup. Data that has been backed up/stored for long periods, e.g. months, should be on tape. This is the DDT, disk to disk to tape architecture. Or you use a Virtual Tape Library, which can do this transparently for you. Your backup software should co-locate related data so that you can restore within a reasonable time period. You should not be trawling through tapes, the backup software will do this for you.

Snapshots: This takes a copy of data that is very recent and you may want it back within 1 minute to 1-2 days.
Snapshot media: This should be disk as you are copying active data. If you are keeping snapshots for longer periods of time, e.g. days or weeks, and the costs get too high, then you should use backups, DDT or VTL as above.

Archives: This is what you keep for very long periods of time e.g. months to 10's or 100's of years.
Archive media: This really has to be a Virtual Tape Library or disk based indexing system with 90% of the data sitting on tape. Archival systems should also have co-location features so that if you have to retrieve specific groups of related files, they are stored on the same tape or group of tapes. Tapes last for 10's of years, disks 5yrs if you are lucky. So every 10yrs or so you need to migrate and/or defragment your tape archive. With tape, if you do not use your data you do not pay for storing it; with disk you pay every day for the power to keep the disks spinning, powered up. MAID systems are physically heavy and are not as efficient from a storage to weight ratio, e.g. GB/KG.

Companies that see IT as a cost really need to move to tape for backup and archive; companies that see IT as a competitive tool that leverages their business should use Virtual Tape Libraries and snapshots to disks, while keeping an eye on costs and using the storage hierarchy to put retired data on tape. Remember the largest proportion of the total cost of ownership of storage is not the purchase cost. For cost reasons you just cannot have data archived for 10's or 100's of years on disk.

From a recovery time and cost perspective, the storage hierarchy is snapshots (disk or even memory/RAM), backup (disk and tape) and archive (tape). You need to match your data availability requirements to the appropriate media, taking into account cost and access/retrieval times.
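The time periods and media above can be condensed into a small lookup. This Python sketch is only a rough illustration: the cut-off values (2 days, 7 days, 180 days) are assumptions picked to match phrases like "1-2 days" and "several months", and age is only one input. It does not replace the point that an archive needs its own indexing tool rather than being an old backup.

```python
from datetime import timedelta

# Assumed cut-offs, chosen for illustration only.
SNAPSHOT_MAX_AGE = timedelta(days=2)
BACKUP_ON_DISK_MAX_AGE = timedelta(days=7)
BACKUP_MAX_AGE = timedelta(days=180)


def suggest_tier(age):
    """Map the age of the data you need back to a (method, media) suggestion."""
    if age <= SNAPSHOT_MAX_AGE:
        return ("snapshot", "disk")
    if age <= BACKUP_ON_DISK_MAX_AGE:
        return ("backup", "disk")
    if age <= BACKUP_MAX_AGE:
        return ("backup", "tape")
    return ("archive", "tape")


if __name__ == "__main__":
    for age in (timedelta(hours=3), timedelta(days=5),
                timedelta(days=60), timedelta(days=365 * 7)):
        print(age, "->", suggest_tier(age))
```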

Backups were designed to be used to restore data when that data was lost. Backups were NOT designed to be used as archives, microfiche was used for this and various types of optical disk technologies.

What we do not want to do is use the wrong methodology to solve the data retention issue, if we do we will end up with inefficient and error prone procedures. Restoring emails from backups is a prime example of this. Use an email archiving application.

Now I am talking about commercial users, home PC users just take backups/archives to whatever is the most cost effective media, CDROM, tape or some type of USB disk backup device.

In the future we could do something like this with ZFS, backup and take snapshots over the Network, oh no the network is the computer again. Check these out:

http://www.markround.com/archives/38-ZFS-Replication.html

http://blogs.sun.com/constantin/entry/useful_zfs_snapshot_replicator_script

With respect to backups and tape, have we forgotten where we came from, and do we not seem to know where we are going? We are using the wrong solution for the problem. As well described in the song Road To Nowhere by the Talking Heads.


Posted at 12:45PM Oct 16, 2007 by Valdis Filks in Technical  |  Comments[0]

Friday Sep 28, 2007
English the first language, that is open source

I was born in England, in an old English market town called Kettering, where we had tea parties and ham sandwiches (with mustard if we wanted to be exciting), ladies wore floppy hats, fair play and manners were very important, and cricket matches lasted a whole day. On the rugby field we would smash the hell out of each other, only to shake hands at the end of the match.

However, I grew up in another town nearby, Corby. Which had more of a mix, Scottish, Jamaicans, Welsh, Polish, Latvians and Londoners. It was a "new town" and pulled in people due to the large steelworks.

From an early age I listened to classic English spoken well, Glaswegians and Cockneys from the east end of London. My friends at school came from many ethnic groupings. At one stage I think I could track any English accent down to a specific town, no chance now. Due to all the computer manuals I have read, I do believe my English grammar has deteriorated, alternatively it may have just been globalised.

Now I live in Sweden and work for a geographic region, where English is the official language, I listen to Finnish people discuss problems with Belgians in English. We all understand each other sometimes too well.

However, for all my colleagues at Sun and generally anyone else who is interested in how the English language became so prevalent and the defacto business language I would recommend reading the following book.

"The Adventure of English", by Melvyn Bragg.

A short description is here. http://www.madaboutbooks.com/theAdventureOfEnglish/extract.htm

Nobody should be afraid of their English, the language belongs to the world.

This is different from most countries where government departments have controlled the language. As far as I know English has never been controlled by the English government and this may be why it has become so successful.

What I miss the most in Sweden is the wordplay we have in the English language, especially while sitting in a pub having a warm beer. However, while watching a children's football match two days ago, a Swedish friend said that the other team spends more time going forth rather than back and fort, as does our team. So the English tradition of wordplay and humour is spreading with the language.

Say anything that you want to in English as long as you do not insult others or yourself, it belongs to the whole world now. English does have some rather good insults too, but most of those words we inherited from other shores. Is the term "open source", GPL v2 or GPL v3, a curse, insult or compliment? English is all of these!

Posted at 03:17PM Sep 28, 2007 by Valdis Filks in Technical  |  Comments[0]

Monday Aug 27, 2007
Booting from a SAN (same as a network)

The network is only valuable if it is reliable. Same as the SAN or storage area network.

I have been working with SANs from the first days of SAN switches (pre-fabric) and boot volumes (since the IPL circular dials, in IBM m/f speak) for a long time. This is a good experience as the question of should we boot from a SAN or not still seems to come up. I thought that this was dead and buried, but due to this question being relentlessly raised by customers, here goes.

If you have good change management, DR procedures that are tested monthly, and a dual SAN that is well managed and stable, go ahead. You also need a test SAN. Quick table of what you need in place, you should be able to check these before you start to implement SAN boot.

Stable SAN.
Qualified SAN support staff.
Dual SAN and dual HBA's in servers.
Testing environment (SAN, servers, etc).
Monthly or quarterly DR testing/switchover testing, rebooting.
Well implemented change management system.

Applications do not need to be taken down or to have outages to do server reboots. With Sun Cluster you can dynamically switch the load/services between nodes/servers, so you can test your SAN boot by moving from one node to another. Shut down A while running on B, reboot A from the SAN, then let A join the cluster.

I have had a lot of companies running Sun Cluster for years, then after many changes over the years to the network, SAN etc, when they reboot, the cluster no longer works. This is because lots of changes have been made around it but nobody checked to see if they affected the cluster. Sun gets the call whether the outage was caused by Sun or not. That's what you pay service for.

This is not that much to do with technology, more to do with systems administration and management.

Other issues, business related that you need to be aware of.

3rd party or aftermarket disk storage suppliers will always recommend SAN boot, so that they can justify consolidating storage in one area. This is in their nature, as to get the storage business they promise and say yes to everything. Caveat emptor, when things start failing the 3rd party disk suppliers will be very "shy to be seen and aloof" and ask the server vendor to fix any problems. They will log into the "boot disk array" and say "all is OK with my tin box, must be the server". Now do you really want to be in this situation? You need a supplier who has system-wide experts, storage and server (including network, OS, cluster and application). Sun will also often propose consolidation if appropriate, it is good management to tidy up now and again.

System companies (ones that supply the whole stack) will not want to boot from the SAN unless the support and maintenance is watertight. They will recommend SAN boot in well managed and/or mature environments.

Alternatively, 3rd party storage vendors will rightly say that this is FUD from the system companies. Now we are into spin city. Time to make a decision. If you have a 3rd party storage supplier with good OS and server skills then your risks may be reduced.

NB the only system companies left are Sun and IBM, HP are just about staying in there, but as a colleague mentions, they do not look like they want to stay in the server and OS business. See Dave Levy's article http://blogs.sun.com/DaveLevy/entry/the_future_for_hp_ux

I worked with HP servers and HP-UX, so I used to be part of that tribe, no problems with it, just worried where it is going.

Summary, booting from a SAN is fine, if you can tick all the boxes in the table above. People, processes and procedures are more important here than the technology. If you do not know what I am talking about do not boot from a SAN.

Posted at 11:16AM Aug 27, 2007 by Valdis Filks in Technical  |  Comments[5]

Tuesday May 29, 2007
My 10yr old reliable home PC

Built this 10yrs ago, upgraded the CPU from 400MHz to 800MHz about 6yrs ago.
It ran Win98 until 6 months ago when I decided to upgrade to WinXP, also runs JDS in another partition.
I replaced the old 80GB disk with 2 new disks 6 months ago

Processor: 800MHz Pentium III
RAM: 320MB
Screen:17” CRT Iiyama, Vision Pro Master Pro
Disks: 1 x 250GB, 1 x 150GB ATA
Graphic Card: ATI Radeon 7500

We surf the web and write the occasional doc, clean out the cache and temp files weekly, works like a dream.

NB, my wife uses an Apple iBook, and for work I use a SunRay in the office. When travelling I use a 5 yr old Toshiba laptop running Java Desktop System 3, StarOffice 8 and Mozilla 1.7.

In all honesty it is getting a bit old, will probably upgrade in the next 2yrs, would like to utilise dual core CPU's, SAS drives, PCIe busses and whatever else will be around for the next 10yrs. Not sure if WinXP or Vista can utilise dual core CPU's.

Posted at 09:16PM May 29, 2007 by Valdis Filks in Technical  |  Comments[0]

Thursday May 24, 2007
The dupe in de-duplication

Encryption and de-duplication, can they co-exist?

These two methodologies, encryption and de-duplication, which exploit various new technologies/algorithms, have been worrying me for a while. Individually they sound good, however combined they do not seem to complement each other. Can they complement each other, or do they destroy/negate each other?

For example, if we encrypt data being sent to the storage device, will de-duplication still work? The de-dupe device receives encrypted data in which blocks are rarely duplicated. Blocks of data that are originally the same but then encrypted will rarely be the same, thus pattern matching will not detect duplicate blocks, reducing any de-duplication savings. Could we get de-dupe collisions, with two different blocks of data that when encrypted are the same? Then the de-dupe device stores one of these and deletes/loses the other block(s) of data. You get duped.
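The effect is easy to demonstrate with a toy model in Python. Two loud assumptions: the de-dupe device is modelled as a simple SHA-256 fingerprint per block (real products may match patterns differently), and the "encryption" is a throwaway XOR stream built from SHA-256 with a different nonce per block. It is NOT secure and only stands in for a real cipher that never encrypts two blocks the same way.

```python
import hashlib


def unique_blocks(blocks):
    """Count blocks a fingerprint-based de-dupe store would actually keep."""
    return len({hashlib.sha256(b).digest() for b in blocks})


def toy_encrypt(block, key, nonce):
    """Toy XOR stream cipher for illustration only; NOT secure."""
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(block):
        keystream.extend(
            hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        )
        counter += 1
    return bytes(p ^ k for p, k in zip(block, keystream))


if __name__ == "__main__":
    blocks = [b"A" * 4096] * 10  # ten identical 4KB blocks
    key = b"hypothetical-key"
    encrypted = [
        toy_encrypt(b, key, i.to_bytes(8, "big")) for i, b in enumerate(blocks)
    ]
    print("stored before encryption:", unique_blocks(blocks))
    print("stored after encryption: ", unique_blocks(encrypted))
```

Ten identical blocks collapse to one stored block in the clear, but once each is encrypted under its own nonce the de-dupe store sees ten different blocks and saves nothing.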

Various scenarios

1) Encrypt at source, de-dupe fails

If we encrypt at the source or the network and before the de-dupe device then de-dupe fails. It cannot detect duplicate source data, as it is encrypted.

2) De-dupe first, encrypt afterwards

If we do not encrypt at source, data is sent in the clear and is not secure. If we put de-dupe before the encryption device, do we then need a de-dupe unit for every server (initiator), and end up with de-dupe sprawl?

3) Hybrid, use what works and is available

Encrypting at source requires a lot of server CPU resource, taking it away from applications; there may be a solution in the near future. So de-duplicate on a VTL, then store on tape that is encrypted and managed securely with a Key Management Station (KMS).

I think the real answer is to manage your storage more efficiently, so that we do not need de-duplication, i.e. prevent the problem rather than cure it once you are suffering. Have storage management policies and an ILM strategy. However, human nature does not work like this, so I believe de-duplication will be popular where people do not have time to prevent problems but would rather solve them.

I am sure that the computer industry will solve this dichotomy, but at the moment there is a schism. For early adopters, caveat emptor: buyer beware.

If you want to save energy do not use it. If you want to save on storage costs delete it, or store on tape.

Posted at 02:56PM May 24, 2007 by Valdis Filks in Technical  |  Comments[4]

Monday Feb 12, 2007
Improving I/O throughput for T2000 servers
If you are using T2000 servers:

While tuning and benchmarking an application on a T2000 running Sol 10, using QFS 4.5 and an ST6140, we doubled I/O throughput (MB/s) and with it application performance. The Sun StorageTek 6140 has a theoretical bandwidth of 800MB/s; by increasing pci-max-read-request and tuning the filesystem and disk volume blocksizes, the T2000 was able to drive our storage at about 700MB/s. This was an I/O intensive app, so the same performance gains may not be seen in other environments. I think that raising the parameter from its default could be useful for general applications anyway. However, do not increase it to its maximum value, as this could do more harm than good.

A relatively unknown "qlc.conf" parameter, "pci-max-read-request", has a very low default setting. This restricts I/O performance, which makes the T2000 server slow and stops it exploiting the full capabilities of Sun storage. It can be fixed by setting the parameter to a higher value. This is only applicable to T2000 servers using PCIe.

Details below:

My recommendation: Set pci-max-read-request=2048 on all T2000 servers.

Parameter explanation:

Set in "/kernel/drv/qlc.conf"

#Name: PCI max read request override;
#Type: Integer, bytes; Range: 128, 256, 512, 1024, 2048, 4096
#Usage: This field specifies the value to be used for the PCI max read request, overriding the value programmed by the system.
#NOTE: The minimum value is 128 bytes; if this variable does not exist or is not equal to 128, 256, 512, 1024, 2048 or 4096, the ISP2xxx
# defaults to values specified by the system.

Background:
When you write data to a target, you are reading from PCI. The issue is that the T2000 defaults this to 128 bytes. To get the best performance it has to be set to 4096, but it's not that simple. The Sun PCI folks warn that if we set this too high, we eat up resources needed by other devices on the bus, such as ethernet. There is currently a project going on in the PCI team to make this process automatic. In the meantime, folks have set this to 2048 without seeing any issues.
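
As a rough illustration of why the 128 byte default hurts, the Python sketch below counts how many read requests one large transfer would need at each setting. It is deliberately simplified: it assumes a transfer is split evenly into read requests and ignores everything else on the bus. The helper names are hypothetical, and the `name=value;` line format is my assumption about qlc.conf syntax, so check it against the comments in your own file.

```python
ALLOWED_SIZES = (128, 256, 512, 1024, 2048, 4096)  # bytes, from the qlc.conf comment


def qlc_conf_line(read_request_bytes: int) -> str:
    """Return a property line for qlc.conf, rejecting sizes outside the documented range."""
    if read_request_bytes not in ALLOWED_SIZES:
        raise ValueError(f"{read_request_bytes} is not one of {ALLOWED_SIZES}")
    # Assumed driver .conf syntax: name=value terminated by a semicolon.
    return f"pci-max-read-request={read_request_bytes};"


def read_requests_needed(transfer_bytes: int, read_request_bytes: int) -> int:
    """Simplified count of PCI read requests needed to move one transfer."""
    return -(-transfer_bytes // read_request_bytes)  # ceiling division


if __name__ == "__main__":
    transfer = 0x800000  # one 8MB I/O
    for size in (128, 2048, 4096):
        print(f"{size:>4} byte requests: {read_requests_needed(transfer, size)}")
    print(qlc_conf_line(2048))
```

Under these assumptions an 8MB write needs 65536 read requests at 128 bytes but only 4096 at the recommended 2048 bytes, which is where the throughput gain comes from.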

Here are some other things to check for anyone optimizing I/O throughput on Sun servers in general. These are the Solaris parameters that need attention if you want to improve throughput. Only change and test them during development, not on production systems. As always, use good project management and change control procedures when implementing any kernel changes.

Allow Solaris and the Solaris disk drivers to perform the maximum size I/Os you anticipate. Setting these numbers lower than what the application and file system prefer results in fragmentation of I/Os. Setting these numbers higher consumes more memory and may cause some very old disk or channel hardware to become unstable, but for recent large memory footprint machines and disks this is typically not an issue. The system must be rebooted for these changes to take effect:

Set the maximum I/O size for Solaris by editing the /etc/system file.
# /etc/system
# this sets maximum physical I/O size to 8MB
set maxphys = 0x800000

Set the maximum I/O size for fibre channel disks by editing the /kernel/drv/ssd.conf file:
ssd_max_xfer_size = 0x800000;

You have to set up /kernel/drv/sd.conf for individual SCSI disks:
name="sd" class="scsi"
sd_max_xfer_size=0x800000
target=3 lun=0;
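
The fragmentation effect mentioned above can be sketched with a little arithmetic. This Python example is a simplification: it assumes a request larger than the transfer limit is simply cut into limit-sized pieces. The 128KB "small limit" is a hypothetical comparison value, not a statement about what any particular system defaults to, and `ios_issued` is a made-up helper name.

```python
def ios_issued(request_bytes: int, max_xfer_bytes: int) -> int:
    """Number of physical I/Os needed when a request is split at the transfer limit."""
    return -(-request_bytes // max_xfer_bytes)  # ceiling division


TUNED_LIMIT = 0x800000      # 8MB, the value used in the settings above
SMALL_LIMIT = 128 * 1024    # hypothetical smaller limit, for comparison only

if __name__ == "__main__":
    for request in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024):
        print(
            f"{request // 1024:>5}KB request: "
            f"{ios_issued(request, SMALL_LIMIT):>2} I/Os at 128KB limit, "
            f"{ios_issued(request, TUNED_LIMIT)} I/O(s) at 8MB limit"
        )
```

With the smaller limit, a single 8MB application request turns into 64 separate I/Os; with the limit raised to 0x800000 it goes out as one.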

Solaris queue depth: how many SCSI commands can be queued to a LUN.
Set this too low and you will not get maximum performance; set it too high and you can get device overruns, which trigger recovery processes that redrive/retry I/O and cause a slowdown. So normally stay with 64 (NB check with your storage supplier, as 3rd party equipment is often not as well optimized for multithreading as Sun Storage).

ssd_max_throttle=64
(change sd_max_throttle for SCSI disks)
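
A back-of-the-envelope way to think about overruns is sketched below in Python. It assumes the throttle applies per LUN and that all LUNs share a single command queue on the array; both are simplifications. The `array_queue_limit` figure is purely hypothetical; the real number has to come from your storage supplier.

```python
def outstanding_commands(luns: int, throttle: int) -> int:
    """Worst-case number of commands the host can have queued to the array."""
    return luns * throttle


def max_safe_throttle(luns: int, array_queue_limit: int) -> int:
    """Largest per-LUN throttle that keeps the worst case within the array's limit."""
    return max(1, array_queue_limit // luns)


if __name__ == "__main__":
    array_queue_limit = 2048  # hypothetical; ask the storage supplier
    for luns in (16, 64):
        worst = outstanding_commands(luns, 64)
        verdict = "ok" if worst <= array_queue_limit else "possible overrun"
        print(
            f"{luns} LUNs at throttle 64: {worst} commands ({verdict}), "
            f"max safe throttle {max_safe_throttle(luns, array_queue_limit)}"
        )
```

With these made-up numbers, 16 LUNs at a throttle of 64 stay within the limit, while 64 LUNs would need the throttle reduced to 32.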

There may be other configuration files and option settings for non-Sun host bus adapters and drivers. See the third-party documents for more.

Caveat: 3rd party arrays often do not multi-thread and scale in performance as well as Sun Storage does. With large queues it is possible to “overrun” 3rd party storage. We have often had to reduce the queue depth (ssd_max_throttle) for slower 3rd party devices. I will not get into this, as 3rd party or aftermarket suppliers will often disagree with me.

This is a personal recommendation, which will hopefully save me from documenting this elsewhere and having to explain it several times a month.

This is not a Sun Engineering or official Sun Solaris patch, fix, etc.
Posted at 10:06PM Feb 12, 2007 by Valdis Filks in Technical  |  Comments[15]