Phil Lewis
Apologies as this turned out to be rather lengthy....
I run a development engineering lab for a financial services company and we
are running into a rather peculiar but very troubling problem on some of our
performance servers when copying, moving or backing up very large files,
i.e. files greater than about 52 Gigabytes in size (yes GB not MB).
Primarily these are SQL Server database files but we also see the same
problems copying or relocating large Virtual Server Hard Drive files of that
size or larger.
The problem is actually an old one that I think has never been dealt with.
The end result for us on Windows 2000 Server and Windows Server 2003 R2 is
that after copying about 52GB of the file, Windows starts reporting "Windows
delayed write" errors, and at that point the file copy collapses and stops.
Although the system reports the copy is still running, no further data is
being successfully copied. All file I/O and other Windows processes slow
down considerably (more on this in a little bit). In each case where a
delayed write error is generated, Event Viewer shows the first error as
Event ID 50 and/or Event ID 26. The problem is seen when using both drive
letters and UNC paths. We asked about hotfixes for Server 2003 SP2 but were
told by support there were none, as the fix described in KB Article
890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into SP2 and
did not apply to our issue.
We have only recently begun to see these problems because until recently
most of our performance testing model used fairly small working sets for
data (typically under 100GB total) and thus each file-group in our databases
was less than 40GB so we never really saw a problem. I've seen this problem
on ALL versions of Windows including Server 2008. In the Server 2008 case,
the O/S collapses completely and cannot be shut down, and no delayed write
error is reported. In most cases we have to power off the server to get the
problem system to recover.
I've tested this on a variety of servers (listed next) and in as many cases
as possible I tested on multiple servers with the same config and with
different O/S editions. I also first tested one of the servers that was
experiencing the problems using small files (files from 1 byte up to
40GB in size). The test set I use is approx 870GB in total size and
has been copying continuously on this server for about 40 days now. I think
at last check it had copied around 2,973 Terabytes of data, all without
error.
Primary test servers and configurations:
HP ML570 Quad Xeon w/ 16GB RAM and 1.2TB local storage + 3.4TB SAN
  storage on a Nexsan SATABoy box. WinSvr 2003 R2 SP2 x86 and x64.
HP DL580 Dual Xeon w/ 8GB or 16GB RAM and 1.2TB local storage + 3.4TB SAN
  storage on a Nexsan SATABoy box. WinSvr 2003 R2 SP2 x86 and x64.
HP DL380 G4p Dual Xeon w/ 4GB RAM and 300GB local storage + 3.4TB SAN
  storage on a Nexsan SATABoy box. WinSvr 2003 R2 SP2 x86.
Dell 2950 dual quad-core 2.83GHz w/ 32GB RAM, 320GB local storage, 4.2TB on
  an EMC AX150 SAN and 3.4TB SAN storage on a Nexsan SATABoy box.
  WinSvr 2003 R2 SP2 x86 and x64, WinSvr 2008 x64.
Dell 2950 dual quad-core 2.5GHz w/ 4GB RAM, 140GB local storage, 4.2TB on
  an EMC AX150 SAN and 3.4TB SAN storage on a Nexsan SATABoy box.
  WinSvr 2003 R2 SP2 x86 and WinSvr 2008 x86.
Dell Precision 460T workstation Dual Xeon 2.4GHz with 4GB RAM and 1TB of
  local storage (SATA), 3.4TB SAN storage on a Nexsan SATABoy box via iSCSI.
  WinSvr 2003 R2 SP2 x64.
With the exception of the ML570 I've tested on multiple servers of the same
type. All servers have succeeded in copying large volumes of files 40GB and
smaller.
On our HPs all drives are hot-swappable SCSI, either 10K or 15K. We don't mix
spindle speeds in RAID groups. On our Dells all local drives are SAS; on
the EMC AX150 and on the SATABoy all drives are SATA. On all systems, copies
of files up to 40GB are all successful.
All systems are running the latest BIOS and we have seen the same behavior
on prior BIOS versions. All disk controller firmware is updated to the latest
version and, as with the BIOSes, the same behavior existed on earlier
versions. We have checked and updated hard disk firmware where new versions
are available, with the same result as for controller firmware. On local hard
drives we run RAID 1 or RAID 5 to get best performance or max capacity. Both
RAID modes exhibit the same behavior. I have tested on the HPs with no RAID
at all and the same results occurred.
I try not to do specialty O/S builds for our lab environment. I build a
straight default Windows O/S configuration, fully patch it with Microsoft
patches and burn-in test the system, then go test for this problem. I do not
tweak system settings or apply registry hacks until I get baseline test
data. In all cases here for the file copy tests I have not tweaked the
system settings or registry at all. Our servers are set for background
performance for the system cache, although we tested with 'foreground' set
without success too. We've tried large and small pagefiles and have moved
pagefiles to separate disk spindles to see if it made a difference.
I've tried all manner of file copy and file sync tools, but what it comes
down to is that if the file being copied is written to the system's storage
(local or SAN), the system will collapse and file copying will fail
somewhere around 52 - 58GB copied. Windows Server 2008 has given me my
best look into the problem. What appears to be happening is that the
system cache keeps expanding until all physical memory is used, the paged
pool keeps growing until it hits around 380MB and the non-paged pool hits
about 82MB (I think the latter is right). What I then see is that the CPU
flatlines, as does the Total Disk bytes written in Perfmon, but the physical
memory usage history in Task Manager suddenly starts ramping until it gets
out around 70GB. Then everything is done and either the system hangs
(Server 2008) or delayed write errors occur.
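One way to probe the cache-growth observation above is a copy loop that
forces written data out to disk at regular intervals, so the amount of
unwritten data held in the system cache stays bounded. The following is a
minimal Python sketch under that assumption, not something tested on the
configurations described here. The function name copy_with_periodic_flush
and its chunk_size and flush_every parameters are hypothetical, and whether
this avoids the delayed write errors is unknown.

```python
import os


def copy_with_periodic_flush(src, dst, chunk_size=4 * 1024 * 1024, flush_every=64):
    """Copy src to dst in fixed-size chunks, syncing to disk periodically.

    chunk_size and flush_every are illustrative defaults (an assumption),
    giving one forced sync roughly every 256MB written.
    Returns the number of bytes copied.
    """
    copied = 0
    chunks_since_flush = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            chunks_since_flush += 1
            if chunks_since_flush >= flush_every:
                # Push Python's buffer to the OS, then ask the OS to
                # commit its cached data for this file to disk.
                fout.flush()
                os.fsync(fout.fileno())
                chunks_since_flush = 0
        # Final sync so the tail of the file is on disk before returning.
        fout.flush()
        os.fsync(fout.fileno())
    return copied
```

If a copy of this kind gets past the 52 - 58GB mark where an ordinary copy
fails, that would point toward cache handling rather than the storage
hardware; if it fails at the same place, the cache is a less likely culprit.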
One place where I do not seem to see the problem is SAN drive to SAN drive
copies on the EMC SAN. I always see this problem on the SATABoy SAN when
copying large files to the SAN volumes, regardless of cache settings. I can
back up the files to tape, but due to their size it's an expensive option
both in media cost and in time to back up and restore the data. My preferred
option is to back up to removable hard disk (external USB - SATA). Sure,
it's slower, but it would offer operating efficiencies I cannot get with
tape right now (if it worked). Some testing has centered on using
external drives, but most of the testing on my systems has been to copy or
move the files from one volume to another on the server. It doesn't matter
whether I turn caching off or on for external USB drives or local drives. I
have had very occasional success on the EMC SAN by ensuring that Windows O/S
disk caching is off. Success using this method has been spotty and limited
to servers with 32GB or more of RAM. One very telling test setup was to
populate the server with 128GB RAM and run Windows Server 2008 x64. In that
case almost all file copies were successful, although they became painfully
slow after about 60GB was copied and it took over 5 hours to copy the last
~35GB of a 95.8GB file.
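To put a number on "painfully slow", the figures above (about 35GB in a
little over 5 hours) can be turned into an average rate. This is a small
Python sketch of that arithmetic only. It assumes binary units (1GB = 1024MB)
and exactly 5 hours, so the real rate was somewhat lower; the function name
average_rate_mb_per_s is hypothetical.

```python
def average_rate_mb_per_s(gigabytes, hours):
    """Average transfer rate in MB/s, assuming 1GB = 1024MB."""
    return gigabytes * 1024 / (hours * 3600)


# Last ~35GB of the 95.8GB file took over 5 hours, so this is an upper bound.
rate = average_rate_mb_per_s(35, 5)
print(f"{rate:.2f} MB/s")  # prints "1.99 MB/s"
```

That is roughly 2 MB/s at best for the tail of the copy.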
I tried using Backup Exec and MsBackup to back up the files to a hard disk
but it failed every time. When I run the same backup to tape it is
successful.
This is leading me to think the problem is generated in the lower-level file
system filter drivers. I've tested with and without antivirus software in
the mix and have likewise tested systems that are built raw with no patches
at all, and see the same problems. I also tried splitting the file into
chunks and copying the pieces. While I can split the file, I cannot join
it again, as the processes all seem to rely on creating a temp file, and
copying the large temp file always results in delayed write errors. I've
also tried zipping up the file to reduce its size (database files compress
really well) but that process likewise requires a file copy of a large file,
and at some point that fails. It should go without saying I've tried using
SQL DB Backup writing to disk storage and it fails every time; tape is
successful. In fact it was this very act that caused me to begin
investigating the problem in the first place.
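For the split-and-join approach above, a join step does not strictly need an
intermediate temp file: the pieces can be appended straight onto the
destination file. The following is a minimal Python sketch of that idea, not
the tools actually used here. The names split_file and join_parts and the
".partNNNN" naming scheme are hypothetical, and the join still writes one
large file, so it may well run into the same failure.

```python
import os


def split_file(src, part_size, chunk_size=1024 * 1024):
    """Split src into numbered pieces of at most part_size bytes.

    Pieces are written next to src as src.part0000, src.part0001, ...
    Returns the list of piece paths in order.
    """
    parts = []
    index = 0
    with open(src, "rb") as fin:
        while True:
            part_path = f"{src}.part{index:04d}"
            written = 0
            with open(part_path, "wb") as fout:
                while written < part_size:
                    chunk = fin.read(min(chunk_size, part_size - written))
                    if not chunk:
                        break
                    fout.write(chunk)
                    written += len(chunk)
            if written == 0:
                # Source was exhausted before this piece got any data.
                os.remove(part_path)
                break
            parts.append(part_path)
            index += 1
            if written < part_size:
                break  # short piece means end of source
    return parts


def join_parts(parts, dst, chunk_size=1024 * 1024):
    """Append each piece directly to dst, with no intermediate temp file.

    Syncs to disk after every piece. Returns the total bytes written.
    """
    total = 0
    with open(dst, "wb") as fout:
        for part_path in parts:
            with open(part_path, "rb") as fin:
                while True:
                    chunk = fin.read(chunk_size)
                    if not chunk:
                        break
                    fout.write(chunk)
                    total += len(chunk)
            fout.flush()
            os.fsync(fout.fileno())
    return total
```

Syncing after each piece is a design choice made for this sketch so that only
about one piece's worth of data is waiting in the cache at a time.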
In days gone by there were loads of users seeing this problem on Win XP
copying much smaller files, and there are some other people seeing this
problem today on Windows Server. Microsoft are very quiet on the subject for
Windows Server, I think in no small part because very few people are seeing
the problem and there is no demand to identify the problem or to fix it. I
cannot believe, though, that I'm one of the first people to see the problem.
I'll quite happily accept it's a configuration issue if someone can tell me
how to fix the problem! No attempt to tweak a system has yielded any
success. What also sucks is that you typically have to wait ~30 mins to find
out whether the problem will manifest.
I'm about ready to escalate this issue to Microsoft. I think I now have
enough test data to do so, but thought I would bounce this off others to see
if anyone else has a solution or guidance first.
Phil
Checkfree: