
Ever wanted to increase your skills but don’t have the $ to go back to college? Check out MIT’s OpenCourseWare – http://bit.ly/axEBzr

While working on a few of my projects, I ran into trouble with extremely large XML files: every computer I tried to open them on failed miserably. I scoured the internet for a program that would split such files into smaller pieces automatically. The best program I could find was called “Large Text File Viewer”, which was far from what I was looking for, although I was still happy to have something that could at least open the large files.

I started to “chunk” the files by hand, and after creating a few I thought to myself, “there has to be an easier way to do this”. I went back to searching and, after a lot more digging, finally came across a simple script that did exactly what I wanted. I took the script and modified it to follow OOP conventions a bit more closely (previously all addresses, URLs, etc. were hard-coded).

Here is the class that I ended up creating:

class xmlChunk
{
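// No constructor is needed: doChunk() receives every setting as an argument.
// (A method named after its class is no longer treated as a constructor in PHP 8.)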
/*$basefilename // the base file name for the chunks
$xmlfile // the xml file name to be processed
$xmldatadelimiter // core data delimiter
$xmlitemdelimiter // record delimiter
$chunksize = 2000; // number of records in each chunk file
$dir // path to where splits will be stored
*/
function doChunk( $basefilename, $xmlfile, $xmldatadelimiter, $xmlitemdelimiter, $chunksize=2000, $dir= "/var/www/public_html"){
//initialize vars
$begin=time(); // script start time
$start = time(); // last gate time
$interval=time(); // current gate time
$minutes=0; // whole minutes elapsed so far
$filenum = 1; // start chunk file number at 1
$recordnum = 1; // start at record 1

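// xmlchunk file header, written at the top of every chunk after the first
// (XML declaration assumed; adjust version/encoding to match the source file)
$xmlstring ='<?xml version="1.0"?>'."\n";
$xmlstring.="<$xmldatadelimiter>\n";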
//dirs and files

$exportfile = "$dir"."/splits/$basefilename-$filenum.xml";

//start processing
echo "Processing (".$dir."/$xmlfile)\n";

$handle = @fopen($dir."/$xmlfile","r");

if ($handle) {

while (!feof($handle)) {

$buffer = fgets($handle, 4096);
// if item delimiter reached
// increment record number iterator
// (ereg() was removed in PHP 7; strpos() does the same literal match)
if (strpos($buffer, "</$xmlitemdelimiter>") !== false) {
$recordnum++;
}
//write line to chunk file
error_log("$buffer",3,$exportfile);
// if chunk limit reached then start to
// close the file with well formed xml
if ($recordnum>$chunksize) {

// post feed end tag
error_log("</$xmldatadelimiter>\n",3,$exportfile);

// and increment file number to start new log file chunk
// reset record counter to its starting value so every chunk holds $chunksize records
$recordnum=1;
$filenum++;

//update export file name
$exportfile = "$dir"."/splits/$basefilename-$filenum.xml";

//echo status report to STDOUT
echo "Starting segment $filenum. Records so far: ".($chunksize*($filenum-1)).".\n";

// write new chunk xml file header
error_log($xmlstring,3,$exportfile);
}
//put in a catch so that script doesn't run riot and
//will die after X number of cycles
if ($filenum>5000) {
die();
}

if (($interval-$start)>60) {
$minutes++;
echo $minutes." Minutes so far.\n";
$start=time();
} else {
$interval = time();
}
}
fclose($handle);
} else {
echo "Unable to open file! (".$dir."/$xmlfile)\n";
}
$procend = time();

echo "\n####\n";
echo "Split Complete (".floor((($procend-$begin)/60))." Minutes)\n";
}

}
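
For readers on a newer PHP version, the same line-by-line idea can be sketched as a standalone function. This is only a rough sketch under stated assumptions, not the script the class was adapted from: the name splitXmlByLines and its parameters are hypothetical, it writes with fwrite() instead of error_log(), and it assumes each record's closing tag appears at most once per line.

```php
<?php
// Hypothetical standalone variant of the line-based splitter above.
// Assumption: each closing item tag (e.g. </item>) appears at most once per line.
function splitXmlByLines(
    string $sourceFile,
    string $outPrefix,
    string $rootTag,
    string $itemTag,
    int $chunkSize = 2000
): int {
    $in = fopen($sourceFile, 'r');
    if ($in === false) {
        throw new RuntimeException("Unable to open $sourceFile");
    }

    // Header written at the top of every chunk after the first;
    // the first chunk inherits its header from the source file itself.
    $header = "<?xml version=\"1.0\"?>\n<$rootTag>\n";
    $closeItem = "</$itemTag>";

    $fileNum = 1;
    $records = 0;
    $out = fopen("$outPrefix-$fileNum.xml", 'w');

    while (($line = fgets($in)) !== false) {
        fwrite($out, $line);
        if (strpos($line, $closeItem) !== false) {
            $records++;
        }
        if ($records >= $chunkSize) {
            // Close the full chunk with the root end tag, then start the next one.
            fwrite($out, "</$rootTag>\n");
            fclose($out);
            $fileNum++;
            $records = 0;
            $out = fopen("$outPrefix-$fileNum.xml", 'w');
            fwrite($out, $header);
        }
    }

    fclose($out);
    fclose($in);
    return $fileNum; // number of chunk files written
}
```

As with the class, the last chunk takes its closing root tag from the source file. If the record count is an exact multiple of the chunk size, the final chunk holds only an empty root element.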

To use the xmlChunk class, create a script that includes the class file and calls doChunk() as follows:

require('xml_chunk.class.php');
$basefilename = "filenameChunked";
$filename = "file.xml"; // file name only; doChunk() looks for it inside the directory passed below
echo "Creating ".$basefilename." Splits\n";
$chunk = new xmlChunk();

$chunk->doChunk($basefilename, $filename, 'baseXMLtag', 'itemXMLtag', 2000 /*limit*/, "/path/to/directory" /*directory must contain the XML file and a folder named "splits"*/);
unset($chunk);
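
After a run, a quick sanity check is to count the closing item tags in each file in the "splits" folder. The sketch below is one possible way to do that. The helper name countRecordsPerChunk is hypothetical, and the paths and tag name are the same placeholders used in the call above.

```php
<?php
// Hypothetical helper: counts closing item tags in every chunk matching a glob pattern.
function countRecordsPerChunk(string $pattern, string $itemTag): array
{
    $counts = [];
    $files = glob($pattern) ?: [];
    natsort($files); // natural order, so name-2.xml comes before name-10.xml

    foreach ($files as $file) {
        $handle = fopen($file, 'r');
        if ($handle === false) {
            continue; // skip unreadable files
        }
        $count = 0;
        // Read line by line so a large chunk never has to fit in memory.
        while (($line = fgets($handle)) !== false) {
            $count += substr_count($line, "</$itemTag>");
        }
        fclose($handle);
        $counts[basename($file)] = $count;
    }
    return $counts;
}

// Placeholder paths and tag name, matching the example call above.
$report = countRecordsPerChunk('/path/to/directory/splits/filenameChunked-*.xml', 'itemXMLtag');
foreach ($report as $name => $records) {
    echo "$name: $records records\n";
}
```

Every chunk except possibly the last should report the same number of records. A chunk that reports far more or fewer usually means the item tag name passed to doChunk() does not match the file.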