Back

Sorting 1 billion numbers

Sun, Jul 25 20213 min read
Nathaniel
Today we are going to try to sort 1 billion numbers. To be exact, they are 1 billion 32bit signed integers. Which have a size of around 4GB. Loading all of them into memory at once is possible, but not really the point here. I'm trying to practice external sorting, a way of sorting data that are too large to fit inside memory.

Beginning

The easiest way of external sorting is merge sort. We can use the large-capacity external storage to divide the 1 billion numbers into small pieces, loading and sorting them one by one, then write each piece into the external memory after sorting.
After the segmentation is completed, these small pieces are merged and sorted. At the same time, in the process of merging and sorting, the maximum value obtained will be written into the file in real time, so that low memory usage can be guaranteed.

Divide

Suppose we have a file named billion, which contains 1 billion 32 bit signed int binary data. We need to cut it up into a number of pieces. Here I divide it into 100 parts, with each part containing 10 million numbers. Each one of them need to be sorted. Here we use qsort to sort part of the file.
static __int32 piece[PIECESIZE];
int comp( const void *a, const void *b )
{
return *(__int32*)a - *(__int32*)b;
}
FILE *billion = fopen( "billion", "rb" );
int i = 0;
while( i < TOTAL / PIECESIZE ){
fseek( billion, PIECESIZE * i, SEEK_SET );
fread( piece, sizeof( *piece ), PIECESIZE, billion );
qsort( piece, PIECESIZE, sizeof( *piece ), comp );
char fileName[300];
snprintf( fileName, sizeof(fileName), "pieces/piece%d", i );
FILE *outFile = fopen( fileName, "wb" );
fwrite( piece, sizeof( *piece ), PIECESIZE, outFile );
fclose( outFile );
++i;
}

SORT

After 1 billion integers have been divided into 100 blocks, and these blocks were already sorted, we can read these blocks, apply merge sort, and write the results to the file in real time. During this period, memory consumption will remain low.
FILE *outFile = fopen( "out", "wb" );
FILE *fileList[FILEAMOUNT];
int i = 0;
for( i = 0; i < FILEAMOUNT; ++i ){
char filePath[200];
snprintf( fileName, sizeof(fileName), "pieces/piece%d", i );
fileList[i] = fopen( filePath, "rb" );
}
int numbers[ FILEAMOUNT ];
for( i = 0; i < FILEAMOUNT; ++i ){
fread( numbers + i, sizeof( __int32 ), 1, fileList[i] );
}
int n = 0;
while( 1 ){
int minIndex = MinIndex( numbers );
if( minIndex == -1 ) break;
fwrite( numbers + minIndex, sizeof( __int32 ), 1, outFile );
++n;
fread( numbers + minIndex, sizeof( __int32 ), 1, fileList[minIndex] );
if( feof( fileList[minIndex] ) ){
numbers[minIndex] = -1;
}
}
for( i = 0; i < FILEAMOUNT; ++i ){
fclose( fileList[i] );
}
fclose( outFile );
Among them, the MinIndex function gets the subscript of the smallest value in the array, and it will skip when it encounters -1, I used number -1 as end of file. If the MinIndex function returns -1, it means that all files have been read.
int MinIndex( int *arr )
{
int i, index = -1;
for( i = 0; i < FILEAMOUNT; ++i ){
if( arr[i] == -1 ) continue;
if( index == -1 || arr[index] > arr[i] ) index = i;
}
return index;
}

Conclusion

According to the timer, it takes 240s to split + sort each block and 433s to merge. We can see that the hard disk IO performance is the bottleneck here.

Comments(0)

Continue with
to comment