
Wikipedia parsing progress, and LINQ memory leaks and performance


I finally found the memory leak in my Wikipedia parsing program. It was a quirk of the LINQ database layer, which was keeping things in memory. The program is now happily processing 100 articles every 0.4 seconds, a huge improvement over the 10 seconds per 100 articles it was taking yesterday morning. Occasionally it speeds up and processes 100 articles in only 0.1 seconds. And because all that extra memory is no longer being held, the program does not gradually slow down either.

How I structured the program:

  • One thread parses the XML file and sends its results to a background thread.
  • The background thread does some extra parsing and then queues the objects, ready for the database insert.
  • Two database insert threads each grab 250 articles at a time from the queue and batch insert them into the SQL Express database.

The XML parsing thread occasionally sleeps because it is much faster than the database threads. I don't want more than about 2500 articles sitting in memory waiting to be written to the database.
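The pipeline above can be sketched roughly as follows. This is a minimal Python analogy, not the original .NET code: the function names, the sentinel object, and the in-memory `store` that stands in for the database are all hypothetical. It also swaps the "parser sleeps occasionally" approach for bounded queues, which make the faster thread block automatically once the limit is reached. The size of the first queue is an assumption; only the 250 and 2500 figures come from the description above.

```python
import queue
import threading

BATCH_SIZE = 250       # articles per database batch (figure from the post)
MAX_PENDING = 2500     # cap on articles waiting for insert (figure from the post)
PARSED_BUFFER = 100    # assumed small buffer between the two parsing stages
STOP = object()        # sentinel telling a downstream thread to shut down


def parse_xml(raw_articles, parsed_q):
    """Stage 1: stand-in for the XML parsing thread."""
    for raw in raw_articles:
        parsed_q.put(raw)            # blocks while the queue is full
    parsed_q.put(STOP)


def extra_parse(parsed_q, insert_q, n_inserters):
    """Stage 2: stand-in for the extra parsing in the background thread."""
    while True:
        item = parsed_q.get()
        if item is STOP:
            break
        insert_q.put(item.strip().title())   # placeholder for real parsing work
    for _ in range(n_inserters):
        insert_q.put(STOP)           # one sentinel per insert thread


def insert_worker(insert_q, store, lock):
    """Stage 3: gather up to BATCH_SIZE articles, then write them in one go."""
    done = False
    while not done:
        batch = []
        while len(batch) < BATCH_SIZE:
            item = insert_q.get()
            if item is STOP:
                done = True
                break
            batch.append(item)
        if batch:
            with lock:               # hypothetical stand-in for a batch insert
                store.append(batch)


def run_pipeline(raw_articles, n_inserters=2):
    """Run all stages to completion and return the list of written batches."""
    parsed_q = queue.Queue(maxsize=PARSED_BUFFER)
    insert_q = queue.Queue(maxsize=MAX_PENDING)
    store, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=parse_xml, args=(raw_articles, parsed_q)),
        threading.Thread(target=extra_parse, args=(parsed_q, insert_q, n_inserters)),
    ]
    threads += [
        threading.Thread(target=insert_worker, args=(insert_q, store, lock))
        for _ in range(n_inserters)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

Calling `run_pipeline([" article 1 ", " article 2 "])` returns the batches that the insert workers wrote. With bounded queues there is no need to tune a sleep interval, at the cost of the parser stalling whenever the database threads fall behind.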

At the program's current speed, it might take about 400 minutes to process the entire Wikipedia dump, down from maybe 6 days for the version of the program I had yesterday morning.
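As a rough sanity check on those numbers, the arithmetic can be written out as below. The total article count is not stated anywhere in this post; the value here is only what "400 minutes" implies at the new rate, so treat it as an assumption rather than a measured figure.

```python
def articles_per_second(articles, seconds):
    """Throughput for a measured batch of articles."""
    return articles / seconds


old_rate = articles_per_second(100, 10.0)   # yesterday morning: about 10 articles/s
new_rate = articles_per_second(100, 0.4)    # now: about 250 articles/s
speedup = new_rate / old_rate               # roughly 25 times faster

# Assumption: dump size implied by 400 minutes at the new rate (not from the post).
implied_articles = new_rate * 400 * 60

# Time the old version would need for the same number of articles, in days.
old_days = implied_articles / old_rate / 86400

print(f"speedup: {speedup:.0f}x, implied articles: {implied_articles:,.0f}, "
      f"old version: {old_days:.1f} days")
```

The old-version estimate comes out a little under 7 days, which is in the same range as the "maybe 6 days" guess.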

Some lessons learned:

  • It is cheap to dispose of a LINQ DataContext and create a new one for each batch of database access. This gets rid of the memory that was hanging around. This might not hold when connecting to a remote database, where the time to establish a new SQL Server connection might be significant; this needs to be tested. It is possible that multiple database threads could make up for it.
  • Creating Regex objects takes a lot of time. Once a regular expression is proven, move the Regex out of the loop into a global variable with RegexOptions.Compiled set. This alone resulted in a 50% speedup of my program.
  • Despite using four threads, the program still only uses 12%-20% of an i7 processor, but the disk drive light is now on solid from all the database inserts. Occasionally the CPU usage jumps to 36% for short bursts, and the program seems to process records faster. I am not sure why I can't keep the speed consistently that high. More analysis could be done in the XML file parsing thread, but extra analysis results would increase the amount of data that has to be sent to the database, so it is a tradeoff.
  • Having background threads for long-running processing allows the UI to stay responsive.
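The first two lessons can be illustrated with a small Python analogy. It is a sketch under assumptions, not the original code: `sqlite3` stands in for SQL Express and LINQ, and the table, the link pattern, and the function names are all hypothetical. Note also that Python's `re` module caches compiled patterns internally, so hoisting the pattern out of the loop matters less in Python than the 50% reported for .NET; the example only shows the shape of the change.

```python
import contextlib
import os
import re
import sqlite3
import tempfile

# Compiled once at module level instead of inside the per-article loop.
# Hypothetical pattern: matches [[Target]] and [[Target|label]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def extract_links(text):
    """Return the link targets found in one article's wikitext."""
    return WIKI_LINK.findall(text)


def insert_in_batches(db_path, articles, batch_size=250):
    """Insert (title, link_count) rows using one short-lived connection per batch."""
    for start in range(0, len(articles), batch_size):
        batch = articles[start:start + batch_size]
        rows = [(title, len(extract_links(text))) for title, text in batch]
        # Fresh connection for this batch, closed as soon as it is written,
        # so no per-connection state accumulates across batches.
        with contextlib.closing(sqlite3.connect(db_path)) as conn:
            with conn:  # commits on success, rolls back on error
                conn.executemany(
                    "INSERT INTO article (title, link_count) VALUES (?, ?)", rows
                )


def demo():
    """Insert 600 fake articles into a throwaway database and return the totals."""
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        with contextlib.closing(sqlite3.connect(db_path)) as conn:
            with conn:
                conn.execute("CREATE TABLE article (title TEXT, link_count INTEGER)")
        articles = [(f"Article {i}", "See [[Foo]] and [[Bar|baz]].") for i in range(600)]
        insert_in_batches(db_path, articles)
        with contextlib.closing(sqlite3.connect(db_path)) as conn:
            return conn.execute(
                "SELECT COUNT(*), SUM(link_count) FROM article"
            ).fetchone()
    finally:
        os.remove(db_path)
```

Whether a connection per batch stays cheap against a remote server is exactly the open question raised in the first lesson, so it is worth timing before relying on it.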

A couple of useful articles:

This concerns revision 484 of the [wminfra:WikipediaToConfluence] project.

Note: A PC can still become quite unusable over time if a program is running and doing a lot of updates to a database on the same machine. This can resemble a memory leak, but it doesn't show up in Task Manager. The machine may become almost unusable: menus fail to pop up, programs report strange errors, and so on. Restarting the SQL Server service restores the machine to proper operation. If SQL Server is not limited in the amount of memory it can use, it will try to use all the memory in the machine, but it uses it in a way that doesn't show up in Task Manager.

Labels: wikipedia, linq, performance