Finally found the memory leak in my Wikipedia parsing. It was a quirk of the LINQ database code keeping objects in memory. The program is now happily processing 100 articles every 0.4 seconds, about 25 times faster than the 10 seconds per 100 articles it was taking yesterday morning. Occasionally it speeds up and processes 100 articles in only 0.1 seconds. And with all that extra memory no longer being held, the program does not gradually slow down either.
How I structured the program:
- There is now one thread parsing the XML file, sending its results to a background thread.
- The background thread does some extra parsing, and then queues the objects ready for the database insert.
- Two database insert threads grab 250 articles at a time from the queue and batch insert them into the SQL Express database.
The XML parsing thread occasionally sleeps because it is much faster than the database threads. I don't want more than about 2500 articles sitting in memory waiting to be dumped into the database.
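The thread layout above can be sketched roughly as follows. This is a Python illustration, not the C# code of the actual program: the function names, the fake article data, and the `insert_batch` callback are all made up for the example. It also swaps the explicit sleep for bounded queues, which block the parser once about 2500 articles are waiting.

```python
import queue
import threading

BATCH_SIZE = 250     # articles per batch insert (figure from the post)
MAX_PENDING = 2500   # cap on articles waiting for the database (figure from the post)
NUM_WRITERS = 2      # database insert threads (figure from the post)
STOP = object()      # sentinel telling a consumer to shut down


def parse_xml(n_articles, parsed_q):
    """Stand-in for the XML parsing thread: emits fake article titles."""
    for i in range(n_articles):
        parsed_q.put(f"Article {i}")  # blocks while the hand-off queue is full
    parsed_q.put(STOP)


def extra_parsing(parsed_q, insert_q):
    """Stand-in for the background thread that does the extra parsing."""
    while True:
        item = parsed_q.get()
        if item is STOP:
            for _ in range(NUM_WRITERS):  # one sentinel per writer thread
                insert_q.put(STOP)
            return
        insert_q.put({"title": item, "length": len(item)})  # blocks at MAX_PENDING


def db_writer(insert_q, insert_batch):
    """Collect up to BATCH_SIZE articles, then hand them to insert_batch."""
    batch = []
    while True:
        item = insert_q.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            insert_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        insert_batch(batch)


def run_pipeline(n_articles, insert_batch):
    """Wire the threads together and wait for all of them to finish."""
    parsed_q = queue.Queue(maxsize=BATCH_SIZE)   # small hand-off buffer
    insert_q = queue.Queue(maxsize=MAX_PENDING)  # the "about 2500" cap
    threads = [
        threading.Thread(target=parse_xml, args=(n_articles, parsed_q)),
        threading.Thread(target=extra_parsing, args=(parsed_q, insert_q)),
    ]
    threads += [
        threading.Thread(target=db_writer, args=(insert_q, insert_batch))
        for _ in range(NUM_WRITERS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `run_pipeline(1000, some_function)` would invoke `some_function` with lists of at most 250 article dictionaries. Because two writer threads call it concurrently, whatever it does has to be thread-safe.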
At the program's current speed, it might take about 400 minutes to process the entire Wikipedia dump, down from maybe 6 days for the version of the program from yesterday morning.
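As a rough sanity check on those numbers, the arithmetic can be written out. This is only a back-of-the-envelope Python sketch: the article count is not measured, it is just what 400 minutes at 0.4 seconds per 100 articles implies, and the function name is made up for the example.

```python
def estimated_days(total_articles, seconds_per_100):
    """Days needed to process total_articles at the given pace per 100 articles."""
    return total_articles / 100 * seconds_per_100 / 86400  # 86400 seconds per day


# Article count implied by the post's figures: 400 minutes at 0.4 s per 100 articles.
implied_articles = round(400 * 60 / 0.4 * 100)

# Pace from yesterday morning: 10 seconds per 100 articles.
old_days = estimated_days(implied_articles, 10)
```

Under these assumptions the old pace works out to a little under 7 days, which is in the same ballpark as the "maybe 6 days" estimate.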
Some lessons learned:
- It is cheap to dispose of a LINQ DataContext and create a new one for each batch of database access, and doing so gets rid of the memory hanging around. This might not hold for a remote database: the time to establish a new SQL Server connection to a remote server might be significant, so this needs to be tested. It is possible that multiple database threads could make up for it.
- Creating Regex objects takes a lot of time. Once a regular expression is proven, move it out of the loop into a static field (effectively a global variable) constructed with RegexOptions.Compiled. This alone gave a 50% speedup of my program.
- Despite using four threads, this program still only uses 12%-20% of an i7 processor. But now the disk drive light is on solid from all the database inserts. Occasionally the CPU usage jumps to 36% for short bursts, and the program seems to process records faster. I am not sure why I can't get the speed up consistently. More analysis could be done on the XML file parsing thread, but extra analysis results would increase the amount of data that has to be sent to the database, so it is a tradeoff.
- Having background threads for long running processing allows the UI to stay responsive.
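The regular expression lesson looks roughly like this. It is a Python sketch rather than the program's C# code, and the wiki-link pattern and function name are hypothetical. Python's `re` module caches compiled patterns internally, so the gain there is likely smaller than what I measured in .NET. The point is only the structure: compile once, outside the loop.

```python
import re

# Compiled once at module level instead of inside the per-article loop.
# Hypothetical pattern: matches [[Target]] or [[Target|label]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def link_targets(articles):
    """Return the list of link targets for each article text."""
    return [WIKI_LINK.findall(text) for text in articles]
```

For example, `link_targets(["See [[Foo]] and [[Bar|the bar]]."])` returns `[["Foo", "Bar"]]`.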
A couple of useful articles:
- LINQ - My various notes on LINQ, such as Answers to Miscellaneous questions about LINQ, Troubleshooting memory leaks in LINQ
- My experience 6/26/2010: Noticed in the WikipediaToConfluence application, which is written with background threads that each create their own SQLMetal-generated DataContext and recreate a new DataContext every hundred records or so. Even without ever-increasing memory usage, the Windows Vista system eventually gets very slow and appears to run out of resources. Menus cannot be selected in Visual Studio and other things requiring memory allocations fail, even though the system says it has 1-2GB of free RAM after the application has been running for an extended time.
- http://stackoverflow.com/questions/123057/how-do-i-avoid-a-memory-leak-with-linq-to-sql - DataContext should have short lifetime, and be recreated often.
- Error handling for primary key violations: learn how to roll back after a SubmitChanges failure.
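The short-lifetime and rollback ideas from the last two notes can be sketched together. This is not LINQ to SQL: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, with a fresh connection per batch standing in for a fresh DataContext, and the table and function names are made up for the example.

```python
import sqlite3
from contextlib import closing


def init_db(db_path):
    """Create the hypothetical article table if it does not exist yet."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS article "
                "(title TEXT PRIMARY KEY, body TEXT)"
            )


def insert_batch(db_path, batch):
    """Insert one batch of (title, body) rows on its own short-lived connection.

    Returns True if the batch was committed, False if a primary key
    violation caused the whole batch to be rolled back.
    """
    with closing(sqlite3.connect(db_path)) as conn:  # closed after every batch
        try:
            with conn:  # commits on success, rolls back if an exception escapes
                conn.executemany(
                    "INSERT INTO article (title, body) VALUES (?, ?)", batch
                )
            return True
        except sqlite3.IntegrityError:
            return False
```

A batch that contains a duplicate title is rolled back as a whole, so the caller can decide whether to retry its rows one at a time or to skip the batch.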
This concerns revision 484 of the [wminfra:WikipediaToConfluence] project.
Note: A PC can still become quite unusable over time if a program is running and doing a lot of updates to a database on the same machine. This can resemble a memory leak, but it doesn't show up in Task Manager. The machine may be almost unusable: menus fail to pop up, programs report strange errors, etc. Restarting the SQL Server service (stopping it and starting it again) will restore the machine to proper operation. If SQL Server is not limited in the amount of memory it can use (its max server memory setting), it will try to use all the memory in the machine, but it uses it in a way that doesn't show up in Task Manager.