WebFaction
Community site

To initialize one of my applications, I need to read over a large (~78MB) XML file and populate a database table. This XML file has about 300,000 entries of interest. I've put the initialization code in my application's __init__.py file. The transaction management is rather crude, but it seemed to work OK during development.

from django.db import connection, transaction

@transaction.commit_manually
def init():
    """
    Read data.xml and populate the table
    """
    from models import Entry

    # If an entry exists already, our task is (presumably) done.
    if Entry.objects.exists():
        transaction.commit()
        return

    # Otherwise, dig in.
    import os
    import xml.etree.ElementTree as ET
    data_path = '/home/.../webapps/.../data.xml'

    # ... gunk ... #

    c = connection.cursor()

    # These are used to track the size of the query and the number of
    # 'insert' statements since the last transaction.
    #  data_count:   number of items per insert
    #  insert_count: number of inserts per commit

    data_count = 0
    insert_count = 0

    query_string = '''INSERT INTO dict_entry
                        (word, data)
                         VALUES (%s,%s)'''
    query_list = []

    context = ET.iterparse(data_path)
    for event, elem in context:
        if elem.tag == 'entry':
            # -- Handle the element -- #
            word = elem.find('word').text
            data = ET.tostring(elem)
            query_list.append((word, data))

            # Insert as necessary.
            data_count += 1
            if data_count == 10:
                data_count = 0
                insert_count += 1

                c.executemany(query_string, query_list)
                query_list = []
                if insert_count == 200:
                    transaction.commit()
                    insert_count = 0

            # -- Dispose of the element -- #
            elem.clear()

            # # The lines below, mysteriously, don't work.
            # while elem.getprevious() is not None:
            #    del elem.getparent()[0]
    del context

    # Wrap up.
    c.executemany(query_string, query_list)
    transaction.commit()

While executing, this code peaked at a whopping 242MB of memory, more than three times the size of the input file! [Edit: and the process persisted until I manually killed it.]

PID   RSS COMMAND
17872 242372 /home/.../webapps/.../httpd.worker -f /home/.../webapps/.../httpd.conf -k start
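
For measuring a loader like this from inside Python rather than from `ps`, one option is the standard-library `tracemalloc` module. This is only a sketch under assumptions: it needs Python 3.4 or later (newer than the setup in this thread), it counts only memory allocated through Python's own allocator, so the figure will be lower than the RSS column above, and `peak_python_memory` and `build_rows` are hypothetical names used for illustration.

```python
import tracemalloc


def peak_python_memory(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_rows(n):
    # Hypothetical stand-in for a loader that keeps every row in memory.
    return [("word%d" % i, "<entry/>") for i in range(n)]


rows, peak = peak_python_memory(build_rows, 10000)
print("rows: %d, peak: %.1f KiB" % (len(rows), peak / 1024.0))
```

Running the real loader on a small slice of the input and then on a larger one shows whether the peak grows with input size, which would point at something accumulating per entry.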

Thankfully, I'll never need to run this function again. But I might need to run functions like it, especially if I'm parsing more XML in the future. Quite simply, my question is: why is this happening? This is what I've considered so far:

  • DEBUG = True. Nope, that wasn't the issue. DEBUG is and was False.
  • Bad transaction management. I was using commit_on_success earlier, but the process definitely used the version above, which uses commit_manually and commits after every 200 inserts of 10 entries each (= every 2,000 entries).

Otherwise, I don't know what could be causing this. I'd really appreciate some help. Thanks for reading!

asked 18 Oct '11, 03:14

phokai

edited 18 Oct '11, 03:31


It's hard to say why the process is taking so much memory. There could be a memory leak somewhere in the code. I'd recommend reading our docs on reducing memory consumption, if you haven't already done so.

Our system kills a process that is consuming too much memory once it has been running for more than 5 minutes. So if you only need to run this once and it takes less than 5 minutes, it should be OK.
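
One candidate worth ruling out in the question's loop: `elem.clear()` empties each `entry`, but the emptied element stays attached to the root, so the tree that `iterparse` builds still gains one node per entry (about 300,000 here). Whether that accounts for the full 242 MB is not established. The commented-out `getprevious()` and `getparent()` calls belong to lxml's API, not to the standard-library `xml.etree.ElementTree`, which would explain why they fail with this import. A minimal sketch of clearing the root instead, assuming Python 3 and that every `entry` is a direct child of the root; `iter_entries` and the sample document are hypothetical:

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield (word, xml_bytes) for each <entry>, releasing it afterwards."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it.
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            word = elem.find("word").text
            data = ET.tostring(elem)
            yield word, data
            # Empty the finished entry, then detach everything parsed so far
            # from the root so the partial tree does not keep growing.
            elem.clear()
            root.clear()


# Hypothetical two-entry document standing in for data.xml.
sample = io.BytesIO(
    b"<dict><entry><word>a</word></entry><entry><word>b</word></entry></dict>"
)
words = [word for word, _data in iter_entries(sample)]
print(words)
```

Because the function is a generator, the batching and commit logic can consume it without ever holding more than the current batch of entries.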

answered 18 Oct '11, 03:59

todork

Call me naive, but I didn't realize that Python could have memory leaks. What sorts of code could cause memory leaks?

(18 Oct '11, 10:29) phokai

That's a broad question, so here is a broad answer: Python memory leaks :)

(18 Oct '11, 12:57) seanf ♦♦

I had already tried that exact query and found three possible causes, according to the first result:

"""
  1. some low level C library is leaking
  2. your Python code have global lists or dicts that grow over time, and you forgot to remove the objects after use
  3. there are some reference cycles in your app
"""

I was hoping for some advice more specific to Django development, but perhaps there isn't any.

(18 Oct '11, 19:43) phokai
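
The third cause in the quoted list, reference cycles, can be illustrated in a few lines. This is a sketch assuming CPython; the `Node` class and `make_cycle` helper are hypothetical. Two objects that point at each other are never freed by reference counting alone, and they stay in memory until the cycle collector in the standard `gc` module runs.

```python
import gc
import weakref


class Node(object):
    """Hypothetical object that can point at another Node."""

    def __init__(self, name):
        self.name = name
        self.partner = None


def make_cycle():
    a, b = Node("a"), Node("b")
    a.partner, b.partner = b, a  # a -> b -> a: a reference cycle
    return weakref.ref(a)        # weak reference, to observe a's lifetime


gc.disable()                         # keep the automatic collector out of the demo
probe = make_cycle()
alive_before = probe() is not None   # refcounts never reach zero, so still alive
collected = gc.collect()             # the cycle detector finds and frees the pair
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after, collected)
```

In a long-running process the collector normally runs on its own, so cycles tend to delay the release of memory rather than leak it permanently; calling `gc.collect()` by hand is mainly useful for checking whether cycles are what is holding the memory.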
question asked: 18 Oct '11, 03:14

question was seen: 2,650 times

last updated: 18 Oct '11, 19:43
