Reading (Parsing) XML

ElementTree started out as an external library and has been part of the Python standard library since Python 2.5.

Sample XML data ( http://goo.gl/VAWy4t )

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Loading XML: from a file

import xml.etree.ElementTree as ET

# Parse the XML file (file_name is the path to an XML file on disk)
tree = ET.parse(file_name)

# Get the root element
root = tree.getroot()
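
The file-loading steps above can be tried end to end with the following minimal sketch. The file name and the XML content are made up for illustration, and the file is written to a temporary directory so the snippet runs anywhere.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample content; any well-formed XML file is handled the same way.
sample = '<data><country name="Panama"><rank>68</rank></country></data>'

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, "country_data.xml")
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(sample)

    # ET.parse returns an ElementTree; getroot() returns its root Element.
    tree = ET.parse(file_name)
    root = tree.getroot()

# The parsed tree lives in memory, so it stays usable after the file is gone.
root_tag = root.tag
first_child_name = root[0].get("name")
print(root_tag, first_child_name)
```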

Loading XML: from a string

root = ET.fromstring(country_data_as_string)

Structure of an XML tag

tag : the name of the tag
text : the text inside the tag
attrib : the node's attribute map (key, value)
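
As a small illustration of these three properties, the sketch below parses two elements taken from the sample data and reads tag, text, and attrib from each. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

rank = ET.fromstring("<rank>1</rank>")
neighbor = ET.fromstring('<neighbor name="Austria" direction="E"/>')

print(rank.tag)       # rank
print(rank.text)      # 1
print(rank.attrib)    # {} (no attributes)

print(neighbor.tag)   # neighbor
print(neighbor.text)  # None (an empty element has no text)

# attrib behaves like a dict of attribute names to string values
neighbor_attrs = dict(neighbor.attrib)
print(neighbor_attrs["name"], neighbor_attrs["direction"])
```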

Test source

import xml.etree.ElementTree as ET

country_data_as_string = """<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
"""

root = ET.fromstring(country_data_as_string)
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    print(name, rank)

Finding specific tags

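# Find the first direct child of root whose tag is "country". Returns None if there is no match.
country_tag = root.find("country")

# Return every direct child of root whose tag is "country" as a list.
country_tags = root.findall("country")

# Find the first direct child of root whose tag is "country" and return its text.
country_text = root.findtext("country")

# findtext is nearly the same as find().text, but when nothing matches
# findtext returns None while find().text raises AttributeError.
country_text = root.find("country").text

# year is a child of country, not of root, so this returns an empty list.
year_tags = root.findall("year")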
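
The differences between these lookups are easier to see on concrete data. The following is a minimal sketch on a trimmed-down copy of the sample document; the "continent" tag is deliberately one that does not exist, and ".//year" is the XPath-style path ElementTree accepts for searching all descendants.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    '<country name="Liechtenstein"><rank>1</rank><year>2008</year></country>'
    '<country name="Singapore"><rank>4</rank><year>2011</year></country>'
    "</data>"
)

first = root.find("country")
first_name = first.get("name")        # first matching direct child
missing = root.find("continent")      # no such tag, so None

direct_years = root.findall("year")   # year is not a direct child of root
all_years = [y.text for y in root.findall(".//year")]  # search all descendants

rank_text = first.findtext("rank")
missing_text = root.findtext("continent", default="n/a")  # default when no match

print(first_name, missing, direct_years, all_years, rank_text, missing_text)
```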

Printing text and attributes after finding specific tags

>>> for country in root.findall('country'):
...   rank = country.find('rank').text
...   name = country.get('name')
...   print(name, rank)
...

Liechtenstein 1
Singapore 4
Panama 68

Finding elements of interest

# Even from the root tag, iter("neighbor") visits every neighbor tag at any depth
for neighbor in root.iter("neighbor"):
    print(neighbor.attrib)

Iterating over children 1

# root.iter() visits root itself and then every descendant, in document order
for child in root.iter():
    print(child.tag)

# Print the tag name of every country tag under root
for country in root.iter("country"):
    print(country.tag)
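
Iterating with iter() and iterating over the element itself do not visit the same nodes. As a rough sketch on a tiny made-up document: iter() walks the whole subtree starting at the element, while a plain for loop yields only direct children.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<country><rank>1</rank></country>"
    "<country><rank>4</rank></country>"
    "</data>"
)

# Whole subtree, including root itself, in document order
all_tags = [el.tag for el in root.iter()]

# Direct children only
child_tags = [child.tag for child in root]

print(all_tags)
print(child_tags)
```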

Iterating over children 2

>>> for child in root:
...     print(child.tag, child.attrib)
... 

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}

Accessing children by index

>>> root[0][1].text
'2008'
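
Elements behave like sequences of their children, so len(), negative indexes, and slices work as well. A short sketch on a trimmed-down copy of the sample data (variable names are illustrative):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    '<country name="Liechtenstein"><rank>1</rank><year>2008</year></country>'
    '<country name="Panama"><rank>68</rank><year>2011</year></country>'
    "</data>"
)

country_count = len(root)                    # number of direct children
first_year = root[0][1].text                 # second child of the first country
last_name = root[-1].get("name")             # negative indexes count from the end
first_tags = [el.tag for el in root[0][:2]]  # a slice returns a list of elements

print(country_count, first_year, last_name, first_tags)
```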

Printing everything

# For every country
for country in root.iter("country"):
    print("=" * 60)

    # Print the country's name attribute
    print("Country : ", country.attrib["name"])

    # Print the country's child "rank"
    print("Rank : ", country.findtext("rank"))

    # Print the country's child "year"
    print("Year : ", country.findtext("year"))

    # Print every "neighbor" child of the country
    for neighbor in country.iter("neighbor"):
        # Print the neighbor's attribute map
        print("Neighbor : ", neighbor.attrib)

Editing XML

ElementTree provides a simple way to modify an XML document: change the Element objects in memory, then save the result with the ElementTree.write() method.

Modifying tags 1

>>> for rank in root.iter('rank'):
...   new_rank = int(rank.text) + 1
...   rank.text = str(new_rank)
...   rank.set('updated', 'yes')
...
>>> tree.write('output.xml')

Modifying tags 2

for rank in root.iter("rank"):
    # Convert the current rank value to an integer, add 1, and store it in a variable
    new_rank = int(rank.text) + 1
    # Write it back as the text of the rank tag
    rank.text = str(new_rank)
    # Then add the attribute {"updated": "yes"} to the rank tag
    rank.attrib["updated"] = "yes"
    # The attribute can also be added like this:
    # rank.set("updated", "yes")

# dump prints the given tag and everything below it
ET.dump(root)
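
To inspect an edit as a value rather than printing it, the tree can be serialized to a string. The following is a minimal sketch with a made-up one-element document; ET.tostring with encoding="unicode" returns a str instead of bytes.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<data><rank>1</rank></data>")

for rank in root.iter("rank"):
    rank.text = str(int(rank.text) + 1)  # bump the rank by one
    rank.set("updated", "yes")           # mark the element as changed

# Serialize the modified tree to a str
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```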

Removing tags

for country in root.findall('country'):
    rank = int(country.find('rank').text)
    # If rank is greater than 50
    if rank > 50:
        # remove that tag
        root.remove(country)

# Print the result
ET.dump(root)
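
The loop above iterates over the list returned by findall rather than over root directly. Removing children from an element while iterating over that same element can skip items, so collecting the matches into a list first is the safer pattern. A small sketch with made-up country names:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    '<country name="A"><rank>1</rank></country>'
    '<country name="B"><rank>68</rank></country>'
    '<country name="C"><rank>70</rank></country>'
    "</data>"
)

# findall returns a separate list, so removing from root inside the loop is safe
for country in root.findall("country"):
    if int(country.findtext("rank")) > 50:
        root.remove(country)

remaining = [c.get("name") for c in root]
print(remaining)
```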

Adding tags

import datetime

# Use one of the two methods below; running both adds two last_updated tags.

# Method 1 : ElementTree.Element + append
# For every country...
for country in root.iter('country'):
    e = datetime.datetime.now()
    # Create a last_updated element
    last_updated = ET.Element("last_updated")
    # Set the text of last_updated
    last_updated.text = str(e)
    # Then append the last_updated element to the country tag as a child
    country.append(last_updated)

# Method 2 : ElementTree.SubElement
for country in root.findall('country'):
    e = datetime.datetime.now()
    # Create last_updated as a sub-element of country
    last_updated = ET.SubElement(country, "last_updated")
    # Set its text
    last_updated.text = str(e)

ET.dump(root)
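
As a compact check of what adding a tag produces, the sketch below uses SubElement on a made-up one-country document. The timestamp is a fixed, hypothetical string rather than datetime.now() so that the result is reproducible.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<data><country name="Panama"/></data>')
country = root.find("country")

# Fixed, hypothetical timestamp used only to keep the output deterministic
stamp = "2024-01-01T00:00:00"

# SubElement creates the element and attaches it to country in one step
last_updated = ET.SubElement(country, "last_updated")
last_updated.text = stamp

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```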

Writing to a file 1

>>> for rank in root.iter('rank'):
...   new_rank = int(rank.text) + 1
...   rank.text = str(new_rank)
...   rank.set('updated', 'yes')
...
>>> tree.write('output.xml')

Writing to a file 2

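import xml.etree.ElementTree as ET

# Parse the XML file
doc = ET.parse(file_name)

# Get the root element
root = doc.getroot()

# Modify or remove XML elements here, and so on

# The first argument is the name of the output file
# encoding : the encoding of the output XML file
# xml_declaration : True
# With xml_declaration=True the XML declaration header
# <?xml version='1.0' encoding='utf-8'?> is always written to the file.
# With the default (None) it is written only when the encoding is
# something other than US-ASCII, UTF-8, or unicode.
doc.write("output.xml", encoding="utf-8", xml_declaration=True)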
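
The effect of xml_declaration can be checked without touching the disk, since write() also accepts a file object. The sketch below writes a made-up one-element tree into in-memory buffers, once with the declaration and once with the default setting.

```python
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.fromstring("<data><rank>1</rank></data>"))

# xml_declaration=True: the declaration line is always written
with_decl = io.BytesIO()
tree.write(with_decl, encoding="utf-8", xml_declaration=True)

# Default (None) with UTF-8: no declaration is written
without_decl = io.BytesIO()
tree.write(without_decl, encoding="utf-8")

with_decl_bytes = with_decl.getvalue()
without_decl_bytes = without_decl.getvalue()
print(with_decl_bytes)
print(without_decl_bytes)
```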

References

Documentation : https://docs.python.org/2/library/xml.etree.elementtree.html
Tutorial : http://effbot.org/zone/element-index.htm
lxml module : http://lxml.de/
XPath : http://www.w3.org/TR/xpath/
