
String Parsing in Python

String parsing is the process of taking raw text and extracting the meaningful pieces from it for further processing. It relies heavily on delimiters such as newline characters, spaces, and other characters to separate out the needed data. In this post I will walk you through parsing a string to extract the meaningful data from it and putting that data in a format suitable for later processing. The completed script is included at the end for your reference.

 

Unparsed String

Below is the unparsed string we will be extracting data from.

Interface    Identifier     Method  Domain  Status Fg Session ID
Gi1/0/19     c9b6.ac99.7e7b N/A     UNKNOWN Auth      0A0A050B0000003200473593
Gi1/0/6      5ed3.5db9.5452 N/A     UNKNOWN Auth      0A0A050B00000138049B7DF0
Gi1/0/10     245b.b8f0.3020 N/A     UNKNOWN Auth      0A0A050B00000028004734DD
Gi1/0/17     7237.f07d.19e6 N/A     UNKNOWN Auth      0A0A050B0000002F00473569
Gi1/0/1      24a1.e879.6db8 N/A     UNKNOWN Auth      0A0A050B0000044D8576847A
Gi1/0/12     22f5.37d4.b45a N/A     UNKNOWN Auth      0A0A050B0000002B00473538
Gi1/0/11     f271.b86a.df2e N/A     UNKNOWN Auth      0A0A050B00000027004734DA
Gi1/0/7      6b76.76bf.7561 N/A     UNKNOWN Auth      0A0A050B000000260047347F
Gi1/0/20     9984.ee4d.dd10 N/A     UNKNOWN Auth      0A0A050B0000003100473573
Gi1/0/13     3406.3adb.f423 N/A     UNKNOWN Auth      0A0A050B0000002C00473538
Gi1/0/18     b5c2.db87.996f N/A     UNKNOWN Auth      0A0A050B000000CECD520C0E
Gi1/0/14     9848.4d10.9195 N/A     UNKNOWN Auth      0A0A050B0000002D0047353B
Gi1/0/16     35cb.7af3.bdb6 N/A     UNKNOWN Auth      0A0A050B0000002A00473535
Gi1/0/3      25f6.13fb.b710 N/A     UNKNOWN Auth      0A0A050B00000024004733E8
Gi1/0/15     f736.612b.4f92 N/A     UNKNOWN Auth      0A0A050B0000003000473569
Gi1/0/9      ebdd.385f.093a N/A     UNKNOWN Auth      0A0A050B00000029004734DD

Session count = 16

Key to Session Events Blocked Status Flags:

  A - Applying Policy (multi-line status for details)
  D - Awaiting Deletion
  F - Final Removal in progress
  I - Awaiting IIF ID allocation
  N - Waiting for AAA to come up
  P - Pushed Session
  R - Removing User Profile (multi-line status for details)
  U - Applying User Profile (multi-line status for details)
  X - Unknown Blocker

 

 

Analyze the String

I find that before writing any code you first need to identify the data you wish to parse out, identify patterns within that data that you can use to extract what is needed, and decide what data structure to use to store the data so you can properly process it later.

 

The data I wish to extract from the string above is contained in the rows that look like the example below:

Gi1/0/19     c9b6.ac99.7e7b N/A     UNKNOWN Auth      0A0A050B0000003200473593

 

Looking for patterns, I see the following:

  • All of the lines I am looking for have 6 "columns" (the Fg column is blank in every row of this sample)
    • None of the values contain spaces, meaning I can interpret the spaces as separators between the columns
  • Each line of data starts with "Gi"
    • This allows me to key off of the beginning of the line in my parser to determine if I should extract data from it or ignore it
    • As a bonus, the header of the table and any other unwanted lines will not be interpreted as data
  • All data I need to extract per "entry" is located on a single line.
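
These observations can be checked quickly in an interpreter before writing the parser. Here is a minimal sketch; the variable names sample_line and columns are my own and only used for illustration.

```python
# One data row copied from the table above (variable name is illustrative).
sample_line = "Gi1/0/19     c9b6.ac99.7e7b N/A     UNKNOWN Auth      0A0A050B0000003200473593"

# str.split() with no arguments splits on any run of whitespace,
# so the padding between columns collapses into a single separator.
columns = sample_line.split()

print(len(columns))                  # 6
print(sample_line.startswith("Gi"))  # True
print(columns[0], columns[1])        # Gi1/0/19 c9b6.ac99.7e7b
```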

 

Lastly, I need to choose a data structure. Because this is a table and each column holds a certain "type" of data, I can store each entry of the table in a dictionary so I can key off of the column name to access the piece of data I want. Each line will be its own dictionary, and all of the dictionaries will be stored in a list so I can iterate over them.
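
To make the target shape concrete, here is a small hand-built sketch of that list of dictionaries. The two entries are copied from the table above, and the key names are simply the ones I chose for the columns of interest.

```python
# Two entries written by hand to show the target structure:
# one dictionary per table row, all collected in a list.
parsed_data = [
    {'interface': 'Gi1/0/19', 'identifier': 'c9b6.ac99.7e7b', 'domain': 'UNKNOWN', 'status': 'Auth'},
    {'interface': 'Gi1/0/6', 'identifier': '5ed3.5db9.5452', 'domain': 'UNKNOWN', 'status': 'Auth'},
]

# Each value is addressed by column name rather than by position.
for entry in parsed_data:
    print(entry['interface'], entry['status'])
```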

 

All in all, this will be pretty straightforward to parse with some general rules that exclude all unwanted data and accurately parse out the needed data.

 

Building a Parser

So now that we have identified the key components, we can begin writing some code. In this script I will assume that the data listed above is stored in a variable named data as a single string, not split or processed at all.

 

The first thing I will do is define the characters that our "lines of interest" could start with. In the example above they all start with "Gi", however they could have different beginnings.

line_beginnings = ['Fa', 'Gi', 'Te', 'Fo']

Next I define a list to store all of our parsed data in.

parsed_data = []

 

The next 6 lines are what parse the data. I will explain how they work below the code snippet.

for line in data.splitlines():
    if any([line.startswith(beginning) for beginning in line_beginnings]):
        line_split = line.split()
        parsed_data.append(
            {'interface': line_split[0], 'identifier': line_split[1], 'domain': line_split[3], 'status': line_split[4]}
        )
  1. Splits our string by newline character and begins iterating over each line one at a time.
  2. Checks if our line starts with any of the line beginnings that we defined. If it starts with any one of the line beginnings, any() returns True and the line will continue to be processed; otherwise the line is skipped.
  3. Splits the line on whitespace (split() with no arguments treats any run of consecutive spaces as a single separator)
  4. Lines 4-6 take the elements from the line that we want, store them into a dictionary with their corresponding keys, and append that dictionary to our parsed_data list.
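
As a side note, step 2 can also be written without any(), because str.startswith() accepts a tuple of prefixes. The sketch below is an alternative form of the same loop, not a replacement for the one in this post, and it uses a shortened copy of the sample data so it can run on its own.

```python
# A shortened copy of the sample so the snippet is self-contained.
data = '''Interface    Identifier     Method  Domain  Status Fg Session ID
Gi1/0/19     c9b6.ac99.7e7b N/A     UNKNOWN Auth      0A0A050B0000003200473593
Gi1/0/6      5ed3.5db9.5452 N/A     UNKNOWN Auth      0A0A050B00000138049B7DF0

  D - Awaiting Deletion
'''

# str.startswith() accepts a tuple of prefixes, so one call covers them all.
line_beginnings = ('Fa', 'Gi', 'Te', 'Fo')

parsed_data = []
for line in data.splitlines():
    if line.startswith(line_beginnings):
        line_split = line.split()
        parsed_data.append({
            'interface': line_split[0],
            'identifier': line_split[1],
            'domain': line_split[3],
            'status': line_split[4],
        })

print(len(parsed_data))  # 2
```

Note that the prefixes must be a tuple here; passing a list to startswith() raises a TypeError.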

 

At this point we have a list of dictionaries containing all of the data that we wish to extract. To verify, you can add another for loop to iterate over each dictionary and print it out.

for dictionary in parsed_data:
    print(dictionary)

If you run this parser script you will see the output below, which can be verified against the raw table above.

$ python parser_example.py
{'interface': 'Gi1/0/19', 'identifier': 'c9b6.ac99.7e7b', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/6', 'identifier': '5ed3.5db9.5452', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/10', 'identifier': '245b.b8f0.3020', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/17', 'identifier': '7237.f07d.19e6', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/1', 'identifier': '24a1.e879.6db8', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/12', 'identifier': '22f5.37d4.b45a', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/11', 'identifier': 'f271.b86a.df2e', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/7', 'identifier': '6b76.76bf.7561', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/20', 'identifier': '9984.ee4d.dd10', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/13', 'identifier': '3406.3adb.f423', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/18', 'identifier': 'b5c2.db87.996f', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/14', 'identifier': '9848.4d10.9195', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/16', 'identifier': '35cb.7af3.bdb6', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/3', 'identifier': '25f6.13fb.b710', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/15', 'identifier': 'f736.612b.4f92', 'domain': 'UNKNOWN', 'status': 'Auth'}
{'interface': 'Gi1/0/9', 'identifier': 'ebdd.385f.093a', 'domain': 'UNKNOWN', 'status': 'Auth'}

$
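
As one possible form of "later processing", the list can be turned into a lookup keyed by interface name. This is only a sketch: the name by_interface is hypothetical, and the two entries are copied from the output above so the snippet runs on its own.

```python
# Two entries copied from the parser output above.
parsed_data = [
    {'interface': 'Gi1/0/19', 'identifier': 'c9b6.ac99.7e7b', 'domain': 'UNKNOWN', 'status': 'Auth'},
    {'interface': 'Gi1/0/6', 'identifier': '5ed3.5db9.5452', 'domain': 'UNKNOWN', 'status': 'Auth'},
]

# Build a dictionary that maps each interface name to its full entry.
by_interface = {entry['interface']: entry for entry in parsed_data}

print(by_interface['Gi1/0/6']['identifier'])  # 5ed3.5db9.5452
```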

 

And that is it: a simple parser that is able to extract data from a raw string. The key takeaway here is to use rules in your parser that are as general as possible, so that they exclude unwanted data and include the wanted data regardless of minor formatting differences (number of spaces, case, etc).

For example, I could have counted the number of columns each line had in order to decide if I wanted to parse it, but if I did that, I would have a greater chance of hitting a false positive. The line beginnings not only point out the exact data that I want, they also implicitly exclude the data that I don't want.
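
The difference shows up even on this sample. The sketch below uses a shortened copy of the data and compares the two rules; the names by_count and by_prefix are just for illustration. Two of the flag descriptions in the key happen to split into 6 words, so the column-count rule would pick them up.

```python
# A shortened copy of the sample: one data row plus part of the flag key.
data = '''Interface    Identifier     Method  Domain  Status Fg Session ID
Gi1/0/19     c9b6.ac99.7e7b N/A     UNKNOWN Auth      0A0A050B0000003200473593

Key to Session Events Blocked Status Flags:

  D - Awaiting Deletion
  F - Final Removal in progress
  I - Awaiting IIF ID allocation
'''

line_beginnings = ('Fa', 'Gi', 'Te', 'Fo')

# Rule A: keep every line that splits into exactly 6 columns.
by_count = [line for line in data.splitlines() if len(line.split()) == 6]

# Rule B: keep every line that starts with a known interface prefix.
by_prefix = [line for line in data.splitlines() if line.startswith(line_beginnings)]

print(len(by_count))   # 3 (the data row plus the "F" and "I" key lines)
print(len(by_prefix))  # 1 (only the data row)
```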

 

Full Script

Below is the full script that I built while writing this article, in case you wish to copy and paste a known good example.

data = '''Interface    Identifier     Method  Domain  Status Fg Session ID
Gi1/0/19     c9b6.ac99.7e7b N/A     UNKNOWN Auth      0A0A050B0000003200473593
Gi1/0/6      5ed3.5db9.5452 N/A     UNKNOWN Auth      0A0A050B00000138049B7DF0
Gi1/0/10     245b.b8f0.3020 N/A     UNKNOWN Auth      0A0A050B00000028004734DD
Gi1/0/17     7237.f07d.19e6 N/A     UNKNOWN Auth      0A0A050B0000002F00473569
Gi1/0/1      24a1.e879.6db8 N/A     UNKNOWN Auth      0A0A050B0000044D8576847A
Gi1/0/12     22f5.37d4.b45a N/A     UNKNOWN Auth      0A0A050B0000002B00473538
Gi1/0/11     f271.b86a.df2e N/A     UNKNOWN Auth      0A0A050B00000027004734DA
Gi1/0/7      6b76.76bf.7561 N/A     UNKNOWN Auth      0A0A050B000000260047347F
Gi1/0/20     9984.ee4d.dd10 N/A     UNKNOWN Auth      0A0A050B0000003100473573
Gi1/0/13     3406.3adb.f423 N/A     UNKNOWN Auth      0A0A050B0000002C00473538
Gi1/0/18     b5c2.db87.996f N/A     UNKNOWN Auth      0A0A050B000000CECD520C0E
Gi1/0/14     9848.4d10.9195 N/A     UNKNOWN Auth      0A0A050B0000002D0047353B
Gi1/0/16     35cb.7af3.bdb6 N/A     UNKNOWN Auth      0A0A050B0000002A00473535
Gi1/0/3      25f6.13fb.b710 N/A     UNKNOWN Auth      0A0A050B00000024004733E8
Gi1/0/15     f736.612b.4f92 N/A     UNKNOWN Auth      0A0A050B0000003000473569
Gi1/0/9      ebdd.385f.093a N/A     UNKNOWN Auth      0A0A050B00000029004734DD

Session count = 16

Key to Session Events Blocked Status Flags:

  A - Applying Policy (multi-line status for details)
  D - Awaiting Deletion
  F - Final Removal in progress
  I - Awaiting IIF ID allocation
  N - Waiting for AAA to come up
  P - Pushed Session
  R - Removing User Profile (multi-line status for details)
  U - Applying User Profile (multi-line status for details)
  X - Unknown Blocker
'''

line_beginnings = ['Fa', 'Gi', 'Te', 'Fo']

parsed_data = []
for line in data.splitlines():
    if any([line.startswith(beginning) for beginning in line_beginnings]):
        line_split = line.split()
        parsed_data.append(
            {'interface': line_split[0], 'identifier': line_split[1], 'domain': line_split[3], 'status': line_split[4]}
        )

for dictionary in parsed_data:
    print(dictionary)

 

I hope this gave you some basic ideas about building parsers. Unfortunately, because each problem is so unique, there is rarely a "one size fits all" solution. The key is to focus on the methodology, not the exact mechanics.

 

 

 

 


