Jan. 30, 2021

Read and write lists with Pickle, Array and Pandas

List data needs to be serialized before it can be written to a file. Simple uniform data can be stored using arrays or basic text read and write, and multi-column data is handled efficiently with Pandas. Pickled files provide temporary storage for data, but are not suited to database applications.


Pickle and store list to file

Converting a list into serial flat data for storage.

Py3: Pickling for files

import pickle

original = [1,2,3,[3,4,5,[6,7,8]]]

#   write pickled data to file
with open('piclist.pkl', 'wb') as f:
    pickle.dump(original, f)

#   reload pickled data from file
with open('piclist.pkl', 'rb') as f:
    loaded = pickle.load(f)

print(original, id(original))
print(loaded, id(loaded))
#= [1, 2, 3, [3, 4, 5, [6, 7, 8]]] 43392328
#  [1, 2, 3, [3, 4, 5, [6, 7, 8]]] 43393864
#   data and structure are restored
#   loaded is a new, independent object (equivalent to a deep copy)

Notes: Serializing or flattening

In order to store a list in a file, it needs to be flattened into a serial format. Pickle writes the serialized data in a binary format, along with the structure needed to recreate the original list. The file format is not secure: a pickled file can be tampered with, and loading a tampered file can execute arbitrary code, so only unpickle data from a trusted source. For the same reason it should not be used as a database. However, a pickled file can be used as temporary storage for very large lists. Shelve builds on pickle, and stores pickled values under string keys in a dictionary-like file. Marshal can also store lists, but only for simple structures of built-in types, and its format may change between Python versions. For temporary storage, pickle is widely supported and backward compatible. The __reduce__ and __reduce_ex__ methods let custom classes control how they are pickled.
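
As a rough illustration of the shelve remark above, the sketch below stores the same nested list under a string key and reads it back. The file name 'shelflist', the key 'original' and the temporary folder are assumptions made for this example only.

```python
import os
import shelve
import tempfile

original = [1, 2, 3, [3, 4, 5, [6, 7, 8]]]

#   hypothetical file name, placed in a temporary folder for this sketch
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'shelflist')

#   store the list under a string key; shelve pickles the value
with shelve.open(path) as db:
    db['original'] = original

#   reopen the shelf and look the list up by its key
with shelve.open(path) as db:
    loaded = db['original']

print(loaded == original)   #= True, same content
print(loaded is original)   #= False, a separate object
```

Since shelve relies on pickle for the stored values, the same caution about loading files from untrusted sources applies.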

Numeric list to binary encoded file with array

Homogeneous numeric data written to file using arrays.

Py3: Binary read/write using arrays

import array

original = [1, 2, 3, 4, 5]

#   list to array: signed short
arr_org = array.array('h', original)

#   save array
with open('numbers.byt', 'wb') as f:
    arr_org.tofile(f)

#   load array, first 3 elements only
arr_rel = array.array('h')
with open('numbers.byt', 'rb') as f:
    arr_rel.fromfile(f, 3)

#   array to list
reloaded = arr_rel.tolist()

print(id(original), original)
#= 4870536 [1, 2, 3, 4, 5]
print(id(reloaded), reloaded)
#= 4871048 [1, 2, 3]

Notes: Convert to arrays for read write

The array module provides a handy way to write homogeneous numeric data to a file as binary encoded bytes. Arrays only handle numbers, and every element must match the single type code of the array, either an integer type or a float type. No mixing is allowed. The uniform set of numbers is written as bytes to a binary file, and read back into an array with the same type code. The module also provides a quick way to convert back and forth to lists.
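
The example above reads back only three elements. One possible way to read the whole file is to work out the element count from the file size and the item size of the type code, as sketched below. The temporary folder is an assumption for this sketch, and the last lines show that a float is rejected by an integer type code.

```python
import array
import os
import tempfile

original = [1, 2, 3, 4, 5]
arr_org = array.array('h', original)

#   hypothetical file location in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'numbers.byt')
with open(path, 'wb') as f:
    arr_org.tofile(f)

#   number of stored items = file size in bytes / bytes per item
arr_rel = array.array('h')
count = os.path.getsize(path) // arr_rel.itemsize
with open(path, 'rb') as f:
    arr_rel.fromfile(f, count)

reloaded = arr_rel.tolist()
print(count, reloaded)
#= 5 [1, 2, 3, 4, 5]

#   a float does not fit an integer type code
rejected = False
try:
    array.array('h', [1, 2.5])
except TypeError as err:
    rejected = True
    print('rejected:', err)
```

The file stores only the raw bytes, not the type code, so the reader has to know that the data was written with 'h'.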

Text write for flat list

Writing data to comma separated text file.

Py3: Basic csv

separator = ','
original = [1, 2.2, '3', 'abc']

#   save data as text
with open('linear.csv', 'wt') as f:
    f.write(separator.join([str(i) for i in original]))

#   load data as text
with open('linear.csv', 'rt') as f:
    reloaded = f.read().split(separator)

print(original)
#= [1, 2.2, '3', 'abc']
print(reloaded)
#= ['1', '2.2', '3', 'abc']

Notes: Single column of mixed data

A linear list of numbers and strings is already serial, so it can be written out directly. Nested structures cannot be written out correctly this way. The separator should be chosen so that it does not occur in the data. The elements are converted to text and joined with the separator to form one long string, which is stored as text. The reloaded text is split using the same separator, so every reloaded element is a string. The strings can be parsed back into the mixed types, with added checks for each element.
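
The per-element checks mentioned above could look something like the sketch below. The helper name parse_item is hypothetical and not part of any library. It tries int first, then float, and keeps the text when both fail.

```python
def parse_item(text):
    """Hypothetical helper: try int, then float, else keep the string."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text

#   strings as reloaded from the text file
reloaded = ['1', '2.2', '3', 'abc']

parsed = [parse_item(item) for item in reloaded]
print(parsed)
#= [1, 2.2, 3, 'abc']
```

The third element comes back as the integer 3, although it was the string '3' in the original list. Plain text does not record the original type, so this kind of parsing is a best guess.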

To csv for multicolumn lists

Multiple columns stored side by side, without headers.

Py3: Two column save to csv

sep = ','
col1 = [1, 2, 3, 4]
col2 = [1.1, 2.2, 3.3, 4.4]

#   pair data columns and write out each text row
#   need to add new line after each line
with open('test.csv', 'wt') as f:
    for row in zip(col1, col2):
        f.write(sep.join(map(str,row)))
        f.write('\n')

#   read lines, strip off newline
with open('test.csv', 'rt') as f:
    reloaded = [row.strip() for row in f]

#   alternate multiline write and read
#   ----------------------------------
#   lines = [sep.join(map(str,row))+'\n' for row in zip(col1, col2)]
#   with open('test.csv', 'wt') as f:
#       f.writelines(lines)
#
#   with open('test.csv', 'rt') as f:
#       reloaded = [line.strip() for line in f.readlines()]


#   reloaded data is a list of strings
print(reloaded)
#= ['1,1.1', '2,2.2', '3,3.3', '4,4.4']

#   split each item into component sequences
r_col1, r_col2 = list(zip(*[item.split(sep) for item in reloaded]))
print(r_col1)       #= ('1', '2', '3', '4')
print(r_col2)       #= ('1.1', '2.2', '3.3', '4.4')
#   tuple of text, can be parsed as int or float into list

Notes: Multiple column csv

Multiple columns first need to be paired with each other to create rows of data. Each row is joined using the data separator to form a string, which is written out as a text line to file, followed by a newline character. The data read back is a list of text lines, which needs to be split into individual sequences for each column.
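
As an alternative to joining and splitting by hand, the same steps can be sketched with the standard library csv module, which is a swapped-in technique and not the method used above. The temporary folder is an assumption for this sketch. The last step parses the text columns back into int and float lists.

```python
import csv
import os
import tempfile

col1 = [1, 2, 3, 4]
col2 = [1.1, 2.2, 3.3, 4.4]

#   hypothetical file location in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'test.csv')

#   newline='' lets the csv module manage line endings itself
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(zip(col1, col2))

#   each row is read back as a list of strings
with open(path, 'r', newline='') as f:
    rows = list(csv.reader(f))

#   unzip rows into columns, then parse each column to its number type
t_col1, t_col2 = zip(*rows)
r_col1 = [int(x) for x in t_col1]
r_col2 = [float(x) for x in t_col2]

print(r_col1)   #= [1, 2, 3, 4]
print(r_col2)   #= [1.1, 2.2, 3.3, 4.4]
```

The csv module also quotes fields that contain the separator, which the manual join does not do.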

To csv with pandas

Efficient and simple csv read and write with Pandas.

Py3: Pandas dataframe

import pandas as pd

col1 = [1, 2, 3, 4]
col2 = [1.1, 2.2, 3.3, 4.4]

#   convert to pandas dataframe and write to csv
df = pd.DataFrame(zip(col1, col2), columns=['col1','col2'])
df.to_csv('pandatext.csv', sep=',', index=False, header=True)

#   read from csv
df_r = pd.read_csv('pandatext.csv')

#   convert columns
r_col1 = df_r['col1'].values.tolist()
r_col2 = df_r['col2'].values.tolist()

print(r_col1)   #= [1, 2, 3, 4]
print(r_col2)   #= [1.1, 2.2, 3.3, 4.4]

Notes: Handling data formatting with Pandas read_csv

The Pandas module has a built-in csv writer and reader that can be customized for header, separator and indexing. On reading, Pandas also infers the best data type for each column. So writing data is simply a matter of generating a dataframe object with multiple columns of data and writing it out as a text file. The generated file can be easily read back, with Pandas automatically determining the data types and header. The individual columns can be converted back to lists, the whole process being simple and efficient.
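
To see the type inference at work, the sketch below writes the two columns to an in-memory text buffer instead of a file (an assumption made to keep the example self-contained), prints the types that read_csv infers, and then overrides one of them with the dtype parameter.

```python
import io
import pandas as pd

col1 = [1, 2, 3, 4]
col2 = [1.1, 2.2, 3.3, 4.4]
df = pd.DataFrame({'col1': col1, 'col2': col2})

#   write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
text = buffer.getvalue()

#   types inferred by read_csv for each column
df_auto = pd.read_csv(io.StringIO(text))
inferred = df_auto.dtypes.astype(str).to_dict()
print(inferred)
#= {'col1': 'int64', 'col2': 'float64'}

#   dtype overrides the inferred type for the named column
df_float = pd.read_csv(io.StringIO(text), dtype={'col1': 'float64'})
as_float = df_float['col1'].tolist()
print(as_float)
#= [1.0, 2.0, 3.0, 4.0]
```

Passing dtype explicitly can be useful when the inferred type is not the one wanted, for example when a column of whole numbers should be read as floats.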
