Introduction

In this post we will consider the effect of various methods of parsing (date)time strings, and localizing and converting timezones in pandas DataFrames. We will be first and foremost interested in performance.

The study

Let us consider a particular dataset, the futures dataset from FirstRate Data. This dataset comes in the form of zip files, which need to be unpacked. We will use the Python zipfile model to that end:

import os
import zipfile

import pandas as pd

data_dir = '/home/paubil/dev/datafeeds/firstrate_data/data/2024-05-08'

unadjusted_file_name = 'futures_full_1min_continuous_UNadjusted_qtw90ph.zip'
adjusted_file_name = 'futures_full_1min_continuous_adjusted_kzvua9e.zip'

ticker = 'CL'


print('Reading the data from a zipped file...')
start_time = time.time()
with zipfile.ZipFile(os.path.join(data_dir, adjusted_file_name)) as azf:
    data_file_name = f'{ticker}_1min_continuous_adjusted.txt'
    with azf.open(data_file_name) as f:
        adjusted_df = pd.read_csv(f, header=None)
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

So far so good. It takes about 4 seconds on my machine to read in adjusted_df, which has approximately five million rows.

We then need to rename the columns:

adjusted_df.columns = ['time', 'open', 'high', 'low', 'close', 'volume']

This takes negligible time. Next, we need to parse the times, which are strings of the form “%Y-%m-%d %H:%M:%d” to datetimes. We have several options:

import datetime as dt
import time
print('Converting strings to datetimes: apply datetime.strptime...')
start_time = time.time()
parsed_times_apply = adjusted_df['time'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

We could also try applying np.vectorize:

print('Converting strings to datetimes: np.vectorize and apply datetime.strptime...')
start_time = time.time()
parse = np.vectorize(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
parsed_times_vectorize_apply = adjusted_df['time'].apply(parse)
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

Finally, since the strings are in the standard form, we could use pd.to_datetime:

print('Converting strings to datetimes: pd.to_datetime...')
start_time = time.time()
parsed_times_to_datetime = pd.to_datetime(adjusted_df['time'])
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

Let’s look at the timings:

Converting strings to datetimes: apply datetime.strptime...
Took 24.86 seconds
Converting strings to datetimes: np.vectorize and apply datetime.strptime...
Took 123.86 seconds
Converting strings to datetimes: pd.to_datetime...
Took 0.96 seconds

We conclude from this that we should be using pd.to_datetime to parse (date)time strings whenever we can.

We can now set

adjusted_df['time'] = parsed_times_to_datetime

In fact, we can set the index to these (parsed) times,

adjusted_df.index = adjusted_df['time']

and delete the “time” column; we no longer needed since now the parsed times are in the index:

adjusted_df.drop(columns=['time'], inplace=True)

All these operations take negligible time.

But now we need to localize the times as America/New_York (which incorporates the New York daylight saving time for relevant dates). Here we again have several choices:

import pendulum
import pytz

print('Localizing the times in index as America/New_York using pytz...')
start_time = time.time()
tz = pytz.timezone('America/New_York')
localized_index_pytz = [tz.localize(x) for x in adjusted_df.index]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

print('Localizing the times in index as America/New_York using pendulum...')
start_time = time.time()
tz = pendulum.timezone('America/New_York')
localized_index_pendulum = [pendulum.instance(x, tz) for x in adjusted_df.index]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

print('Localizing the times in index as America/New_York natively...')
start_time = time.time()
localized_index_native = adjusted_df.index.tz_localize('America/New_York')
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

It turns out that localizing with pytz takes 665 seconds, with pendulum 24 seconds, and the native method takes only about 1.28 seconds.

We then need to convert these times to UTC. Again, here are the choices:

print('Converting the times in index to UTC using pytz...')
start_time = time.time()
tz = pytz.utc
converted_index_pytz = [x.astimezone(tz) for x in localized_index_pytz]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

print('Converting the times in index to UTC using pendulum...')
start_time = time.time()
tz = pendulum.timezone('UTC')
converted_index_pendulum = [tz.convert(x) for x in localized_index_pendulum]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

print('Converting the times in index to UTC natively...')
start_time = time.time()
converted_index_native = localized_index_native.tz_convert('UTC')
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

The pytz method has taken about 7.18 seconds, the pendulum method 54.73 seconds, whereas the native method took a negligible amount of time (close to 0 seconds).

Let’s do some verification:

print('converted_index_native[0]:', converted_index_native[0])
print('converted_index_pytz[0]:', converted_index_pytz[0])
print('converted_index_pendulum[0]:', converted_index_pendulum[0])

This prints out

converted_index_native[0]: 2008-01-01 23:00:00+00:00
converted_index_pytz[0]: 2008-01-01 23:00:00+00:00
converted_index_pendulum[0]: 2008-01-01 23:00:00+00:00

as expected.

Further,

assert np.all(converted_index_native == converted_index_pytz)
assert np.all(converted_index_native == converted_index_pendulum)

adjusted_df.index = converted_index_native

There is one more thing that we need to do. The bars in FirstRate Data are timestamped with the start of the bar, not the end of the bar. We would like to be causal in our data analysis, and so we preventively timestamp the bars with the ends of the bars:

print('Shifting the times in index into the future by 1 minute...')
start_time = time.time()
adjusted_df.index = adjusted_df.index + pd.Timedelta(minutes=1)
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')

print('DataFrame length (row count):', len(adjusted_df))
print('DataFrame head:')
print(adjusted_df.head())

We end up with

Shifting the times in index into the future by 1 minute...
Took 0.03 seconds
DataFrame length (row count): 5644533
DataFrame head:
                             open    high     low   close  volume
time
2008-01-01 23:01:00+00:00  152.52  152.61  152.52  152.55      14
2008-01-01 23:02:00+00:00  152.57  152.66  152.56  152.65      41
2008-01-01 23:03:00+00:00  152.66  152.77  152.66  152.73      27
2008-01-01 23:04:00+00:00  152.68  152.68  152.68  152.68       1
2008-01-01 23:05:00+00:00  152.67  152.67  152.60  152.60      40

Conclusion

In conclusion, if performance matters:

  • prefer pd.to_datetime when parsing (date)time strings;
  • prefer df.index.tz_localize(‘America/New_York’) when localizing (date)times instead of using pytz or pendulum directly;
  • prefer df.index.tz_localize(‘America/New_York’).tz_convert(‘UTC’) when converting timezones instead of using pytz or pendulum directly.

Appendix: System information

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            10
    CPU max MHz:         4600.0000
    CPU min MHz:         800.0000
    BogoMIPS:            6399.96
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pd
                         pe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 moni
                         tor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
                          rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_a
                         d fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dth
                         erm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    12 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled

$ sudo dmidecode --type 17
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

Handle 0x003A, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0039
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 904B159F
        Asset Tag: 9876543210
        Part Number: M378A2K43CB1-CRC
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: 1.2 V

Handle 0x003B, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0039
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: ChannelA-DIMM1
        Bank Locator: BANK 1
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 904B1A34
        Asset Tag: 9876543210
        Part Number: M378A2K43CB1-CRC
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: 1.2 V

Handle 0x003C, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0039
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: ChannelB-DIMM0
        Bank Locator: BANK 2
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 904B159C
        Asset Tag: 9876543210
        Part Number: M378A2K43CB1-CRC
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: 1.2 V

Handle 0x003D, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0039
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16 GB
        Form Factor: DIMM
        Set: None
        Locator: ChannelB-DIMM1
        Bank Locator: BANK 3
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 2400 MT/s
        Manufacturer: Samsung
        Serial Number: 904B19F9
        Asset Tag: 9876543210
        Part Number: M378A2K43CB1-CRC
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: Unknown
        Maximum Voltage: Unknown
        Configured Voltage: 1.2 V

Leave a Reply

Your email address will not be published. Required fields are marked *