Introduction
In this post we compare various methods of parsing (date)time strings, and of localizing and converting timezones, in pandas DataFrames. We will be first and foremost interested in performance.
The study
Let us consider a particular dataset, the futures dataset from FirstRate Data. This dataset comes in the form of zip files, which need to be unpacked. We will use the Python zipfile module to that end:
import os
import time
import zipfile
import pandas as pd
data_dir = '/home/paubil/dev/datafeeds/firstrate_data/data/2024-05-08'
unadjusted_file_name = 'futures_full_1min_continuous_UNadjusted_qtw90ph.zip'
adjusted_file_name = 'futures_full_1min_continuous_adjusted_kzvua9e.zip'
ticker = 'CL'
print('Reading the data from a zipped file...')
start_time = time.time()
with zipfile.ZipFile(os.path.join(data_dir, adjusted_file_name)) as azf:
data_file_name = f'{ticker}_1min_continuous_adjusted.txt'
with azf.open(data_file_name) as f:
adjusted_df = pd.read_csv(f, header=None)
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
So far so good. It takes about 4 seconds on my machine to read in adjusted_df, which has approximately five million rows.
We then need to rename the columns:
adjusted_df.columns = ['time', 'open', 'high', 'low', 'close', 'volume']
This takes negligible time. Next, we need to parse the times, which are strings of the form “%Y-%m-%d %H:%M:%S”, into datetimes. We have several options:
import datetime as dt
import numpy as np
import time
print('Converting strings to datetimes: apply datetime.strptime...')
start_time = time.time()
parsed_times_apply = adjusted_df['time'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
We could also try wrapping the parser in np.vectorize (note that the vectorized function is still invoked through apply, one element at a time):
print('Converting strings to datetimes: np.vectorize and apply datetime.strptime...')
start_time = time.time()
parse = np.vectorize(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
parsed_times_vectorize_apply = adjusted_df['time'].apply(parse)
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
Finally, since the strings are in the standard form, we could use pd.to_datetime:
print('Converting strings to datetimes: pd.to_datetime...')
start_time = time.time()
parsed_times_to_datetime = pd.to_datetime(adjusted_df['time'])
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
Let’s look at the timings:
Converting strings to datetimes: apply datetime.strptime...
Took 24.86 seconds
Converting strings to datetimes: np.vectorize and apply datetime.strptime...
Took 123.86 seconds
Converting strings to datetimes: pd.to_datetime...
Took 0.96 seconds
We conclude from this that we should be using pd.to_datetime to parse (date)time strings whenever we can.
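When the format is known in advance, as it is here, we can also pass it to pd.to_datetime explicitly, which skips format inference and can be faster still on large Series. A minimal sketch on a synthetic two-element sample standing in for adjusted_df['time'] (actual timings will vary by machine and pandas version):

```python
import pandas as pd

# A tiny synthetic sample; the real column has ~5.6 million such strings.
times = pd.Series(['2008-01-01 18:00:00', '2008-01-01 18:01:00'])

# An explicit format spares pandas the work of inferring it.
parsed = pd.to_datetime(times, format='%Y-%m-%d %H:%M:%S')
print(parsed.dtype)  # datetime64[ns]
```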
We can now set
adjusted_df['time'] = parsed_times_to_datetime
In fact, we can set the index to these (parsed) times,
adjusted_df.index = adjusted_df['time']
and delete the “time” column, which we no longer need since the parsed times now live in the index:
adjusted_df.drop(columns=['time'], inplace=True)
All these operations take negligible time.
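The assign-index-then-drop-column steps above can also be collapsed into a single set_index call, which moves the column into the index and removes it from the columns in one go. A sketch on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2008-01-01 18:00:00', '2008-01-01 18:01:00']),
    'close': [152.55, 152.65],
})

# set_index moves 'time' into the index and drops the column in one step.
df = df.set_index('time')
print(df.index.name)  # time
```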
But now we need to localize the times as America/New_York (which incorporates the New York daylight saving time for relevant dates). Here we again have several choices:
import pendulum
import pytz
print('Localizing the times in index as America/New_York using pytz...')
start_time = time.time()
tz = pytz.timezone('America/New_York')
localized_index_pytz = [tz.localize(x) for x in adjusted_df.index]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
print('Localizing the times in index as America/New_York using pendulum...')
start_time = time.time()
tz = pendulum.timezone('America/New_York')
localized_index_pendulum = [pendulum.instance(x, tz) for x in adjusted_df.index]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
print('Localizing the times in index as America/New_York natively...')
start_time = time.time()
localized_index_native = adjusted_df.index.tz_localize('America/New_York')
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
It turns out that localizing with pytz takes 665 seconds, with pendulum 24 seconds, and the native method takes only about 1.28 seconds.
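One caveat with tz_localize on a long intraday history: wall-clock times around DST transitions can be ambiguous (they occur twice when clocks fall back) or nonexistent (skipped when clocks spring forward), and pandas exposes the ambiguous and nonexistent parameters to control how such times are resolved. A sketch using a date not taken from the dataset:

```python
import pandas as pd

# 2021-11-07 01:30 occurs twice in New York: clocks fall back at 02:00.
idx = pd.DatetimeIndex(['2021-11-07 01:30:00', '2021-11-07 01:30:00'])

# ambiguous accepts a boolean array: True means "treat as DST"
# (the first occurrence, EDT); False means the second (EST).
localized = idx.tz_localize('America/New_York', ambiguous=[True, False])
print(localized)
```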
We then need to convert these times to UTC. Again, here are the choices:
print('Converting the times in index to UTC using pytz...')
start_time = time.time()
tz = pytz.utc
converted_index_pytz = [x.astimezone(tz) for x in localized_index_pytz]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
print('Converting the times in index to UTC using pendulum...')
start_time = time.time()
tz = pendulum.timezone('UTC')
converted_index_pendulum = [tz.convert(x) for x in localized_index_pendulum]
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
print('Converting the times in index to UTC natively...')
start_time = time.time()
converted_index_native = localized_index_native.tz_convert('UTC')
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
The pytz method took about 7.18 seconds, the pendulum method 54.73 seconds, whereas the native method took negligible time (close to 0 seconds).
Let’s do some verification:
print('converted_index_native[0]:', converted_index_native[0])
print('converted_index_pytz[0]:', converted_index_pytz[0])
print('converted_index_pendulum[0]:', converted_index_pendulum[0])
This prints out
converted_index_native[0]: 2008-01-01 23:00:00+00:00
converted_index_pytz[0]: 2008-01-01 23:00:00+00:00
converted_index_pendulum[0]: 2008-01-01 23:00:00+00:00
as expected.
Further,
assert np.all(converted_index_native == converted_index_pytz)
assert np.all(converted_index_native == converted_index_pendulum)
adjusted_df.index = converted_index_native
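Putting the native pieces together, the entire parse → localize → convert pipeline reduces to one short vectorized chain. A sketch on synthetic strings standing in for adjusted_df['time']:

```python
import pandas as pd

raw = pd.Series(['2008-01-01 18:00:00', '2008-01-01 18:01:00'])

# Parse the strings, attach the New York zone, then convert to UTC,
# all using the fast native pandas methods.
utc_index = pd.DatetimeIndex(
    pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S')
).tz_localize('America/New_York').tz_convert('UTC')
print(utc_index[0])  # 2008-01-01 23:00:00+00:00 (18:00 EST is UTC-5)
```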
There is one more thing that we need to do. The bars in FirstRate Data are timestamped with the start of the bar, not the end. To keep our data analysis causal, so that a bar's timestamp never precedes the information the bar contains, we re-timestamp each bar with its end time:
print('Shifting the times in index into the future by 1 minute...')
start_time = time.time()
adjusted_df.index = adjusted_df.index + pd.Timedelta(minutes=1)
end_time = time.time()
print(f'Took {end_time - start_time:.02f} seconds')
print('DataFrame length (row count):', len(adjusted_df))
print('DataFrame head:')
print(adjusted_df.head())
We end up with
Shifting the times in index into the future by 1 minute...
Took 0.03 seconds
DataFrame length (row count): 5644533
DataFrame head:
open high low close volume
time
2008-01-01 23:01:00+00:00 152.52 152.61 152.52 152.55 14
2008-01-01 23:02:00+00:00 152.57 152.66 152.56 152.65 41
2008-01-01 23:03:00+00:00 152.66 152.77 152.66 152.73 27
2008-01-01 23:04:00+00:00 152.68 152.68 152.68 152.68 1
2008-01-01 23:05:00+00:00 152.67 152.67 152.60 152.60 40
Conclusion
In conclusion, if performance matters:
- prefer pd.to_datetime when parsing (date)time strings;
- prefer df.index.tz_localize('America/New_York') when localizing (date)times instead of using pytz or pendulum directly;
- prefer df.index.tz_localize('America/New_York').tz_convert('UTC') when converting timezones instead of using pytz or pendulum directly.
Appendix: System information
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
CPU family: 6
Model: 158
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 10
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
BogoMIPS: 6399.96
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pd
pe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 moni
tor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_a
d fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dth
erm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 192 KiB (6 instances)
L1i: 192 KiB (6 instances)
L2: 1.5 MiB (6 instances)
L3: 12 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerabilities:
Gather data sampling: Mitigation; Microcode
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Mitigation; Microcode
Tsx async abort: Mitigation; TSX disabled
$ sudo dmidecode --type 17
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
Handle 0x003A, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0039
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: ChannelA-DIMM0
Bank Locator: BANK 0
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 904B159F
Asset Tag: 9876543210
Part Number: M378A2K43CB1-CRC
Rank: 2
Configured Memory Speed: 2400 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V
Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0039
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: ChannelA-DIMM1
Bank Locator: BANK 1
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 904B1A34
Asset Tag: 9876543210
Part Number: M378A2K43CB1-CRC
Rank: 2
Configured Memory Speed: 2400 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V
Handle 0x003C, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0039
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: ChannelB-DIMM0
Bank Locator: BANK 2
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 904B159C
Asset Tag: 9876543210
Part Number: M378A2K43CB1-CRC
Rank: 2
Configured Memory Speed: 2400 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V
Handle 0x003D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0039
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: ChannelB-DIMM1
Bank Locator: BANK 3
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 904B19F9
Asset Tag: 9876543210
Part Number: M378A2K43CB1-CRC
Rank: 2
Configured Memory Speed: 2400 MT/s
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V