I use an API to periodically download crypto OHLCV data across multiple timeframes. But the API has a bug: it downloads the previous timeframe's data along with the current one, which introduces duplicate rows into the data.
Duplicate data is not uncommon, and removing it is one of the most ubiquitous data cleaning tasks. Duplicates distort the whole analysis process and more often than not lead to highly erroneous results. Fortunately, the languages commonly used in data analysis, such as Python and SQL, offer multiple ways to remove them.
Removing exact duplicates is relatively easy. However, data analysts may encounter subtler kinds of duplicates, as in the following case:
Here, the API creates duplicated data: there are two rows for the same 4h OHLCV candle, one of which needs to be removed; otherwise the algorithms produce misleading results. One row is the actual 4h bar, while the other was produced during the following timeframe's download. We assume that the number of trades is invariably higher in the genuine 4h bar and build our logic on this assumption.
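To make the problem concrete, here is a minimal sketch of what such a duplicate pair might look like; the symbol, timestamps and figures are hypothetical, and the column names simply mirror the ones used later in this post:

```python
import pandas as pd

# Hypothetical sample: the same 4h candle appears twice for BTCUSDT.
# The first row is the genuine 4h bar (many trades); the second was
# pulled in again by the buggy download and has far fewer trades.
df = pd.DataFrame({
    "coin":       ["BTCUSDT", "BTCUSDT"],
    "timestamp":  ["2023-05-01 00:00:00", "2023-05-01 00:00:00"],
    "close_time": ["2023-05-01 03:59:59", "2023-05-01 03:59:59"],
    "trades":     [48210, 1530],
})
print(df)
```

Both rows describe the same candle, so only the one with the higher trade count should survive.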
1. The SQL Method:
Previously, I was using the cx_Oracle library to send the data to an Oracle database and then removing the duplicates with an SQL query:
Using SQL to remove duplicates but keep the row with the highest column value
SELECT *
  FROM (SELECT TIMESTAMP,
               TO_DATE(SUBSTR(TIMESTAMP, 1, 19), 'YYYY-MM-DD hh24:mi:ss')
                   + INTERVAL '3' HOUR AS hourly,
               MAX(trades) OVER (PARTITION BY coin, TIMESTAMP) AS max_trades,
               VOLUME, TRADES, OPEN, HIGH, LOW, CLOSE, COIN
          FROM ADMIN.ALL1H
         ORDER BY TIMESTAMP DESC)
 WHERE max_trades = TRADES
Within each group of duplicates, this query keeps the row with the maximum number of trades and discards the rest; it does not matter whether there are two duplicates or more. It relies on Oracle's max() analytic function, partitioned by coin and timestamp.
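The same keep-the-max-trades logic can be demonstrated without an Oracle instance. Below is a minimal sketch using Python's built-in sqlite3 module (SQLite supports window functions since version 3.25); the table and sample values are illustrative, not my real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE all1h (
        coin TEXT, timestamp TEXT, trades INTEGER, close REAL
    )
""")
conn.executemany(
    "INSERT INTO all1h VALUES (?, ?, ?, ?)",
    [
        ("BTCUSDT", "2023-05-01 00:00:00", 48210, 29150.0),  # genuine 4h bar
        ("BTCUSDT", "2023-05-01 00:00:00", 1530, 29150.0),   # duplicate
        ("ETHUSDT", "2023-05-01 00:00:00", 20415, 1890.5),   # no duplicate
    ],
)

# Keep, per (coin, timestamp), only the row with the highest trade count.
rows = conn.execute("""
    SELECT coin, timestamp, trades, close FROM (
        SELECT coin, timestamp, trades, close,
               MAX(trades) OVER (PARTITION BY coin, timestamp) AS max_trades
        FROM all1h
    )
    WHERE trades = max_trades
    ORDER BY coin
""").fetchall()
print(rows)
```

The duplicate BTCUSDT row with 1,530 trades is filtered out, while the unique ETHUSDT row passes through untouched.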
2. The Python Method:
The Python method is more concise.
df = (
    df.sort_values(["coin", "timestamp", "trades"], ascending=True)
      .drop_duplicates(subset=["close_time", "coin"], keep="last")
)
We simply sort the rows by symbol name, timestamp and number of trades in ascending order, and then keep only the last row of each duplicate group, i.e. the one with the most trades, dropping the other rows that share the same close time and crypto symbol.
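Under the assumptions above (duplicates share coin and close_time, and the genuine bar has the higher trade count; sample values are hypothetical), the one-liner behaves like this:

```python
import pandas as pd

df = pd.DataFrame({
    "coin":       ["BTCUSDT", "BTCUSDT", "ETHUSDT"],
    "timestamp":  ["2023-05-01 00:00:00"] * 3,
    "close_time": ["2023-05-01 03:59:59"] * 3,
    "trades":     [1530, 48210, 20415],  # the genuine BTCUSDT bar has the most trades
})

# Sort so the highest-trades duplicate ends up last within its group,
# then keep only that last row per (close_time, coin) pair.
deduped = (
    df.sort_values(["coin", "timestamp", "trades"], ascending=True)
      .drop_duplicates(subset=["close_time", "coin"], keep="last")
)
print(deduped[["coin", "trades"]].to_string(index=False))
```

The low-trades BTCUSDT duplicate is dropped; the unique ETHUSDT row is kept as-is.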
The Pythonic way is simpler and faster, and it is what I currently use.