When preparing the data for a machine learning model, the best approach is to separate the dataset into two distinct parts from the beginning: a training set, used to fit the model, and a testing set, used to evaluate it.
The testing set is passed to the trained model as if it were data that the model has never seen before, and we measure the model's performance on it. This step is known as testing the ML model. The testing set is also called held-out data, to emphasize that it must not be touched until the end of the process, when it is used to check that the model works.
It's up to you to define the proportion of the dataset that you want to allocate to each part. A typical split is 80% for the training set and 20% for the testing set.
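
An 80/20 split can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `split_indices`, the toy `samples` list, and the fixed seed are hypothetical and do not come from the text. The data is shuffled before splitting so that the testing set is not just the first or last rows of the dataset.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) for a shuffled split.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    # The first n_test shuffled indices are held out; the rest are for training.
    return indices[n_test:], indices[:n_test]


# Toy dataset of 10 samples (made up for this example).
samples = [f"sample_{i}" for i in range(10)]

train_idx, test_idx = split_indices(len(samples), test_fraction=0.2)
train_set = [samples[i] for i in train_idx]
test_set = [samples[i] for i in test_idx]  # held out until final evaluation

print(len(train_set), len(test_set))  # prints "8 2"
```

In practice you would more likely rely on a library routine, for example `train_test_split` from scikit-learn's `sklearn.model_selection` module with `test_size=0.2`, rather than a hand-written helper like this one.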