Prolog for information science – Emir’s weblog
I reveal a broadly relevant sample which integrates Prolog as a essential
part in an information science evaluation. Analytic strategies are used to generate
properties in regards to the information underneath examine and Prolog is used to motive in regards to the
information by way of the generated properties. The publish consists of some examples of piece-wise
regression on timeseries information by symbolic reasoning. I additionally focus on the overall
sample of software a bit.
Given some information, the majority of “information science” for me is the examine of what the info
implies and whether or not it may be coerced into the context particular position normally
determined by somebody aside from me. Depending on the context, the aim could also be to
discover help for a speculation, to forecast, to foretell a latent variable, to
clarify a phenomena, and so forth. Nearly all of the worth I carry rests in
taking as enter the info and the position it’s anticipated to play, and producing as
output a computationally tractable downside the answer to which allows that
position for the info.
In my very own apply – at the very least in its mature kind – I discover myself principally relying
on reasoning with easy strategies akin to Linear Regression, kNN, kMeans and PCA.
I believe that is principally as a result of I can perceive what the outcomes suggest for the
“form” of the info, in the meantime reasoning in regards to the outcomes does the heavy lifting.
Regardless of the significance of reasoning in my work, I by no means made it express, nor did
I consciously separate it from its material. Once I turned conscious of this,
it appeared to me that I ought to make my reasoning express at the very least in order that I might
(1) motive extra precisely, and (2) immediately study to motive higher.
It’s simpler to reveal what it means to make reasoning express and the
impact it has on the construction of an evaluation. What follows are a few
illustrative examples. I take advantage of SWI Prolog because the symbolic reasoning engine and
logic language, and Python for every thing else.
I’ll begin with a contrived instance meant for instance the separation between
the reasoning factor and the remaining. Beneath (black line) is a few mock 2D information
constructed by becoming a member of collectively 5 randomly chosen traces right into a collection. The
x-axis is simply an index which runs from 0-999. The colored segments are a
piecewise linear becoming derived symbolically utilizing Prolog. I’ll subsequent describe
how.
Say I have no idea something in regards to the development of the info collection above besides
that it’s made up of traces and the goal of the evaluation is to seek out all of the linear
sections. My normal scheme can be to make use of Python to create symbols which
signify details in regards to the information, after which to make use of Prolog to motive about these
symbols. One broadly relevant manner of producing symbols is to partition the
information at random alongside the explanatory variables (the (x) on this case) and match
linear regressions to every partition section. That is then repeated many instances
as a way to seize views of the info throughout many partitions. The result’s tons
of overlapping segments for which I can file info akin to:
- The place the section begins and ends.
- The becoming parameters.
- Correlation with the info.
Given some information (X,Y), the code beneath will produce rounds of random partitions
(4-10) with segments at the very least 20 models large, till there are at the very least (N)
qualifying (completely linear) segments.
import numpy as np
from sklearn.linear_model import LinearRegression
i = 0
np.random.seed(10)
whereas i < N:
knots = make_knots(
np.random.randint(4,10),
20,
np.min(X),
np.max(X))
for begin, cease in zip(knots[:-1], knots[1:]):
cond = X[(X>=start) & (X<=stop)]
x, y = X[cond].reshape(-1,1), Y[cond]
m = LinearRegression().match(x,y)
p = m.predict(x)
i += 1
coef = np.spherical(m.coef_[0],4)
rating = m.rating(x,y)
if rating ==1.0:
print(f"section(i{i},{begin},{cease},{coef},{rating}).")
Word that the code prints out strings within the following format:
section(Id,Begin,Finish,Gradient,Rating).
I copy/paste these immediately into Prolog for demonstration functions, however there
are much less handbook methods.
The easy process above produces loads of symbols to motive about. For
instance, segments could also be separate, overlapping, subsuming, adjoining, and so forth.
Gradients could also be equal, comparable, opposing, and so forth. For these issues,
I’ll solely want to make use of a fraction of the data obtainable. To supply the
segmentation illustrated within the graph above, it is sufficient to word that if
segments are completely linear they usually overlap, then they should be the identical line.
I can use this assertion as a manner of merging segments collectively as a way to create
larger steady linear spans. In Prolog I might specific this as follows:
overlap(I1,I2) :-
section(I1,Start1,End1,_,_),
section(I2,Start2,End2,_,_),
Start2 > Start1,
End1 > Start2,
End2 > End1.
:- desk attain/2.
attain(I1,I2) :-
overlap(I1,I2).
attain(I1,I2) :-
overlap(I1,X),
attain(X,I2).
That’s to say, section (I1) overlaps with section (I2) if it begins exterior of
(I2) however ends within it. A section (I2) will be “reached” from section (I1)
iff there’s a daisy chain of overlapping segments connecting them. Primarily based on
the relationships above, the next question would produce all reachable spans
(daisy chains of segments):
?- setof((I1,I2),attain(I1,I2),Spans).
Prolog digression: overlap is deliberately non-commutative to forestall
backtracking from looping. I take advantage of
tabling on the attain
predicate as a result of SLD decision may be very inefficient on this case.
To calculate an optimum piece-wise linear becoming, I want to seek out the subset of
all of the non-overlapping spans generated by attain which outcome within the maximal
protection of the info. This may be so simple as sorting the spans by size in
descending order and including them to an inventory while ensuring new additions do
not overlap with the outdated ones. In Prolog:
use_module(library(apply)).
use_module(library(pairs)).
span_overlap((I1,I2),(I3,I4)) :-
section(I1,Start1,_,_,_),
section(I2,_,End1,_,_),
section(I3,Start2,_,_,_),
section(I4,_,End2,_,_),
((End1 >= Start2, End2 >= End1)
;(End2 >= Start1, End1 >= End2)
;(Start1 >= Start2, End2 >= Start1)
;(Start2 >= Start1, End1 >= Start2)).
not_span_overlap((I1,I2),(I3,I4)) :-
+ span_overlap((I1,I2),(I3,I4)).
span_length((I1,I2),R) :-
section(I1,Begin,_,_,_),
section(I2,_,Finish,_,_),
R is Begin-Finish.
span_set([],R,R).
span_set([(X,Y)|T],A,R) :-
maplist(not_span_overlap((X,Y)),A),
(X=Y;attain(X,Y)),!,
span_set(T,[(X,Y)|A],R).
span_set([_|T],A,R) :-
span_set(T,A,R).
spans([],R,R).
spans([(I1,I2)|T],A,R):-
section(I1,Begin,_,_,_),
section(I2,_,Finish,_,_),
spans(T,[(Start,End)|A],R).
doable(X,Y) :-
section(X,Start1,_,_,_),
section(Y,Start2,_,_,_),
Start2 >= Start1.
find_cover(SortSpan) :-
findall((X,Y),doable(X,Y),Potential),
map_list_to_pairs(span_length,Potential,Pairs),
keysort(Pairs,Sorted),
pairs_values(Sorted,BySpan),
span_set(BySpan,[],SpanById),
spans(SpanById,[],Span),
kind(Span,SortSpan).
The span_set predicate does the vital work, while the find_cover
predicate is the entry level. I might have used attain immediately, nevertheless it
massively over generates as a result of there are normally some ways of getting
between two segments. It seems to be rather more environment friendly to record out
the doable spans and take a look at whether or not they’re reachable immediately as above.
Lastly, I can simply question for the quilt:
?- find_cover(Cowl).
which outputs the segmentation within the plot above:
Cowl = [
(0, 110), (112, 201), (204, 336),
(343, 647), (653, 758), (760, 998)].
The above instance is contrived and simple to resolve by extra direct means however
hopefully it made the method pretty clear. On this instance I apply a lot the
identical technique however this time to carry out a piece-wise isotonic becoming of the
FTSE 100 inventory index. The graph beneath reveals the final 5 years of the FTSE-100
index at weekly intervals. The segmentation was created by reasoning about
random segments simply as within the instance above, besides an Isotonic Regression
was fitted as an alternative of a Linear Regression.
The next info was recorded:
section(Id,Begin,Cease,Course,Match)
I made the next minimal adjustment to the Prolog code:
same_direction(I1,I2) :-
section(I1,_,_,G1,_),
section(I2,_,_,G2,_),
G1 = G2.
:- desk attain/2.
attain(I1,I2) :-
same_direction(I1,I2),
overlap(I1,I2).
attain(I1,I2) :-
same_direction(I1,X),
same_direction(X,I2),
overlap(I1,X),
attain(X,I2).
I’ve added the same_direction predicate which requires that the route of
the isotonic regression is identical in overlapping segments for them to be
reachable. I’ve additionally relaxed the random partitioning situation to a correlation
of (ge 0.9). I can now question for the quilt as earlier than:
?- find_cover(Cowl).
which outputs the segmentation within the plot above:
Cowl = [(1, 22), (23, 88), (97, 260)].