Announcement
Data Warehousing & Data Mining
Seeing something through to the end takes persistence.
薛峰 (Xue Feng)
2009-02-03
[Data Warehousing] IBM DB2 Study Notes (online resources)
Posted by 薛峰 on 2005/6/30 8:53:42
[IBM DB2 Study Notes, Part 1]
[by 彭建军 (Peng Jianjun)]
Note: syntax and concepts that are the same in IBM DB2 as in MS SQL Server 2000 are not repeated here.
I. [DB2 SQL Overview]
1. [Schemas]
1.1 A schema is a named collection of objects such as tables and views. Schemas provide a logical classification of the objects in a database.
1.2 A schema is created implicitly whenever an object is created in the database; a schema can also be created explicitly with CREATE SCHEMA.
1.3 When naming objects, note that an object name has two parts, schema.object-name, e.g. pjj.TempTable1. If the schema is not specified explicitly, the default schema (the ID of the current user) is used.
2. [Data Types]
Fixed-length character string: CHAR(x), with x ranging from 1 to 254; a sequence of bytes.
Variable-length character string: VARCHAR(x)
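A minimal sketch of the two points above. The schema name pjj and the table name pjj.TempTable1 come from the notes themselves; the column definitions are hypothetical, added only to show CHAR/VARCHAR and schema-qualified names in use.

-- Create the schema explicitly (it would otherwise be created implicitly
-- the first time an object is created under it).
CREATE SCHEMA pjj;

-- Two-part name: schema.object. CHAR(x) is fixed length (1 <= x <= 254),
-- VARCHAR(x) is variable length.
CREATE TABLE pjj.TempTable1 (
    id      INTEGER     NOT NULL,
    code    CHAR(10),
    remarks VARCHAR(200)
);

-- Without an explicit schema, DB2 resolves the name against the default
-- schema (by default, the authorization ID of the current user).
SELECT id, code FROM TempTable1;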
[Data Mining] Applying Data Mining to Telecom Fraud Detection (essay, reflections)
Posted by 薛峰 on 2005/6/28 10:03:29
Abstract: Fraud is a conspicuous problem in the telecom sector. This article studies how data mining can be applied to telecom fraud detection and validates the approach on real data from a mobile operator. Detection is carried out through the steps of business understanding, data understanding, data preparation, model building, and model deployment. The model-building stage uses the Kohonen neural network, a self-organizing clustering algorithm.
Keywords: data mining; fraud detection; Kohonen algorithm; CRISP-DM
1 Introduction
With the rapid growth of mobile services, the mobile communications industry's revenue keeps rising, but fraud on mobile networks has emerged along with it. The industry worldwide faces a serious wireless fraud problem: operators lose revenue and incur extra expense, so profits fall, while subscribers' legitimate interests are harmed and operators' reputations suffer.
Wireless fraud can be roughly divided into four categories: (1) airtime fraud, i.e. using mobile airtime without paying for it, which itself splits into technical fraud (cloned handsets, "magic phones", and the like) and subscriber fraud (roaming abuse, abuse of supplementary services, and good-faith fraud); (2) internal fraud, where operator staff abuse their positions for illegal gain; (3) handset fraud, …
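The abstract only names the clustering technique; as a purely illustrative sketch (not the operator data or the model from the article), the fragment below clusters a made-up feature matrix with a Kohonen self-organizing map, assuming MATLAB's selforgmap (Deep Learning Toolbox) is available.

% Illustrative only: hypothetical usage features, 4 features x 520 records
X = [rand(4,500), 5*rand(4,20)];   % last 20 columns are deliberately extreme
net = selforgmap([5 5]);           % 5x5 Kohonen map, i.e. 25 clusters
net = train(net, X);               % unsupervised (self-organizing) training
cluster_id = vec2ind(net(X));      % winning neuron (cluster index) per record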
MATLAB Implementations of Data Mining Algorithms: ID3 (essay, reading notes)
Posted by 薛峰 on 2005/6/27 14:22:13
function D = ID3(train_features, train_targets, params, region)
% Classify using Quinlan's ID3 algorithm
% Inputs:
%   train_features - Train features
%   train_targets  - Train targets
%   params         - [Number of bins for the data, Percentage of incorrectly assigned samples at a node]
%   region         - Decision region vector: [-x x -y y number_of_points]
% Outputs:
%   D              - Decision surface

[Ni, M] = size(train_features);

% Get parameters
[Nbins, inc_node] = process_params(params);
inc_node = inc_node*M/100;

% For the decision region
N      = region(5);
mx     = ones(N,1) * linspace(region(1), region(2), N);
my     = linspace(region(3), region(4), N)' * ones(1,N);
flatxy = [mx(:), my(:)]';

% Preprocessing
[f, t, UW, m]  = PCA(train_features, train_targets, Ni, region);
train_features = UW * (train_features - m*ones(1,M));
flatxy         = UW * (flatxy - m*ones(1,N^2));

% First, bin the data and the decision region data
[H, binned_features] = high_histogram(train_features, Nbins, region);
[H, binned_xy]       = high_histogram(flatxy, Nbins, region);

% Build the tree recursively
disp('Building tree')
tree = make_tree(binned_features, train_targets, inc_node, Nbins);

% Make the decision region according to the tree
disp('Building decision surface using the tree')
targets = use_tree(binned_xy, 1:N^2, tree, Nbins, unique(train_targets));

D = reshape(targets, N, N);
% END
function targets = use_tree(features, indices, tree, Nbins, Uc)
% Classify recursively using a tree

targets = zeros(1, size(features,2));

if (size(features,1) == 1),
    % Only one dimension left, so work on it
    for i = 1:Nbins,
        in = indices(find(features(indices) == i));
        if ~isempty(in),
            if isfinite(tree.child(i)),
                targets(in) = tree.child(i);
            else
                % No data was found in the training set for this bin, so choose it randomly
                n = 1 + floor(rand(1)*length(Uc));
                targets(in) = Uc(n);
            end
        end
    end
    return
end

% This is not the last level of the tree, so:
% First, find the dimension we are to work on
dim  = tree.split_dim;
dims = find(~ismember(1:size(features,1), dim));

% And classify according to it
for i = 1:Nbins,
    in = indices(find(features(dim, indices) == i));
    targets = targets + use_tree(features(dims, :), in, tree.child(i), Nbins, Uc);
end
% END use_tree
function tree = make_tree(features, targets, inc_node, Nbins)
% Build a tree recursively

[Ni, L] = size(features);
Uc = unique(targets);

% When to stop: if the dimension is one or the number of examples is small
if ((Ni == 1) | (inc_node > L)),
    % Compute the children non-recursively
    for i = 1:Nbins,
        tree.split_dim = 0;
        indices = find(features == i);
        if ~isempty(indices),
            if (length(unique(targets(indices))) == 1),
                tree.child(i) = targets(indices(1));
            else
                H = hist(targets(indices), Uc);
                [m, T] = max(H);
                tree.child(i) = Uc(T);
            end
        else
            tree.child(i) = inf;
        end
    end
    return
end

% Compute the node's impurity I
for i = 1:length(Uc),
    Pnode(i) = length(find(targets == Uc(i))) / L;
end
Inode = -sum(Pnode.*log(Pnode)/log(2));

% For each dimension, compute the gain ratio impurity
delta_Ib = zeros(1, Ni);
P = zeros(length(Uc), Nbins);
for i = 1:Ni,
    for j = 1:length(Uc),
        for k = 1:Nbins,
            indices = find((targets == Uc(j)) & (features(i,:) == k));
            P(j,k)  = length(indices);
        end
    end
    Pk = sum(P);
    P  = P/L;
    Pk = Pk/sum(Pk);
    info = sum(-P.*log(eps+P)/log(2));
    delta_Ib(i) = (Inode - sum(Pk.*info)) / (-sum(Pk.*log(eps+Pk)/log(2)));
end

% Find the dimension maximizing delta_Ib
[m, dim] = max(delta_Ib);

% Split along the 'dim' dimension
tree.split_dim = dim;
dims = find(~ismember(1:Ni, dim));
for i = 1:Nbins,
    indices = find(features(dim, :) == i);
    tree.child(i) = make_tree(features(dims, indices), targets(indices), inc_node, Nbins);
end
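A hypothetical call, for orientation only: it uses toy data and assumes the companion helper functions referenced above (process_params, PCA, high_histogram) are on the MATLAB path.

% Hypothetical usage sketch with toy 2-D data
train_features = [randn(2,50), randn(2,50)+2];   % two Gaussian clusters
train_targets  = [zeros(1,50), ones(1,50)];      % class labels 0 and 1
params = [10, 5];                  % 10 bins per feature, 5% node impurity allowed
region = [-4 6 -4 6 100];          % [-x x -y y number_of_points]
D = ID3(train_features, train_targets, params, region);
imagesc(D); axis xy;               % visualize the decision surface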
[Data Mining] MATLAB Implementations of Data Mining Algorithms: C4_5 (online resources, essay)
Posted by 薛峰 on 2005/6/27 14:21:09
function D = C4_5(train_features, train_targets, inc_node, region)
% Classify using Quinlan's C4.5 algorithm
% Inputs:
%   train_features - Train features
%   train_targets  - Train targets
%   inc_node       - Percentage of incorrectly assigned samples at a node
%   region         - Decision region vector: [-x x -y y number_of_points]
% Outputs:
%   D              - Decision surface
%
% NOTE: In this implementation it is assumed that a feature vector with fewer
% than 10 unique values (the parameter Nu) is discrete, and will be treated as
% such. Other vectors will be treated as continuous.

[Ni, M]  = size(train_features);
inc_node = inc_node*M/100;
Nu       = 10;

% For the decision region
N      = region(5);
mx     = ones(N,1) * linspace(region(1), region(2), N);
my     = linspace(region(3), region(4), N)' * ones(1,N);
flatxy = [mx(:), my(:)]';

% Preprocessing (disabled)
%[f, t, UW, m]  = PCA(train_features, train_targets, Ni, region);
%train_features = UW * (train_features - m*ones(1,M));
%flatxy         = UW * (flatxy - m*ones(1,N^2));

% Find which of the input features are discrete, and discretize the
% corresponding dimension on the decision region
discrete_dim = zeros(1,Ni);
for i = 1:Ni,
    Nb = length(unique(train_features(i,:)));
    if (Nb <= Nu),
        % This is a discrete feature
        discrete_dim(i) = Nb;
        [H, flatxy(i,:)] = high_histogram(flatxy(i,:), Nb);
    end
end

% Build the tree recursively
disp('Building tree')
tree = make_tree(train_features, train_targets, inc_node, discrete_dim, max(discrete_dim), 0);

% Make the decision region according to the tree
disp('Building decision surface using the tree')
targets = use_tree(flatxy, 1:N^2, tree, discrete_dim, unique(train_targets));

D = reshape(targets, N, N);
% END
function targets = use_tree(features, indices, tree, discrete_dim, Uc)
% Classify recursively using a tree

targets = zeros(1, size(features,2));

if (tree.dim == 0)
    % Reached the end of the tree
    targets(indices) = tree.child;
    return
end

% This is not the last level of the tree, so:
% First, find the dimension we are to work on
dim  = tree.dim;
dims = 1:size(features,1);

% And classify according to it
if (discrete_dim(dim) == 0),
    % Continuous feature
    in = indices(find(features(dim, indices) <= tree.split_loc));
    targets = targets + use_tree(features(dims, :), in, tree.child(1), discrete_dim(dims), Uc);
    in = indices(find(features(dim, indices) >  tree.split_loc));
    targets = targets + use_tree(features(dims, :), in, tree.child(2), discrete_dim(dims), Uc);
else
    % Discrete feature
    Uf = unique(features(dim,:));
    for i = 1:length(Uf),
        in = indices(find(features(dim, indices) == Uf(i)));
        targets = targets + use_tree(features(dims, :), in, tree.child(i), discrete_dim(dims), Uc);
    end
end
% END use_tree
function tree = make_tree(features, targets, inc_node, discrete_dim, maxNbin, base)
% Build a tree recursively

[Ni, L] = size(features);
Uc      = unique(targets);
tree.dim = 0;
%tree.child(1:maxNbin) = zeros(1,maxNbin);
tree.split_loc = inf;

if isempty(features),
    return
end

% When to stop: if the dimension is one or the number of examples is small
if ((inc_node > L) | (L == 1) | (length(Uc) == 1)),
    H = hist(targets, length(Uc));
    [m, largest] = max(H);
    tree.child = Uc(largest);
    return
end

% Compute the node's impurity I
for i = 1:length(Uc),
    Pnode(i) = length(find(targets == Uc(i))) / L;
end
Inode = -sum(Pnode.*log(Pnode)/log(2));

% For each dimension, compute the gain ratio impurity
% This is done separately for discrete and continuous features
delta_Ib  = zeros(1, Ni);
split_loc =
(2,838 more characters follow; the post is truncated here)
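Since the post is cut off before make_tree is complete, the following is only a hypothetical sketch of the intended interface, using toy data; it would require the full C4_5.m to run.

% Hypothetical usage sketch with toy 2-D data
train_features = [randn(2,50), randn(2,50)+2];
train_targets  = [zeros(1,50), ones(1,50)];
inc_node = 5;                      % allow 5% incorrectly assigned samples at a node
region   = [-4 6 -4 6 100];        % [-x x -y y number_of_points]
D = C4_5(train_features, train_targets, inc_node, region);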
[Data Mining] Market Segmentation: A Key to Business Success (essay, reading notes, reflections)
Posted by 薛峰 on 2005/6/27 14:13:21
Through market research, a firm must divide the overall consumer market into a number of sub-markets whose members share broadly similar characteristics, based on consumers' differing wants and needs for a product and their differing buying behaviours and habits.
The Important Role of Market Segmentation
1. It helps a firm identify its target market. Whether the target market is chosen correctly directly determines the firm's subsequent development strategy and the "starting conditions" for its growth in the years that follow, so a firm must seek an ideal target market on the basis of thorough market segmentation. For example, 江门市时尚冷气公司 (Jiangmen Shishang Air-Conditioning Company) in Jiangmen, Guangdong, a company now specializing in air-conditioning products, was originally a general supply-and-marketing trading company dealing mainly in local produce and general merchandise. Because those goods carried little added value and the many competing sellers made margins ever thinner, the company's results deteriorated year by year and the business risked becoming unsustainable. In 1988 its management, after a careful analysis of the company's internal and external environment, decided to seize the opportunity presented by the then severe shortage of air conditioners: base the business in Jiangmen's urban area while reaching out to the prosperous Pearl River Delta, exploit the traditional supply-channel strengths of a supply-and-marketing enterprise to secure goods, and use the proximity to Hong Kong and Macau to deal directly in imported brand-name air conditioners, cutting out intermediaries and lowering distribution costs. Because the business direction was right and the target market well chosen, the company fully captured the target market's…
(4,580 more characters follow; the post is truncated here)